WebSockets Demystified - Part 4: Production and AWS
Series: Index | Part 1 | Part 2 | Part 3 | Part 4 | Part 5 | Part 6
Table of Contents
- The Horizontal Scaling Problem
- Redis Pub/Sub Broker Relay
- AWS Architecture Overview
- ALB WebSocket Configuration
- ECS Fargate Deployment
- AWS ElastiCache Redis Configuration
- Graceful Shutdown
- Health Checks and Monitoring
- CloudWatch Alerts and Dashboards
- Production Tips and Hardening
1. The Horizontal Scaling Problem
Why Horizontal Scaling Is Difficult
HTTP servers are stateless. Any server can answer any request. A load balancer can freely route requests round-robin.
WebSocket servers are stateful. Each server holds open TCP connections. Client A's session lives entirely on Server 1. If you route Client A's next message to Server 2, it has no idea who Client A is.
Without coordination:
Client A ----[connected to Server 1]
Client B ----[connected to Server 2]
Client A sends message intended for Client B:
Server 1 receives the message.
Server 1 cannot deliver to Client B because Client B's connection is on Server 2.
Message is lost.
With Redis Pub/Sub:
Server 1 receives message from Client A.
Server 1 publishes to Redis channel "room.general".
Server 2 is subscribed to "room.general".
Server 2 receives message from Redis.
Server 2 delivers it to Client B.
Message delivered!
Sticky Sessions (Partial Solution)
A simpler but incomplete solution is sticky sessions (also called session affinity): the load balancer always routes a client to the same server instance.
ALB with sticky sessions:
Client A --[stickiness cookie: srv1]--> always goes to Server 1
Client B --[stickiness cookie: srv2]--> always goes to Server 2
This ensures that a client's WebSocket connection is always handled by one server.
Problems with sticky sessions alone:
1. If Server 1 dies, all its connected clients lose their connection.
2. Load is not evenly distributed (Server 1 might get all active users).
3. Server-side events (scheduled jobs, DB triggers) cannot be routed to the right server.
Example: an order status update in the DB needs to be pushed to a user
on Server 1, but the scheduler ran on Server 2.
Sticky sessions are a fallback.
Redis Pub/Sub is the correct scalable solution.
2. Redis Pub/Sub Broker Relay
Spring's Built-in Redis Broker Relay
Spring WebSocket includes a StompBrokerRelay that connects to an external message broker (like Redis, RabbitMQ, or ActiveMQ). Messages go through the external broker, enabling cross-server delivery.
However, for pure Redis Pub/Sub integration (without a full STOMP broker like RabbitMQ), the cleanest approach is to publish messages to Redis from Spring and subscribe on all server instances.
Custom Redis Relay Implementation
// src/main/java/com/example/ws/relay/RedisMessageRelay.java
package com.example.ws.relay;
import com.example.ws.dto.RelayMessage;
import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.data.redis.connection.Message;
import org.springframework.data.redis.connection.MessageListener;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.messaging.simp.SimpMessagingTemplate;
import org.springframework.stereotype.Component;
/**
* Relays messages across multiple WebSocket server instances via Redis Pub/Sub.
*
* HOW IT WORKS:
* When Server 1 needs to broadcast to /topic/room.123:
* 1. It publishes to Redis channel "ws:topic:room.123"
* 2. All server instances (including Server 1) are subscribed to this channel
* 3. All servers receive the message from Redis
* 4. All servers deliver it to their locally connected clients subscribed to /topic/room.123
*
* This ensures every server delivers the message to its own connected clients.
* The originating server also delivers it locally (its own clients also get it).
*/
@Component
@RequiredArgsConstructor
@Slf4j
public class RedisMessageRelay implements MessageListener {
private final SimpMessagingTemplate messagingTemplate;
private final ObjectMapper objectMapper;
private final StringRedisTemplate redisTemplate;
// Channel prefix in Redis
public static final String CHANNEL_PREFIX = "ws:";
/**
* Publishes a message to Redis so all server instances receive it.
* Called before sending to SimpMessagingTemplate directly.
*/
public void publish(String destination, Object payload) {
try {
RelayMessage relayMessage = new RelayMessage();
relayMessage.setDestination(destination);
relayMessage.setPayload(objectMapper.writeValueAsString(payload));
relayMessage.setPayloadType(payload.getClass().getName());
String json = objectMapper.writeValueAsString(relayMessage);
String channel = CHANNEL_PREFIX + destination;
redisTemplate.convertAndSend(channel, json);
} catch (Exception e) {
log.error("Failed to publish message to Redis: {}", e.getMessage());
}
}
/**
* Publishes a user-specific message to Redis.
*/
public void publishToUser(String userId, String queueDestination, Object payload) {
try {
RelayMessage relayMessage = new RelayMessage();
relayMessage.setDestination(queueDestination);
relayMessage.setUserId(userId);
relayMessage.setPayload(objectMapper.writeValueAsString(payload));
relayMessage.setPayloadType(payload.getClass().getName());
String json = objectMapper.writeValueAsString(relayMessage);
// Use user-specific channel so we can be selective
String channel = CHANNEL_PREFIX + "user." + userId + queueDestination;
redisTemplate.convertAndSend(channel, json);
} catch (Exception e) {
log.error("Failed to publish user message to Redis: {}", e.getMessage());
}
}
/**
* This method is called when THIS server instance receives a message from Redis.
* It then delivers it to locally connected clients.
*/
@Override
public void onMessage(Message message, byte[] pattern) {
try {
String json = new String(message.getBody());
RelayMessage relayMessage = objectMapper.readValue(json, RelayMessage.class);
if (relayMessage.getUserId() != null) {
// User-specific message
messagingTemplate.convertAndSendToUser(
relayMessage.getUserId(),
relayMessage.getDestination(),
relayMessage.getPayload()
);
} else {
// Topic broadcast
messagingTemplate.convertAndSend(
relayMessage.getDestination(),
relayMessage.getPayload()
);
}
} catch (Exception e) {
log.error("Failed to process Redis relay message: {}", e.getMessage());
}
}
}Redis Message Listener Configuration
// src/main/java/com/example/ws/config/RedisConfig.java
package com.example.ws.config;
import com.example.ws.relay.RedisMessageRelay;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.listener.PatternTopic;
import org.springframework.data.redis.listener.RedisMessageListenerContainer;
import org.springframework.data.redis.listener.adapter.MessageListenerAdapter;
@Configuration
public class RedisConfig {
/**
* Creates a Redis message listener container that subscribes to
* all WebSocket relay channels matching the pattern "ws:*".
*
* When any message is published to a "ws:..." channel,
* the RedisMessageRelay.onMessage() method is called.
*/
@Bean
public RedisMessageListenerContainer redisListenerContainer(
RedisConnectionFactory connectionFactory,
RedisMessageRelay relay) {
RedisMessageListenerContainer container = new RedisMessageListenerContainer();
container.setConnectionFactory(connectionFactory);
// Subscribe to ALL WebSocket relay channels using a pattern
// This server instance will receive all messages published to ws:* channels
container.addMessageListener(
new MessageListenerAdapter(relay),
new PatternTopic(RedisMessageRelay.CHANNEL_PREFIX + "*")
);
return container;
}
}Using the Relay in Services
// In ChatService - use relay instead of directly using messagingTemplate for broadcasts
@Service
@RequiredArgsConstructor
@Slf4j
public class ChatService {
private final RedisMessageRelay relay;
private final SimpMessagingTemplate messagingTemplate; // Still needed for local ops
// ...
@Transactional
public void processAndBroadcast(String userId, String roomId, ChatMessageRequest request) {
// Save to DB
ChatMessage saved = messageRepository.save(buildMessage(userId, roomId, request));
ChatMessageResponse response = buildResponse(saved, userId);
// Publish via Redis relay - all server instances will deliver locally
relay.publish("/topic/room." + roomId, response);
}
public void sendPrivateMessage(String targetUserId, ChatMessageResponse response) {
// For user-specific messages, relay handles routing to correct server
relay.publishToUser(targetUserId, "/queue/private", response);
}
}Relay Message DTO
// src/main/java/com/example/ws/dto/RelayMessage.java
package com.example.ws.dto;
import lombok.Data;
@Data
public class RelayMessage {
private String destination;
private String userId; // null for topic broadcasts, set for user-specific messages
private String payload; // JSON serialized payload
private String payloadType;
private long timestamp = System.currentTimeMillis();
private String originServerId; // For deduplication if needed
}3. AWS Architecture Overview
Production Architecture Diagram
Internet
|
+------------------------+
| AWS ALB (HTTPS/WSS) |
| - SSL Termination |
| - Stickiness Cookie |
| - WS Upgrade Pass |
+------------------------+
| |
+-------------+ +-------------+
| |
+-----------------+ +-----------------+
| ECS Task | | ECS Task |
| WebSocket |<--- Redis Pub/Sub --->| WebSocket |
| Spring Boot | | | Spring Boot |
| Instance 1 | | | Instance 2 |
+-----------------+ | +-----------------+
| | |
| +---------------------------+ |
+--> | AWS ElastiCache Redis | <-------+
| (Cluster Mode Disabled) |
| Primary + Replica |
+---------------------------+
|
+---------------------------+
| Amazon RDS MySQL 8 |
| Multi-AZ, read replicas |
+---------------------------+
|
+---------------------------+
| Amazon CloudWatch |
| - WS connection count |
| - Messages per second |
| - Error rate |
+---------------------------+
Component Responsibilities
| Component | Role |
|---|---|
| ALB | SSL termination, WebSocket upgrade pass-through, sticky sessions, health checks |
| ECS Fargate | Runs Spring Boot WebSocket containers, auto-scaling based on connection count |
| ElastiCache Redis | Cross-instance message relay, presence tracking, session storage |
| RDS MySQL | Persistent message storage, user data, room metadata |
| CloudWatch | Metrics, alarms, dashboards for operational visibility |
4. ALB WebSocket Configuration
Key ALB Settings for WebSocket
ALB Target Group Settings for WebSocket:
Protocol: HTTP (ALB terminates SSL, forwards HTTP to containers)
Port: 8080
Health Check Path: /actuator/health
Health Check Protocol: HTTP
Stickiness: Enabled
Stickiness Type: lb_cookie
Stickiness Duration: 86400 seconds (24 hours)
Why stickiness even with Redis?
- WebSocket upgrade requests: The initial HTTP upgrade must go to the same server
where the WebSocket session will be maintained.
- After connection is established, the TCP connection is pinned to the server anyway.
- Stickiness ensures reconnection attempts also go back to the same server.
- Redis handles the case where the server dies and the client reconnects to a new server.
ALB Timeout Configuration
ALB Idle Timeout: 3600 seconds (1 hour)
DEFAULT ALB IDLE TIMEOUT IS 60 SECONDS.
This is a critical production gotcha.
WebSocket connections are idle during periods of low traffic (heartbeats only).
If ALB idle timeout < heartbeat interval, ALB will kill WebSocket connections.
Set idle timeout to at least 1 hour for WebSocket workloads.
Your heartbeat interval (25 seconds) keeps the connection active.
Terraform Configuration for ALB
# infrastructure/alb.tf
resource "aws_lb_listener" "websocket_https" {
load_balancer_arn = aws_lb.main.arn
port = "443"
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-TLS13-1-2-2021-06"
certificate_arn = aws_acm_certificate.main.arn
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.websocket.arn
}
}
resource "aws_lb_target_group" "websocket" {
name = "websocket-tg"
port = 8080
protocol = "HTTP"
vpc_id = aws_vpc.main.id
target_type = "ip" # Required for ECS Fargate
health_check {
path = "/actuator/health"
healthy_threshold = 2
unhealthy_threshold = 3
timeout = 5
interval = 15
matcher = "200"
}
stickiness {
type = "lb_cookie"
cookie_duration = 86400
enabled = true
}
# Critical: Increase idle timeout for WebSocket connections
# Default is 60 seconds which will kill WebSocket connections
lifecycle {
create_before_destroy = true
}
}
resource "aws_lb" "main" {
name = "websocket-alb"
internal = false
load_balancer_type = "application"
security_groups = [aws_security_group.alb.id]
subnets = aws_subnet.public[*].id
# CRITICAL: Set idle timeout high for WebSocket workloads
idle_timeout = 3600
access_logs {
bucket = aws_s3_bucket.alb_logs.id
enabled = true
}
}5. ECS Fargate Deployment
ECS Task Definition
// ecs-task-definition.json
{
"family": "websocket-service",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "1024",
"memory": "2048",
"executionRoleArn": "arn:aws:iam::ACCOUNT_ID:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::ACCOUNT_ID:role/websocketTaskRole",
"containerDefinitions": [
{
"name": "websocket-service",
"image": "ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/websocket-service:latest",
"portMappings": [
{
"containerPort": 8080,
"protocol": "tcp"
}
],
"environment": [
{
"name": "SPRING_PROFILES_ACTIVE",
"value": "production"
},
{
"name": "JAVA_OPTS",
"value": "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xms512m -Xmx1536m -XX:+ExitOnOutOfMemoryError"
}
],
"secrets": [
{
"name": "DB_USERNAME",
"valueFrom": "arn:aws:secretsmanager:us-east-1:ACCOUNT_ID:secret:websocket/db-username"
},
{
"name": "DB_PASSWORD",
"valueFrom": "arn:aws:secretsmanager:us-east-1:ACCOUNT_ID:secret:websocket/db-password"
},
{
"name": "REDIS_HOST",
"valueFrom": "arn:aws:ssm:us-east-1:ACCOUNT_ID:parameter/websocket/redis-host"
},
{
"name": "JWT_SECRET",
"valueFrom": "arn:aws:secretsmanager:us-east-1:ACCOUNT_ID:secret:websocket/jwt-secret"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/websocket-service",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
},
"healthCheck": {
"command": [
"CMD-SHELL",
"curl -f http://localhost:8080/actuator/health || exit 1"
],
"interval": 15,
"timeout": 5,
"retries": 3,
"startPeriod": 60
},
"ulimits": [
{
"name": "nofile",
"softLimit": 65536,
"hardLimit": 65536
}
]
}
]
}Auto-Scaling Configuration
# infrastructure/autoscaling.tf
# Scale on active WebSocket connections (custom CloudWatch metric)
resource "aws_appautoscaling_policy" "websocket_connections" {
name = "websocket-connection-scaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.websocket.resource_id
scalable_dimension = aws_appautoscaling_target.websocket.scalable_dimension
service_namespace = aws_appautoscaling_target.websocket.service_namespace
target_tracking_scaling_policy_configuration {
# Target: keep average active connections per task below 5000
# When average exceeds 5000, scale out. When below 3500, scale in.
target_value = 5000
customized_metric_specification {
metric_name = "websocket.connections.active"
namespace = "WebSocketService"
statistic = "Average"
}
scale_in_cooldown = 300 # 5 minutes - be slow to scale in (don't drop connections)
scale_out_cooldown = 60 # 1 minute - be quick to scale out
}
}
resource "aws_appautoscaling_target" "websocket" {
max_capacity = 20
min_capacity = 2
resource_id = "service/main-cluster/websocket-service"
scalable_dimension = "ecs:service:DesiredCount"
service_namespace = "ecs"
}Dockerfile
# Dockerfile
FROM eclipse-temurin:17-jre-alpine AS runner
# Create non-root user for security
RUN addgroup -S wsapp && adduser -S wsapp -G wsapp
WORKDIR /app
# Copy the fat jar
COPY target/websocket-service-*.jar app.jar
# Set ownership
RUN chown wsapp:wsapp /app/app.jar
USER wsapp
# JVM options for container awareness
# UseContainerSupport ensures JVM respects container memory limits
ENV JAVA_OPTS="-XX:+UseContainerSupport \
-XX:MaxRAMPercentage=75.0 \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=200 \
-XX:+ExitOnOutOfMemoryError \
-Djava.security.egd=file:/dev/./urandom"
EXPOSE 8080
ENTRYPOINT ["sh", "-c", "java $JAVA_OPTS -jar /app/app.jar"]6. AWS ElastiCache Redis Configuration
ElastiCache Setup for WebSocket
# infrastructure/elasticache.tf
# Use Redis (non-cluster mode) for WebSocket pub/sub
# Cluster mode complicates pub/sub channel routing
resource "aws_elasticache_replication_group" "websocket" {
replication_group_id = "websocket-redis"
description = "Redis for WebSocket pub/sub and presence"
engine = "redis"
engine_version = "7.0"
node_type = "cache.r6g.large" # 13 GB RAM, good for pub/sub
# Primary + 1 replica for HA
num_cache_clusters = 2
# Multi-AZ for high availability
multi_az_enabled = true
automatic_failover_enabled = true
# Encryption at rest and in transit
at_rest_encryption_enabled = true
transit_encryption_enabled = true
subnet_group_name = aws_elasticache_subnet_group.main.name
security_group_ids = [aws_security_group.redis.id]
# Maintenance window
maintenance_window = "sun:05:00-sun:06:00"
# Backup
snapshot_retention_limit = 7
snapshot_window = "03:00-04:00"
# Parameter group for optimizing pub/sub
parameter_group_name = aws_elasticache_parameter_group.websocket.name
}
resource "aws_elasticache_parameter_group" "websocket" {
name = "websocket-redis-params"
family = "redis7"
# Tune for pub/sub workloads
parameter {
name = "hz"
value = "20" # Increase server cron frequency (default 10)
}
parameter {
name = "tcp-keepalive"
value = "60"
}
parameter {
name = "timeout"
value = "300" # Close idle connections after 5 minutes
}
# Pub/sub notification settings
parameter {
name = "notify-keyspace-events"
value = "" # Disable keyspace events unless needed (saves CPU)
}
}Spring Redis Configuration for Production
# application-production.yml
spring:
data:
redis:
host: ${REDIS_HOST}
port: 6379
ssl:
enabled: true
timeout: 2000ms
lettuce:
pool:
max-active: 20 # Max connections in the pool
max-idle: 10 # Max idle connections
min-idle: 5 # Min idle connections kept alive
max-wait: 2000ms # Max wait for a connection from pool
shutdown-timeout: 100ms
cluster:
refresh:
adaptive: true # Dynamically refresh cluster topology
period: 60s7. Graceful Shutdown
Why Graceful Shutdown Matters
When ECS updates a task (new deployment), it sends SIGTERM to the running container. Without graceful shutdown:
- All WebSocket connections are immediately dropped.
- Clients receive an abrupt disconnect.
- Clients attempt reconnection - thousands simultaneously, causing a reconnection thundering herd.
- Messages in flight are lost.
With graceful shutdown:
- New connections stop being accepted.
- Existing connections are notified to reconnect.
- Connections are given time to drain.
- Service shuts down cleanly.
Graceful Shutdown Implementation
// src/main/java/com/example/ws/lifecycle/GracefulShutdownHandler.java
package com.example.ws.lifecycle;
import com.example.ws.dto.ServerShutdownNotice;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.context.ApplicationListener;
import org.springframework.context.event.ContextClosingEvent;
import org.springframework.messaging.simp.SimpMessagingTemplate;
import org.springframework.stereotype.Component;
import java.util.concurrent.TimeUnit;
/**
* Handles graceful shutdown of WebSocket connections.
*
* Steps:
* 1. Broadcast shutdown notice to all connected clients (they should reconnect elsewhere)
* 2. Wait for in-flight messages to complete
* 3. Allow time for clients to reconnect to other servers
*/
@Component
@RequiredArgsConstructor
@Slf4j
public class GracefulShutdownHandler implements ApplicationListener<ContextClosingEvent> {
private final SimpMessagingTemplate messagingTemplate;
@Override
public void onApplicationEvent(ContextClosingEvent event) {
log.info("Initiating graceful WebSocket shutdown...");
try {
// 1. Notify all connected clients that this server is shutting down
// Clients should interpret this as a signal to reconnect
ServerShutdownNotice notice = new ServerShutdownNotice(
"Server maintenance. Please reconnect in a few seconds.",
30 // Reconnect in 30 seconds to allow other servers to be ready
);
messagingTemplate.convertAndSend("/topic/server.events", notice);
// 2. Wait for message to be delivered to clients
TimeUnit.SECONDS.sleep(2);
// 3. Wait for active requests to drain
// Spring's actual graceful shutdown (server.shutdown=graceful) handles this
// at the HTTP layer. We add extra time here for WebSocket draining.
TimeUnit.SECONDS.sleep(5);
log.info("Graceful WebSocket shutdown complete.");
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
log.warn("Graceful shutdown interrupted");
}
}
}Application Properties for Graceful Shutdown
# application.yml
server:
shutdown: graceful # Wait for in-flight requests before shutting down
spring:
lifecycle:
timeout-per-shutdown-phase: 30s # Allow up to 30 seconds for shutdownEntrypoint with Pre-Stop Hook (ECS)
In ECS task definitions, configure a stopTimeout to align with Spring's graceful shutdown:
// In ECS task definition
{
"stopTimeout": 60
}8. Health Checks and Monitoring
Custom WebSocket Health Indicator
// src/main/java/com/example/ws/health/WebSocketHealthIndicator.java
package com.example.ws.health;
import com.example.ws.listener.WebSocketEventListener;
import lombok.RequiredArgsConstructor;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;
@Component("websocket")
@RequiredArgsConstructor
public class WebSocketHealthIndicator implements HealthIndicator {
private final StringRedisTemplate redisTemplate;
@Override
public Health health() {
try {
// Check Redis connectivity (critical for pub/sub)
String pong = redisTemplate.getConnectionFactory()
.getConnection()
.ping();
if (!"PONG".equals(pong)) {
return Health.down()
.withDetail("redis", "Unexpected ping response: " + pong)
.build();
}
return Health.up()
.withDetail("redis", "Connected")
.build();
} catch (Exception e) {
return Health.down()
.withDetail("error", e.getMessage())
.build();
}
}
}Metrics Configuration with Micrometer
// src/main/java/com/example/ws/metrics/WebSocketMetrics.java
package com.example.ws.metrics;
import io.micrometer.core.instrument.*;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
@Component
@Slf4j
public class WebSocketMetrics {
private final AtomicInteger activeConnections = new AtomicInteger(0);
private final AtomicLong totalMessagesReceived = new AtomicLong(0);
private final AtomicLong totalMessagesSent = new AtomicLong(0);
private final AtomicInteger errorCount = new AtomicInteger(0);
private final Timer messageProcessingTimer;
private final Counter connectionErrorCounter;
public WebSocketMetrics(MeterRegistry registry) {
// Gauge - reports current value
Gauge.builder("websocket.connections.active", activeConnections, AtomicInteger::get)
.description("Active WebSocket connections")
.register(registry);
Gauge.builder("websocket.messages.received.total", totalMessagesReceived, AtomicLong::get)
.description("Total WebSocket messages received since startup")
.register(registry);
Gauge.builder("websocket.messages.sent.total", totalMessagesSent, AtomicLong::get)
.description("Total WebSocket messages sent since startup")
.register(registry);
// Timer - tracks latency distribution
messageProcessingTimer = Timer.builder("websocket.message.processing.time")
.description("Time to process and broadcast a WebSocket message")
.publishPercentiles(0.5, 0.95, 0.99)
.register(registry);
// Counter - increments
connectionErrorCounter = Counter.builder("websocket.connection.errors")
.description("WebSocket connection errors")
.register(registry);
}
public void connectionOpened() { activeConnections.incrementAndGet(); }
public void connectionClosed() { activeConnections.decrementAndGet(); }
public void messageReceived() { totalMessagesReceived.incrementAndGet(); }
public void messageSent() { totalMessagesSent.incrementAndGet(); }
public void connectionError() { connectionErrorCounter.increment(); }
public Timer.Sample startMessageTimer() { return Timer.start(); }
public void stopMessageTimer(Timer.Sample sample) { sample.stop(messageProcessingTimer); }
}9. CloudWatch Alerts and Dashboards
CloudWatch Alarms (Terraform)
# infrastructure/cloudwatch_alarms.tf
# Alert when active connections exceed 80% of max capacity
resource "aws_cloudwatch_metric_alarm" "high_connections" {
alarm_name = "websocket-high-connections"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "3"
metric_name = "websocket.connections.active"
namespace = "WebSocketService"
period = "60"
statistic = "Average"
threshold = "8000" # 80% of 10000 max
alarm_description = "WebSocket connections nearing capacity. Scale out needed."
alarm_actions = [aws_sns_topic.alerts.arn]
ok_actions = [aws_sns_topic.alerts.arn]
}
# Alert on high error rate
resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
alarm_name = "websocket-high-errors"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "2"
metric_name = "websocket.connection.errors"
namespace = "WebSocketService"
period = "300"
statistic = "Sum"
threshold = "100"
alarm_description = "WebSocket error rate is elevated"
alarm_actions = [aws_sns_topic.alerts.arn]
}
# Alert on low connection count (service may be down)
resource "aws_cloudwatch_metric_alarm" "low_connections" {
alarm_name = "websocket-low-connections"
comparison_operator = "LessThanThreshold"
evaluation_periods = "5"
metric_name = "websocket.connections.active"
namespace = "WebSocketService"
period = "60"
statistic = "Average"
threshold = "10"
alarm_description = "Unexpectedly low WebSocket connections - service may be unhealthy"
alarm_actions = [aws_sns_topic.critical_alerts.arn]
}
# ALB unhealthy hosts
resource "aws_cloudwatch_metric_alarm" "unhealthy_hosts" {
alarm_name = "websocket-unhealthy-hosts"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = "2"
metric_name = "UnHealthyHostCount"
namespace = "AWS/ApplicationELB"
period = "60"
statistic = "Average"
threshold = "0"
dimensions = {
TargetGroup = aws_lb_target_group.websocket.arn_suffix
LoadBalancer = aws_lb.main.arn_suffix
}
alarm_actions = [aws_sns_topic.critical_alerts.arn]
}10. Production Tips and Hardening
Connection Limit Configuration
# Linux OS-level file descriptor limits
# WebSocket connections consume file descriptors
# Default Linux limit is 1024 per process - WAY too low for WebSocket servers
# In /etc/security/limits.conf:
# ec2-user soft nofile 65536
# ec2-user hard nofile 65536
# In /etc/sysctl.conf:
# fs.file-max = 1000000
# net.core.somaxconn = 65536
# net.ipv4.tcp_max_syn_backlog = 65536
# net.core.netdev_max_backlog = 65536JVM Tuning for WebSocket Servers
# JVM flags for a WebSocket server with 10,000+ connections
JAVA_OPTS="\
# G1GC is best for WebSocket servers (low pause times)
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=200 \
-XX:G1HeapRegionSize=16m \
# Heap: WebSocket connections consume memory for buffers
# Each connection uses approx 50-100KB
# 10,000 connections = 500MB to 1GB heap for buffers alone
-Xms1g \
-Xmx3g \
# Exit on OOM instead of limping along in a broken state
-XX:+ExitOnOutOfMemoryError \
# Container awareness
-XX:+UseContainerSupport \
-XX:MaxRAMPercentage=75.0 \
# Network performance
-Djava.net.preferIPv4Stack=true \
# Better random number generation (important for SSL)
-Djava.security.egd=file:/dev/./urandom"Network Tuning Tips
# Tune Tomcat for WebSocket workloads (application.yml)
server:
tomcat:
# Max simultaneous WebSocket connections (Tomcat NIO)
max-connections: 10000
# Accept queue size - connections waiting to be accepted
accept-count: 200
threads:
max: 200 # Thread pool for handling HTTP (not WebSocket connections)
min-spare: 20
connection-timeout: 20000
# Keep-alive timeout for HTTP (not WebSocket)
keep-alive-timeout: 20000Security Hardening Checklist
Production WebSocket Security Checklist:
[ ] Use WSS (WebSocket Secure) only - never plain WS in production
[ ] Validate JWT on EVERY message, not just on connection
[ ] Set explicit allowed origins - never use wildcard "*"
[ ] Rate limit messages per connection
[ ] Limit maximum message size
[ ] Validate all incoming payloads with Bean Validation (@Valid)
[ ] Implement connection limits per user (prevent one user from opening 1000 connections)
[ ] Log all authentication failures
[ ] Monitor for unusual traffic patterns (DDoS on WebSocket endpoints)
[ ] Rotate JWT secrets regularly
[ ] Use AWS WAF with ALB to block malicious IPs
[ ] Enable ALB access logs for audit trail
[ ] Use VPC security groups to allow only ALB to reach ECS containers
[ ] Store secrets in AWS Secrets Manager, not environment variables
[ ] Enable CloudWatch logs with log retention policy