WebSockets Demystified - Part 4: Production and AWS

The Horizontal Scaling Problem
Redis Pub/Sub Broker Relay
AWS Architecture Overview
ALB WebSocket Configuration
ECS Fargate Deployment
AWS ElastiCache Redis Configuration
Graceful Shutdown
Health Checks and Monitoring
CloudWatch Alerts and Dashboards
Production Tips and Hardening

1. The Horizontal Scaling Problem

Why Horizontal Scaling Is Difficult

HTTP servers are typically stateless: any server can answer any request, so a load balancer can freely route requests round-robin.

WebSocket servers are stateful. Each server holds open TCP connections. Client A's session lives entirely on Server 1. If you route Client A's next message to Server 2, it has no idea who Client A is.

Without coordination:

Client A ----[connected to Server 1]
Client B ----[connected to Server 2]

Client A sends message intended for Client B:
Server 1 receives the message.
Server 1 cannot deliver to Client B because Client B's connection is on Server 2.
Message is lost.

With Redis Pub/Sub:

Server 1 receives message from Client A.
Server 1 publishes to Redis channel "room.general".
Server 2 is subscribed to "room.general".
Server 2 receives message from Redis.
Server 2 delivers it to Client B.
Message delivered!
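
The flow above can be sketched without Redis at all. The following is a minimal in-memory simulation, not the real relay: InMemoryBroker stands in for Redis Pub/Sub, and ServerNode, connect, onClientMessage and inboxOf are hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Stand-in for Redis Pub/Sub: fans each published message out to every subscriber. */
final class InMemoryBroker {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    void subscribe(String channel, Consumer<String> listener) {
        subscribers.computeIfAbsent(channel, c -> new ArrayList<>()).add(listener);
    }

    void publish(String channel, String message) {
        for (Consumer<String> listener : subscribers.getOrDefault(channel, List.of())) {
            listener.accept(message);
        }
    }
}

/** One WebSocket server instance; it can only reach clients connected to itself. */
final class ServerNode {
    private final InMemoryBroker broker;
    // clientId -> messages delivered over that client's (simulated) connection
    private final Map<String, List<String>> localClients = new HashMap<>();

    ServerNode(InMemoryBroker broker, String channel) {
        this.broker = broker;
        // Every instance subscribes, so every instance sees every message.
        broker.subscribe(channel, this::deliverLocally);
    }

    void connect(String clientId) {
        localClients.put(clientId, new ArrayList<>());
    }

    /** Called when a local client sends a message: publish instead of delivering directly. */
    void onClientMessage(String channel, String message) {
        broker.publish(channel, message);
    }

    private void deliverLocally(String message) {
        localClients.values().forEach(inbox -> inbox.add(message));
    }

    List<String> inboxOf(String clientId) {
        return localClients.get(clientId);
    }
}
```

With Client A connected to one ServerNode and Client B to another, calling onClientMessage on A's server puts the message in B's inbox, because both nodes subscribed to the same channel. A's server also delivers the message to its own clients, which mirrors the behaviour of the Redis relay in the next section.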

Sticky Sessions (Partial Solution)

A simpler but incomplete solution is sticky sessions (also called session affinity): the load balancer always routes a client to the same server instance.

ALB with sticky sessions:
  Client A --[stickiness cookie: srv1]--> always goes to Server 1
  Client B --[stickiness cookie: srv2]--> always goes to Server 2

This ensures that a client's WebSocket connection is always handled by one server.

Problems with sticky sessions alone:
1. If Server 1 dies, all its connected clients lose their connection.
2. Load is not evenly distributed (Server 1 might get all active users).
3. Server-side events (scheduled jobs, DB triggers) cannot be routed to the right server.
   Example: an order status update in the DB needs to be pushed to a user
   on Server 1, but the scheduler ran on Server 2.

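Sticky sessions are a fallback, not a substitute: they keep a client on one server but do nothing for delivery between servers.
Redis Pub/Sub is the scalable solution for cross-server delivery.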

2. Redis Pub/Sub Broker Relay

Spring's Built-in STOMP Broker Relay

Spring WebSocket includes a STOMP broker relay (StompBrokerRelayMessageHandler, enabled with enableStompBrokerRelay) that connects to an external STOMP-capable message broker such as RabbitMQ or ActiveMQ. Messages go through the external broker, enabling cross-server delivery. Redis does not speak STOMP, so the built-in relay cannot use it as a broker.

For pure Redis Pub/Sub integration (without a full STOMP broker like RabbitMQ), a straightforward approach is to publish messages to Redis from Spring and subscribe on all server instances.

Custom Redis Relay Implementation

// src/main/java/com/example/ws/relay/RedisMessageRelay.java
 
package com.example.ws.relay;
 
import com.example.ws.dto.RelayMessage;
import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.data.redis.connection.Message;
import org.springframework.data.redis.connection.MessageListener;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.messaging.simp.SimpMessagingTemplate;
import org.springframework.stereotype.Component;
 
/**
 * Relays messages across multiple WebSocket server instances via Redis Pub/Sub.
 *
 * HOW IT WORKS:
 * When Server 1 needs to broadcast to /topic/room.123:
 *   1. It publishes to Redis channel "ws:/topic/room.123" (prefix + destination)
 *   2. All server instances (including Server 1) are subscribed to this channel
 *   3. All servers receive the message from Redis
 *   4. All servers deliver it to their locally connected clients subscribed to /topic/room.123
 *
 * This ensures every server delivers the message to its own connected clients.
 * The originating server also delivers it locally (its own clients also get it).
 */
@Component
@RequiredArgsConstructor
@Slf4j
public class RedisMessageRelay implements MessageListener {
 
    private final SimpMessagingTemplate messagingTemplate;
    private final ObjectMapper objectMapper;
    private final StringRedisTemplate redisTemplate;
 
    // Channel prefix in Redis
    public static final String CHANNEL_PREFIX = "ws:";
 
    /**
     * Publishes a message to Redis so all server instances receive it.
     * Call this instead of sending through SimpMessagingTemplate directly;
     * onMessage() performs the local delivery on every instance.
     */
    public void publish(String destination, Object payload) {
        try {
            RelayMessage relayMessage = new RelayMessage();
            relayMessage.setDestination(destination);
            relayMessage.setPayload(objectMapper.writeValueAsString(payload));
            relayMessage.setPayloadType(payload.getClass().getName());
 
            String json = objectMapper.writeValueAsString(relayMessage);
            String channel = CHANNEL_PREFIX + destination;
 
            redisTemplate.convertAndSend(channel, json);
 
        } catch (Exception e) {
            log.error("Failed to publish message to Redis: {}", e.getMessage());
        }
    }
 
    /**
     * Publishes a user-specific message to Redis.
     */
    public void publishToUser(String userId, String queueDestination, Object payload) {
        try {
            RelayMessage relayMessage = new RelayMessage();
            relayMessage.setDestination(queueDestination);
            relayMessage.setUserId(userId);
            relayMessage.setPayload(objectMapper.writeValueAsString(payload));
            relayMessage.setPayloadType(payload.getClass().getName());
 
            String json = objectMapper.writeValueAsString(relayMessage);
            // Use user-specific channel so we can be selective
            String channel = CHANNEL_PREFIX + "user." + userId + queueDestination;
 
            redisTemplate.convertAndSend(channel, json);
 
        } catch (Exception e) {
            log.error("Failed to publish user message to Redis: {}", e.getMessage());
        }
    }
 
    /**
     * This method is called when THIS server instance receives a message from Redis.
     * It then delivers it to locally connected clients.
     */
    @Override
    public void onMessage(Message message, byte[] pattern) {
        try {
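            // Decode explicitly as UTF-8 rather than relying on the platform default charset
            String json = new String(message.getBody(), java.nio.charset.StandardCharsets.UTF_8);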
            RelayMessage relayMessage = objectMapper.readValue(json, RelayMessage.class);
 
            if (relayMessage.getUserId() != null) {
                // User-specific message
                messagingTemplate.convertAndSendToUser(
                    relayMessage.getUserId(),
                    relayMessage.getDestination(),
                    relayMessage.getPayload()
                );
            } else {
                // Topic broadcast
                messagingTemplate.convertAndSend(
                    relayMessage.getDestination(),
                    relayMessage.getPayload()
                );
            }
 
        } catch (Exception e) {
            log.error("Failed to process Redis relay message: {}", e.getMessage());
        }
    }
}

Redis Message Listener Configuration

// src/main/java/com/example/ws/config/RedisConfig.java
 
package com.example.ws.config;
 
import com.example.ws.relay.RedisMessageRelay;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.listener.PatternTopic;
import org.springframework.data.redis.listener.RedisMessageListenerContainer;
import org.springframework.data.redis.listener.adapter.MessageListenerAdapter;
 
@Configuration
public class RedisConfig {
 
    /**
     * Creates a Redis message listener container that subscribes to
     * all WebSocket relay channels matching the pattern "ws:*".
     *
     * When any message is published to a "ws:..." channel,
     * the RedisMessageRelay.onMessage() method is called.
     */
    @Bean
    public RedisMessageListenerContainer redisListenerContainer(
            RedisConnectionFactory connectionFactory,
            RedisMessageRelay relay) {
 
        RedisMessageListenerContainer container = new RedisMessageListenerContainer();
        container.setConnectionFactory(connectionFactory);
 
        // Subscribe to ALL WebSocket relay channels using a pattern
        // This server instance will receive all messages published to ws:* channels
        container.addMessageListener(
            new MessageListenerAdapter(relay),
            new PatternTopic(RedisMessageRelay.CHANNEL_PREFIX + "*")
        );
 
        return container;
    }
}
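
To make the channel naming concrete, here is a small standalone sketch that mirrors how the relay above builds channel names and how the "ws:*" pattern subscription covers them. RelayChannels and its method names are hypothetical helpers for illustration; the prefix check is a simplification of Redis glob matching that happens to be equivalent for a trailing "*".

```java
/** Mirrors the channel naming used by the relay: prefix "ws:" plus the STOMP destination. */
final class RelayChannels {
    static final String CHANNEL_PREFIX = "ws:";

    /** Channel for a topic broadcast, e.g. "/topic/room.123". */
    static String topicChannel(String destination) {
        return CHANNEL_PREFIX + destination;
    }

    /** Channel for a user-specific message, e.g. user "42" and "/queue/private". */
    static String userChannel(String userId, String queueDestination) {
        return CHANNEL_PREFIX + "user." + userId + queueDestination;
    }

    /** Simplified check for the "ws:*" subscription: any channel with the prefix matches. */
    static boolean matchesRelayPattern(String channel) {
        return channel.startsWith(CHANNEL_PREFIX);
    }
}
```

For example, topicChannel("/topic/room.123") yields "ws:/topic/room.123" and userChannel("42", "/queue/private") yields "ws:user.42/queue/private". Both match the pattern, so a single pattern subscription per instance is enough.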

Using the Relay in Services

// In ChatService - use relay instead of directly using messagingTemplate for broadcasts
 
@Service
@RequiredArgsConstructor
@Slf4j
public class ChatService {
 
    private final RedisMessageRelay relay;
    private final SimpMessagingTemplate messagingTemplate; // Still needed for local ops
    // ...
 
    @Transactional
    public void processAndBroadcast(String userId, String roomId, ChatMessageRequest request) {
        // Save to DB
        ChatMessage saved = messageRepository.save(buildMessage(userId, roomId, request));
        ChatMessageResponse response = buildResponse(saved, userId);
 
        // Publish via Redis relay - all server instances will deliver locally
        relay.publish("/topic/room." + roomId, response);
    }
 
    public void sendPrivateMessage(String targetUserId, ChatMessageResponse response) {
        // For user-specific messages, relay handles routing to correct server
        relay.publishToUser(targetUserId, "/queue/private", response);
    }
}

Relay Message DTO

// src/main/java/com/example/ws/dto/RelayMessage.java
 
package com.example.ws.dto;
 
import lombok.Data;
 
@Data
public class RelayMessage {
    private String destination;
    private String userId;     // null for topic broadcasts, set for user-specific messages
    private String payload;    // JSON serialized payload
    private String payloadType;
    private long timestamp = System.currentTimeMillis();
    private String originServerId; // For deduplication if needed
}
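
The DTO mentions deduplication only in passing. One possible approach, sketched here under assumptions, is a bounded window of recently seen message IDs. Note that RelayMessage has no message ID field: the messageId used below is hypothetical and would have to be added, and RecentMessageIds is an illustrative name rather than part of the series code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Remembers the most recent message IDs so a repeated delivery can be skipped. */
final class RecentMessageIds {
    private final Map<String, Boolean> seen;

    RecentMessageIds(final int capacity) {
        // Insertion-ordered map that evicts its oldest entry once capacity is exceeded.
        this.seen = new LinkedHashMap<String, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true the first time an ID is offered, false for a repeat still in the window. */
    synchronized boolean firstSeen(String messageId) {
        return seen.putIfAbsent(messageId, Boolean.TRUE) == null;
    }
}
```

A listener could call firstSeen(id) at the top of onMessage and return early when it yields false. The window size is a trade-off: too small and old duplicates slip through, too large and memory grows.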

3. AWS Architecture Overview

Production Architecture Diagram

                              Internet
                                 |
                    +------------------------+
                    |   AWS ALB (HTTPS/WSS)  |
                    |   - SSL Termination    |
                    |   - Stickiness Cookie  |
                    |   - WS Upgrade Pass    |
                    +------------------------+
                        |              |
          +-------------+              +-------------+
          |                                          |
+-----------------+                       +-----------------+
| ECS Task        |                       | ECS Task        |
| WebSocket       |<--- Redis Pub/Sub --->| WebSocket       |
| Spring Boot     |         |             | Spring Boot     |
| Instance 1      |         |             | Instance 2      |
+-----------------+         |             +-----------------+
          |                 |                       |
          |    +---------------------------+         |
          +--> |  AWS ElastiCache Redis    | <-------+
               |  (Cluster Mode Disabled)  |
               |  Primary + Replica        |
               +---------------------------+
                            |
               +---------------------------+
               |  Amazon RDS MySQL 8       |
               |  Multi-AZ, read replicas  |
               +---------------------------+
                            |
               +---------------------------+
               |  Amazon CloudWatch        |
               |  - WS connection count    |
               |  - Messages per second    |
               |  - Error rate             |
               +---------------------------+

Component Responsibilities

Component	Role
ALB	SSL termination, WebSocket upgrade pass-through, sticky sessions, health checks
ECS Fargate	Runs Spring Boot WebSocket containers, auto-scaling based on connection count
ElastiCache Redis	Cross-instance message relay, presence tracking, session storage
RDS MySQL	Persistent message storage, user data, room metadata
CloudWatch	Metrics, alarms, dashboards for operational visibility

4. ALB WebSocket Configuration

Key ALB Settings for WebSocket

ALB Target Group Settings for WebSocket:

Protocol:              HTTP (ALB terminates SSL, forwards HTTP to containers)
Port:                  8080
Health Check Path:     /actuator/health
Health Check Protocol: HTTP
Stickiness:            Enabled
Stickiness Type:       lb_cookie
Stickiness Duration:   86400 seconds (24 hours)

Why stickiness even with Redis?
- A native WebSocket connection does not strictly need it: the HTTP upgrade and every
  later frame travel over the same TCP connection, which stays pinned to one server.
- If you enable SockJS fallback transports (XHR streaming, long polling), a session
  spans several HTTP requests, and all of them must reach the server that holds the session.
- While the cookie is valid, reconnection attempts also go back to the same server.
- Redis handles the case where the server dies and the client reconnects to a new server.

ALB Timeout Configuration

ALB Idle Timeout: 3600 seconds (1 hour)

THE DEFAULT ALB IDLE TIMEOUT IS 60 SECONDS.
This is a critical production gotcha.
WebSocket connections are idle during periods of low traffic (heartbeats only).
If the ALB sees no data for longer than its idle timeout, it closes the connection, so the idle timeout must always exceed the heartbeat interval.

A 25-second heartbeat keeps the connection active even under the default, but leaves room for only one missed heartbeat.
Setting the idle timeout to 1 hour (the ALB maximum is 4000 seconds) gives a wide margin for delayed or dropped heartbeats.
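
The relationship between heartbeat interval and idle timeout is simple arithmetic, sketched below. IdleTimeoutCheck and survives are hypothetical names, and the model assumes heartbeats are the only traffic on the connection.

```java
import java.time.Duration;

final class IdleTimeoutCheck {
    /**
     * True if the connection survives the given number of consecutive missed heartbeats:
     * the longest silent gap, (missed + 1) * heartbeat, must stay below the idle timeout.
     */
    static boolean survives(Duration heartbeat, Duration idleTimeout, int missedHeartbeats) {
        Duration longestGap = heartbeat.multipliedBy(missedHeartbeats + 1L);
        return longestGap.compareTo(idleTimeout) < 0;
    }
}
```

With a 25-second heartbeat and the 60-second default, one missed heartbeat leaves a 50-second gap and the connection survives, but two missed heartbeats leave a 75-second gap and the ALB closes it. With a 3600-second timeout the same connection tolerates far more.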

Terraform Configuration for ALB

# infrastructure/alb.tf
 
resource "aws_lb_listener" "websocket_https" {
  load_balancer_arn = aws_lb.main.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate.main.arn
 
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.websocket.arn
  }
}
 
resource "aws_lb_target_group" "websocket" {
  name             = "websocket-tg"
  port             = 8080
  protocol         = "HTTP"
  vpc_id           = aws_vpc.main.id
  target_type      = "ip" # Required for ECS Fargate
 
  health_check {
    path                = "/actuator/health"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 15
    matcher             = "200"
  }
 
  stickiness {
    type            = "lb_cookie"
    cookie_duration = 86400
    enabled         = true
  }
 
  # Create the replacement target group before destroying the old one.
  # (The idle timeout is not a target group setting; it is set on aws_lb below.)
  lifecycle {
    create_before_destroy = true
  }
}
 
resource "aws_lb" "main" {
  name               = "websocket-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = aws_subnet.public[*].id
 
  # CRITICAL: Set idle timeout high for WebSocket workloads
  idle_timeout = 3600
 
  access_logs {
    bucket  = aws_s3_bucket.alb_logs.id
    enabled = true
  }
}

5. ECS Fargate Deployment

ECS Task Definition

// ecs-task-definition.json
 
{
    "family": "websocket-service",
    "networkMode": "awsvpc",
    "requiresCompatibilities": ["FARGATE"],
    "cpu": "1024",
    "memory": "2048",
    "executionRoleArn": "arn:aws:iam::ACCOUNT_ID:role/ecsTaskExecutionRole",
    "taskRoleArn": "arn:aws:iam::ACCOUNT_ID:role/websocketTaskRole",
    "containerDefinitions": [
        {
            "name": "websocket-service",
            "image": "ACCOUNT_ID.dkr.ecr.us-east-1.amazonaws.com/websocket-service:latest",
            "portMappings": [
                {
                    "containerPort": 8080,
                    "protocol": "tcp"
                }
            ],
            "environment": [
                {
                    "name": "SPRING_PROFILES_ACTIVE",
                    "value": "production"
                },
                {
                    "name": "JAVA_OPTS",
                    "value": "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xms512m -Xmx1536m -XX:+ExitOnOutOfMemoryError"
                }
            ],
            "secrets": [
                {
                    "name": "DB_USERNAME",
                    "valueFrom": "arn:aws:secretsmanager:us-east-1:ACCOUNT_ID:secret:websocket/db-username"
                },
                {
                    "name": "DB_PASSWORD",
                    "valueFrom": "arn:aws:secretsmanager:us-east-1:ACCOUNT_ID:secret:websocket/db-password"
                },
                {
                    "name": "REDIS_HOST",
                    "valueFrom": "arn:aws:ssm:us-east-1:ACCOUNT_ID:parameter/websocket/redis-host"
                },
                {
                    "name": "JWT_SECRET",
                    "valueFrom": "arn:aws:secretsmanager:us-east-1:ACCOUNT_ID:secret:websocket/jwt-secret"
                }
            ],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/websocket-service",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "ecs"
                }
            },
            "healthCheck": {
                "command": [
                    "CMD-SHELL",
                    "curl -f http://localhost:8080/actuator/health || exit 1"
                ],
                "interval": 15,
                "timeout": 5,
                "retries": 3,
                "startPeriod": 60
            },
            "ulimits": [
                {
                    "name": "nofile",
                    "softLimit": 65536,
                    "hardLimit": 65536
                }
            ]
        }
    ]
}

Auto-Scaling Configuration

# infrastructure/autoscaling.tf
 
# Scale on active WebSocket connections (custom CloudWatch metric)
resource "aws_appautoscaling_policy" "websocket_connections" {
  name               = "websocket-connection-scaling"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.websocket.resource_id
  scalable_dimension = aws_appautoscaling_target.websocket.scalable_dimension
  service_namespace  = aws_appautoscaling_target.websocket.service_namespace
 
  target_tracking_scaling_policy_configuration {
    # Target: keep average active connections per task around 5000.
    # Target tracking scales out when the average rises above the target and
    # scales in, more conservatively, when it stays well below the target.
    target_value = 5000
 
    customized_metric_specification {
      metric_name = "websocket.connections.active"
      namespace   = "WebSocketService"
      statistic   = "Average"
    }
 
    scale_in_cooldown  = 300  # 5 minutes - be slow to scale in (don't drop connections)
    scale_out_cooldown = 60   # 1 minute - be quick to scale out
  }
}
 
resource "aws_appautoscaling_target" "websocket" {
  max_capacity       = 20
  min_capacity       = 2
  resource_id        = "service/main-cluster/websocket-service"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}
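
As a rough worked example of what this policy aims for, the steady-state task count is approximately the total connection count divided by the per-task target, clamped to the configured minimum and maximum. This is a back-of-the-envelope sketch, not the algorithm Application Auto Scaling actually runs; ScalingEstimate and desiredTasks are hypothetical names.

```java
final class ScalingEstimate {
    /** Approximate task count needed to keep the per-task average at or below the target. */
    static int desiredTasks(long totalConnections, int targetPerTask, int minTasks, int maxTasks) {
        long needed = (totalConnections + targetPerTask - 1) / targetPerTask; // ceiling division
        return (int) Math.max(minTasks, Math.min(maxTasks, needed));
    }
}
```

With the values above (target 5000, min 2, max 20), 42,000 connections call for about 9 tasks, an idle service stays at 2, and anything beyond 100,000 connections is capped at 20 tasks.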

Dockerfile

# Dockerfile
 
FROM eclipse-temurin:17-jre-alpine AS runner
 
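# curl is used by the ECS container health check; the Alpine JRE image may not include it
RUN apk add --no-cache curl
 
# Create non-root user for security
RUN addgroup -S wsapp && adduser -S wsapp -G wsapp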
 
WORKDIR /app
 
# Copy the fat jar
COPY target/websocket-service-*.jar app.jar
 
# Set ownership
RUN chown wsapp:wsapp /app/app.jar
 
USER wsapp
 
# JVM options for container awareness
# UseContainerSupport ensures JVM respects container memory limits
ENV JAVA_OPTS="-XX:+UseContainerSupport \
               -XX:MaxRAMPercentage=75.0 \
               -XX:+UseG1GC \
               -XX:MaxGCPauseMillis=200 \
               -XX:+ExitOnOutOfMemoryError \
               -Djava.security.egd=file:/dev/./urandom"
 
EXPOSE 8080
 
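# "exec" replaces the shell with the JVM so that SIGTERM from ECS reaches Java
# and graceful shutdown can run
ENTRYPOINT ["sh", "-c", "exec java $JAVA_OPTS -jar /app/app.jar"]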

6. AWS ElastiCache Redis Configuration

ElastiCache Setup for WebSocket

# infrastructure/elasticache.tf
 
# Use Redis (non-cluster mode) for WebSocket pub/sub
# Cluster mode complicates pub/sub channel routing
resource "aws_elasticache_replication_group" "websocket" {
  replication_group_id = "websocket-redis"
  description          = "Redis for WebSocket pub/sub and presence"
 
  engine               = "redis"
  engine_version       = "7.0"
  node_type            = "cache.r6g.large"  # 13 GB RAM, good for pub/sub
 
  # Primary + 1 replica for HA
  num_cache_clusters   = 2
 
  # Multi-AZ for high availability
  multi_az_enabled     = true
  automatic_failover_enabled = true
 
  # Encryption at rest and in transit
  at_rest_encryption_enabled    = true
  transit_encryption_enabled    = true
 
  subnet_group_name    = aws_elasticache_subnet_group.main.name
  security_group_ids   = [aws_security_group.redis.id]
 
  # Maintenance window
  maintenance_window   = "sun:05:00-sun:06:00"
 
  # Backup
  snapshot_retention_limit = 7
  snapshot_window          = "03:00-04:00"
 
  # Parameter group for optimizing pub/sub
  parameter_group_name = aws_elasticache_parameter_group.websocket.name
}
 
resource "aws_elasticache_parameter_group" "websocket" {
  name   = "websocket-redis-params"
  family = "redis7"
 
  # Tune for pub/sub workloads
  parameter {
    name  = "hz"
    value = "20"  # Increase server cron frequency (default 10)
  }
 
  parameter {
    name  = "tcp-keepalive"
    value = "60"
  }
 
  parameter {
    name  = "timeout"
    value = "300"  # Close idle connections after 5 minutes
  }
 
  # Pub/sub notification settings
  parameter {
    name  = "notify-keyspace-events"
    value = ""  # Disable keyspace events unless needed (saves CPU)
  }
}

Spring Redis Configuration for Production

# application-production.yml
 
spring:
  data:
    redis:
      host: ${REDIS_HOST}
      port: 6379
      ssl:
        enabled: true
      timeout: 2000ms
      lettuce:
        pool:
          max-active: 20          # Max connections in the pool
          max-idle: 10            # Max idle connections
          min-idle: 5             # Min idle connections kept alive
          max-wait: 2000ms        # Max wait for a connection from pool
        shutdown-timeout: 100ms
        # lettuce.cluster.refresh.* settings apply only to cluster-mode Redis.
        # This setup uses cluster mode disabled, so they are omitted here.

7. Graceful Shutdown

Why Graceful Shutdown Matters

When ECS updates a task (new deployment), it sends SIGTERM to the running container. Without graceful shutdown:

1. All WebSocket connections are immediately dropped.
2. Clients receive an abrupt disconnect.
3. Clients attempt reconnection - thousands simultaneously, causing a reconnection thundering herd.
4. Messages in flight are lost.

With graceful shutdown:

1. New connections stop being accepted.
2. Existing connections are notified to reconnect.
3. Connections are given time to drain.
4. Service shuts down cleanly.
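
On the client side, the thundering herd is usually softened by spreading reconnect attempts out in time. The sketch below shows exponential backoff with full jitter as one common way to do that. It is written in Java for consistency with the rest of the series, although real clients are often browsers; ReconnectBackoff and delayMillis are hypothetical names.

```java
import java.util.Random;

final class ReconnectBackoff {
    /**
     * Random delay in [0, min(cap, base * 2^attempt)] milliseconds ("full jitter").
     * attempt is 0 for the first retry.
     */
    static long delayMillis(int attempt, long baseMillis, long capMillis, Random random) {
        int shift = Math.max(0, Math.min(attempt, 30)); // bound the shift to avoid overflow
        long ceiling = Math.min(capMillis, baseMillis << shift);
        return (long) (random.nextDouble() * (ceiling + 1));
    }
}
```

With a 1-second base and a 30-second cap, the first retry waits somewhere between 0 and 1 second, the fifth between 0 and 16 seconds, and later retries between 0 and 30 seconds. Because each client draws its own random delay, thousands of clients dropped at the same moment do not all reconnect at the same moment.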

Graceful Shutdown Implementation

// src/main/java/com/example/ws/lifecycle/GracefulShutdownHandler.java
 
package com.example.ws.lifecycle;
 
import com.example.ws.dto.ServerShutdownNotice;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.context.ApplicationListener;
import org.springframework.context.event.ContextClosedEvent;
import org.springframework.messaging.simp.SimpMessagingTemplate;
import org.springframework.stereotype.Component;
 
import java.util.concurrent.TimeUnit;
 
/**
 * Handles graceful shutdown of WebSocket connections.
 *
 * Listens for ContextClosedEvent, which Spring publishes at the start of
 * context shutdown, while the messaging beans are still usable.
 *
 * Steps:
 * 1. Broadcast shutdown notice to all connected clients (they should reconnect elsewhere)
 * 2. Wait for in-flight messages to complete
 * 3. Allow time for clients to reconnect to other servers
 */
@Component
@RequiredArgsConstructor
@Slf4j
public class GracefulShutdownHandler implements ApplicationListener<ContextClosedEvent> {
 
    private final SimpMessagingTemplate messagingTemplate;
 
    @Override
    public void onApplicationEvent(ContextClosedEvent event) {
        log.info("Initiating graceful WebSocket shutdown...");
 
        try {
            // 1. Notify all connected clients that this server is shutting down
            // Clients should interpret this as a signal to reconnect
            ServerShutdownNotice notice = new ServerShutdownNotice(
                "Server maintenance. Please reconnect in a few seconds.",
                30 // Reconnect in 30 seconds to allow other servers to be ready
            );
 
            messagingTemplate.convertAndSend("/topic/server.events", notice);
 
            // 2. Wait for message to be delivered to clients
            TimeUnit.SECONDS.sleep(2);
 
            // 3. Wait for active requests to drain
            // Spring's actual graceful shutdown (server.shutdown=graceful) handles this
            // at the HTTP layer. We add extra time here for WebSocket draining.
            TimeUnit.SECONDS.sleep(5);
 
            log.info("Graceful WebSocket shutdown complete.");
 
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            log.warn("Graceful shutdown interrupted");
        }
    }
}

Application Properties for Graceful Shutdown

# application.yml
 
server:
  shutdown: graceful  # Wait for in-flight requests before shutting down
 
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s  # Allow up to 30 seconds for shutdown

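ECS Stop Timeout

ECS has no pre-stop hook. After sending SIGTERM it waits for the container's stopTimeout (30 seconds by default, up to 120 seconds on Fargate) and then sends SIGKILL. Set stopTimeout in the container definition so that it exceeds Spring's graceful shutdown period:

// In the container definition inside the ECS task definition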
{
  "stopTimeout": 60
}

8. Health Checks and Monitoring

Custom WebSocket Health Indicator

// src/main/java/com/example/ws/health/WebSocketHealthIndicator.java
 
package com.example.ws.health;
 
import lombok.RequiredArgsConstructor;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.data.redis.connection.RedisConnection;
import org.springframework.data.redis.core.RedisCallback;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;
 
@Component("websocket")
@RequiredArgsConstructor
public class WebSocketHealthIndicator implements HealthIndicator {
 
    private final StringRedisTemplate redisTemplate;
 
    @Override
    public Health health() {
        try {
            // Check Redis connectivity (critical for pub/sub).
            // execute() borrows a connection and releases it afterwards,
            // so repeated health checks do not leak connections.
            String pong = redisTemplate.execute((RedisCallback<String>) RedisConnection::ping);
 
            if (!"PONG".equals(pong)) {
                return Health.down()
                    .withDetail("redis", "Unexpected ping response: " + pong)
                    .build();
            }
 
            return Health.up()
                .withDetail("redis", "Connected")
                .build();
 
        } catch (Exception e) {
            return Health.down()
                .withDetail("error", e.getMessage())
                .build();
        }
    }
}

Metrics Configuration with Micrometer

// src/main/java/com/example/ws/metrics/WebSocketMetrics.java
 
package com.example.ws.metrics;
 
import io.micrometer.core.instrument.*;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;
 
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;
 
@Component
@Slf4j
public class WebSocketMetrics {
 
    private final AtomicInteger activeConnections = new AtomicInteger(0);
    private final AtomicLong totalMessagesReceived = new AtomicLong(0);
    private final AtomicLong totalMessagesSent = new AtomicLong(0);
    private final AtomicInteger errorCount = new AtomicInteger(0);
 
    private final Timer messageProcessingTimer;
    private final Counter connectionErrorCounter;
 
    public WebSocketMetrics(MeterRegistry registry) {
 
        // Gauge - reports current value
        Gauge.builder("websocket.connections.active", activeConnections, AtomicInteger::get)
            .description("Active WebSocket connections")
            .register(registry);
 
        Gauge.builder("websocket.messages.received.total", totalMessagesReceived, AtomicLong::get)
            .description("Total WebSocket messages received since startup")
            .register(registry);
 
        Gauge.builder("websocket.messages.sent.total", totalMessagesSent, AtomicLong::get)
            .description("Total WebSocket messages sent since startup")
            .register(registry);
 
        // Timer - tracks latency distribution
        messageProcessingTimer = Timer.builder("websocket.message.processing.time")
            .description("Time to process and broadcast a WebSocket message")
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry);
 
        // Counter - increments
        connectionErrorCounter = Counter.builder("websocket.connection.errors")
            .description("WebSocket connection errors")
            .register(registry);
    }
 
    public void connectionOpened() { activeConnections.incrementAndGet(); }
    public void connectionClosed() { activeConnections.decrementAndGet(); }
    public void messageReceived() { totalMessagesReceived.incrementAndGet(); }
    public void messageSent() { totalMessagesSent.incrementAndGet(); }
    public void connectionError() { connectionErrorCounter.increment(); }
 
    public Timer.Sample startMessageTimer() { return Timer.start(); }
    public void stopMessageTimer(Timer.Sample sample) { sample.stop(messageProcessingTimer); }
}

9. CloudWatch Alerts and Dashboards

CloudWatch Alarms (Terraform)

# infrastructure/cloudwatch_alarms.tf
 
# Alert when active connections exceed 80% of max capacity
resource "aws_cloudwatch_metric_alarm" "high_connections" {
  alarm_name          = "websocket-high-connections"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "3"
  metric_name         = "websocket.connections.active"
  namespace           = "WebSocketService"
  period              = "60"
  statistic           = "Average"
  threshold           = "8000"  # 80% of 10000 max
  alarm_description   = "WebSocket connections nearing capacity. Scale out needed."
  alarm_actions       = [aws_sns_topic.alerts.arn]
  ok_actions          = [aws_sns_topic.alerts.arn]
}
 
# Alert on high error rate
resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
  alarm_name          = "websocket-high-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "websocket.connection.errors"
  namespace           = "WebSocketService"
  period              = "300"
  statistic           = "Sum"
  threshold           = "100"
  alarm_description   = "WebSocket error rate is elevated"
  alarm_actions       = [aws_sns_topic.alerts.arn]
}
 
# Alert on low connection count (service may be down)
resource "aws_cloudwatch_metric_alarm" "low_connections" {
  alarm_name          = "websocket-low-connections"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = "5"
  metric_name         = "websocket.connections.active"
  namespace           = "WebSocketService"
  period              = "60"
  statistic           = "Average"
  threshold           = "10"
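  alarm_description   = "Unexpectedly low WebSocket connections - service may be unhealthy"
  # A service that is down publishes no datapoints, so treat missing data as breaching
  treat_missing_data  = "breaching"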
  alarm_actions       = [aws_sns_topic.critical_alerts.arn]
}
 
# ALB unhealthy hosts
resource "aws_cloudwatch_metric_alarm" "unhealthy_hosts" {
  alarm_name          = "websocket-unhealthy-hosts"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "UnHealthyHostCount"
  namespace           = "AWS/ApplicationELB"
  period              = "60"
  statistic           = "Average"
  threshold           = "0"
 
  dimensions = {
    TargetGroup  = aws_lb_target_group.websocket.arn_suffix
    LoadBalancer = aws_lb.main.arn_suffix
  }
 
  alarm_actions = [aws_sns_topic.critical_alerts.arn]
}
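
The interaction of period, evaluation_periods and threshold can be easy to misread. The sketch below is a simplified model of a GreaterThanThreshold alarm in which every one of the most recent evaluation periods must breach; it ignores missing-data handling and the separate datapoints-to-alarm setting. AlarmEvaluation and inAlarm are hypothetical names.

```java
final class AlarmEvaluation {
    /** True if each of the most recent evaluationPeriods datapoints is above the threshold. */
    static boolean inAlarm(double[] datapoints, double threshold, int evaluationPeriods) {
        if (datapoints.length < evaluationPeriods) {
            return false; // not enough data yet
        }
        for (int i = datapoints.length - evaluationPeriods; i < datapoints.length; i++) {
            if (datapoints[i] <= threshold) {
                return false;
            }
        }
        return true;
    }
}
```

For the high_connections alarm (threshold 8000, 3 periods of 60 seconds), per-minute averages of 8500, 8200 and 9000 would raise the alarm, while 8500, 7900 and 9000 would not, because one dip below the threshold breaks the run.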

10. Production Tips and Hardening

Connection Limit Configuration

# Linux OS-level file descriptor limits
# WebSocket connections consume file descriptors
# Default Linux limit is 1024 per process - WAY too low for WebSocket servers
 
# In /etc/security/limits.conf (EC2 hosts; on Fargate, use the ulimits block of the task definition):
# ec2-user soft nofile 65536
# ec2-user hard nofile 65536
 
# In /etc/sysctl.conf:
# fs.file-max = 1000000
# net.core.somaxconn = 65536
# net.ipv4.tcp_max_syn_backlog = 65536
# net.core.netdev_max_backlog = 65536

JVM Tuning for WebSocket Servers

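# JVM flags for a WebSocket server with 10,000+ connections.
# Comments cannot go inside the quoted string, so each flag is explained first.
#
# -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:G1HeapRegionSize=16m
#     G1GC is a good fit for WebSocket servers (low pause times).
# -Xms1g -Xmx3g
#     Heap: WebSocket connections consume memory for buffers.
#     Each connection uses approx 50-100KB.
#     10,000 connections = 500MB to 1GB heap for buffers alone.
#     The container memory limit must be larger than -Xmx.
# -XX:+ExitOnOutOfMemoryError
#     Exit on OOM instead of limping along in a broken state.
# -XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0
#     Container awareness. An explicit -Xmx takes precedence over
#     MaxRAMPercentage, so in practice choose one or the other.
# -Djava.net.preferIPv4Stack=true
#     Network performance.
# -Djava.security.egd=file:/dev/./urandom
#     Better random number generation (important for SSL).
 
JAVA_OPTS="\
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -XX:G1HeapRegionSize=16m \
  -Xms1g \
  -Xmx3g \
  -XX:+ExitOnOutOfMemoryError \
  -XX:+UseContainerSupport \
  -XX:MaxRAMPercentage=75.0 \
  -Djava.net.preferIPv4Stack=true \
  -Djava.security.egd=file:/dev/./urandom"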
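
The heap estimate in the comments is plain multiplication, shown here as a tiny worked example. The 50-100KB per-connection figure is the article's own rough estimate, and BufferEstimate and bufferMiB are hypothetical names.

```java
final class BufferEstimate {
    /** Total buffer memory in MiB for a connection count and a per-connection size in KiB. */
    static double bufferMiB(int connections, int perConnectionKiB) {
        return connections * (double) perConnectionKiB / 1024.0;
    }
}
```

For 10,000 connections this gives roughly 488 MiB at 50 KiB per connection and roughly 977 MiB at 100 KiB, which is where the "500MB to 1GB" range comes from. Measure your own per-connection footprint before sizing the heap, since it depends on buffer settings and message sizes.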

Network Tuning Tips

# Tune Tomcat for WebSocket workloads (application.yml)
server:
  tomcat:
    # Max simultaneous WebSocket connections (Tomcat NIO)
    max-connections: 10000
    # Accept queue size - connections waiting to be accepted
    accept-count: 200
    threads:
      max: 200       # Thread pool for handling HTTP (not WebSocket connections)
      min-spare: 20
    connection-timeout: 20000
    # Keep-alive timeout for HTTP (not WebSocket)
    keep-alive-timeout: 20000

Security Hardening Checklist

Production WebSocket Security Checklist:

[ ] Use WSS (WebSocket Secure) only - never plain WS in production
[ ] Validate JWT on EVERY message, not just on connection
[ ] Set explicit allowed origins - never use wildcard "*"
[ ] Rate limit messages per connection
[ ] Limit maximum message size
[ ] Validate all incoming payloads with Bean Validation (@Valid)
[ ] Implement connection limits per user (prevent one user from opening 1000 connections)
[ ] Log all authentication failures
[ ] Monitor for unusual traffic patterns (DDoS on WebSocket endpoints)
[ ] Rotate JWT secrets regularly
[ ] Use AWS WAF with ALB to block malicious IPs
[ ] Enable ALB access logs for audit trail
[ ] Use VPC security groups to allow only ALB to reach ECS containers
[ ] Store secrets in AWS Secrets Manager, not as plaintext environment variables in the task definition
[ ] Enable CloudWatch logs with log retention policy
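
The "rate limit messages per connection" item is often implemented with a token bucket. The following is a minimal sketch of that technique, not code from this series: TokenBucket and tryAcquire are hypothetical names, and the caller passes in the current time (for example System.nanoTime()) so the logic stays easy to test.

```java
/** Token bucket: allows short bursts up to capacity, then a steady refill rate. */
final class TokenBucket {
    private final double capacity;
    private final double refillPerSecond;
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double capacity, double refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity; // start full so a new connection can burst
        this.lastRefillNanos = nowNanos;
    }

    /** Returns true if a message may be processed at the given time, consuming one token. */
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSeconds = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillPerSecond);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

One bucket per WebSocket session, created on connect and discarded on disconnect, is enough for per-connection limiting. A message handler would call tryAcquire(System.nanoTime()) and drop or reject the message when it returns false.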

Next: Part 5 - Pitfalls and Trade-offs
