WebSockets Demystified - Part 5: Pitfalls, Trade-offs, and Anti-Patterns


Critical Production Pitfalls
Anti-Patterns to Avoid
Real Production Challenges and Solutions
Trade-Off Analysis
Architectural Decision Framework
Load Testing WebSocket Systems
Debugging Guide
Industry Practices Summary

1. Critical Production Pitfalls

Pitfall 1: The ALB Idle Timeout Trap

What happens:

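Symptom: WebSocket connections silently drop after about 60 seconds without traffic.
Users report "chat disconnects randomly" or "dashboard stops updating."

Root cause: The AWS ALB default idle timeout is 60 seconds.
The ALB closes any connection on which no data has been sent or received for that long.
A 25-second heartbeat protects you only if heartbeats are actually on the wire:
if they are disabled, were never negotiated in the STOMP CONNECT frame, or are
spaced further apart than the timeout, the ALB treats the connection as idle and closes it.

Fix:

Keep the heartbeat interval well below the ALB idle timeout, and raise the
timeout to 3600 seconds (1 hour) for headroom.
The ALB considers a connection idle only when NO data flows in either direction.
WebSocket and STOMP heartbeats DO count as data and reset the idle timer.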

AWS Console: EC2 > Load Balancers > your-alb > Attributes > Idle timeout
Terraform: idle_timeout = 3600 in aws_lb resource.

Pitfall 2: The Thundering Herd on Reconnect

What happens:

Scenario: A server instance is restarted (deployment, crash, or scale-down).
All 5,000 clients connected to it simultaneously disconnect.
All 5,000 clients have reconnection logic: "reconnect after 1 second."
All 5,000 clients reconnect at exactly the same time.
Your remaining servers receive 5,000 connection requests at T+1 second.
This spike overwhelms servers. They become slow/unresponsive.
More connections fail. More reconnects. Cascade failure.

Fix: Exponential Backoff with Jitter

// JavaScript client reconnection
// connect() is the app's own function that opens the socket. On a failed
// attempt it should call reconnect(attemptNumber + 1); after a successful
// open, the attempt counter should reset to 0.
function reconnect(attemptNumber) {
    const baseDelay = 1000;
    const maxDelay = 60000;

    // Exponential backoff: 1s, 2s, 4s, 8s, 16s, ... capped at 60s
    const exponentialDelay = Math.min(baseDelay * Math.pow(2, attemptNumber), maxDelay);

    // Add random jitter proportional to the delay, so the spread widens with each retry
    // Without jitter: 5000 clients all reconnect at exactly T+1s
    // With jitter: 5000 clients spread their first retry uniformly over T+1s to T+2s
    const jitter = Math.random() * exponentialDelay;

    const totalDelay = exponentialDelay + jitter;

    setTimeout(() => connect(), totalDelay);
}

Pitfall 3: Memory Leak from Zombie Sessions

What happens:

Scenario: A mobile user's network drops suddenly (enters subway, phone dies).
No TCP FIN is sent (ungraceful disconnect).
Server's TCP stack keeps the connection open (TCP half-open).
Server's WebSocket session map still holds the session.
Over hours, thousands of zombie sessions accumulate.
Server memory grows until OOM crash.

Fix: Heartbeat + Session Cleanup

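// In WebSocketConfig.java - configure heartbeat
// heartbeatScheduler is assumed to be a TaskScheduler bean defined elsewhere;
// the simple broker needs one in order to send heartbeats.
.enableSimpleBroker("/topic", "/queue")
    .setHeartbeatValue(new long[]{25000, 25000}) // Both directions, 25 second interval
    .setTaskScheduler(heartbeatScheduler);
 
// Heartbeat behavior:
// 1. Server sends a heartbeat to each client after 25 seconds of outbound silence.
// 2. If nothing (message or heartbeat) arrives from a client within the broker's
//    read timeout, which is a multiple of the 25-second interval, the session closes.
// 3. The SessionDisconnectEvent fires.
// 4. PresenceService.markOffline() cleans up Redis.
// 5. Memory is freed.
 
// Result: Zombie sessions are detected and cleaned up within a minute or two
// instead of accumulating for hours.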
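
The detection logic behind heartbeats can be sketched in plain Java. This is a minimal illustration, not Spring's implementation: the class name SessionLivenessTracker, its methods, and the timeout passed to it are all assumptions made for the example. Each inbound frame (including heartbeats) refreshes a timestamp, and a periodic sweep evicts sessions that have been silent for longer than the timeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: tracks when each session was last heard from. */
class SessionLivenessTracker {
    private final Map<String, Long> lastSeenMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    SessionLivenessTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Call on every inbound frame, heartbeats included. */
    void touch(String sessionId, long nowMillis) {
        lastSeenMillis.put(sessionId, nowMillis);
    }

    /** Removes and returns the sessions that have been silent for longer than the timeout. */
    List<String> evictStale(long nowMillis) {
        List<String> evicted = new ArrayList<>();
        lastSeenMillis.forEach((sessionId, lastSeen) -> {
            // remove(key, value) only succeeds if the session was not touched in the meantime
            if (nowMillis - lastSeen > timeoutMillis
                    && lastSeenMillis.remove(sessionId, lastSeen)) {
                evicted.add(sessionId);
            }
        });
        return evicted;
    }

    int size() {
        return lastSeenMillis.size();
    }
}
```

A scheduler would call evictStale(System.currentTimeMillis()) every few seconds and close each returned session, which is the point where presence cleanup would hook in.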

Pitfall 4: Not Persisting Messages Before Broadcasting

What happens:

Scenario:
1. User A sends a message.
2. Service broadcasts it to /topic/room.123 via WebSocket.
3. Service tries to save to MySQL - DB is temporarily down.
4. Exception thrown. Save fails.
5. Everyone in the room saw User A's message, but it was never stored.

This leaves clients and the database inconsistent: connected clients saw the message, but the DB doesn't have it.
On page reload, the message is gone. Users are confused.

Fix: DB first, then broadcast

@Transactional
public void processAndBroadcast(String userId, String roomId, ChatMessageRequest request) {
    // Step 1: Save to DB FIRST (inside the transaction)
    ChatMessage saved = messageRepository.save(buildMessage(userId, roomId, request));
 
    // Step 2: After successful save, broadcast
    // Use @TransactionalEventListener to ensure DB is committed BEFORE broadcasting
    eventPublisher.publishEvent(new MessageSavedEvent(this, saved));
}
 
@TransactionalEventListener(phase = TransactionPhase.AFTER_COMMIT)
public void onMessageSaved(MessageSavedEvent event) {
    // This fires ONLY AFTER the DB transaction committed successfully
    // If DB save failed, this never fires, and no broadcast happens
    ChatMessageResponse response = buildResponse(event.getMessage());
    messagingTemplate.convertAndSend("/topic/room." + event.getMessage().getRoomId(), response);
}

Pitfall 5: Wildcard Origin - A Security Hole

What happens:

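Anti-pattern:
registry.addEndpoint("/ws")
    .setAllowedOrigins("*");  // NEVER DO THIS IN PRODUCTION

This allows ANY website to open a WebSocket connection to your server from a
logged-in user's browser, with that user's cookies attached. Browsers do not
apply the same-origin policy to WebSocket handshakes, so the server-side Origin
check is the only barrier. The attack is known as Cross-Site WebSocket
Hijacking (CSWSH), the WebSocket equivalent of CSRF.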

Example attack:
1. Evil website at evil.com has: new WebSocket("wss://bank.com/ws")
2. Browser sends cookies (session, token) with the request.
3. Connection is accepted because origins are not checked.
4. Attacker has full WebSocket access to bank.com on behalf of the victim.

Fix:

// Always specify exact origins
registry.addEndpoint("/ws")
    .setAllowedOrigins(
        "https://app.example.com",
        "https://www.example.com"
    );
 
// For local development only:
// registry.addEndpoint("/ws").setAllowedOriginPatterns("http://localhost:*")

Pitfall 6: Blocking the WebSocket Thread with Slow DB Operations

What happens:

@MessageMapping("/chat.send/{roomId}")
public void sendMessage(@Payload ChatMessageRequest request, Principal principal) {
    // BAD: This runs on the WebSocket thread pool
    // If this DB call takes 500ms, the thread is blocked
    // With 20 threads in the pool, only 20 simultaneous messages can be processed
    // 21st message waits. 100th message times out.
    chatMessageRepository.save(buildMessage(request, principal.getName()));
    broadcastToRoom(request.getRoomId(), request);
}

Fix: Use Async Processing

@MessageMapping("/chat.send/{roomId}")
public void sendMessage(
        @Payload ChatMessageRequest request,
        Principal principal) {
 
    // Queue the work to be processed asynchronously
    // The WebSocket thread returns immediately
    chatService.processAsync(principal.getName(), request.getRoomId(), request);
}
 
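// In ChatService (a separate bean, so the call goes through Spring's async proxy;
// @EnableAsync must be present on a configuration class):
@Async("websocketTaskExecutor")
public void processAsync(String userId, String roomId, ChatMessageRequest request) {
    // This runs on a separate thread pool, not the WebSocket thread
    // Note: messages may now complete out of order (see Challenge 2)
    ChatMessage saved = messageRepository.save(buildMessage(userId, roomId, request));
    broadcastToRoom(roomId, saved);
}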
 
// Thread pool config:
@Bean("websocketTaskExecutor")
public Executor websocketTaskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(10);
    executor.setMaxPoolSize(50);
    executor.setQueueCapacity(500);
    executor.setThreadNamePrefix("ws-async-");
    executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
    executor.initialize();
    return executor;
}

Pitfall 7: Not Handling Large Messages

What happens:

A client sends a 10 MB image as a base64-encoded string in a WebSocket message
(about 13 MB after encoding, since base64 adds roughly a third).
Spring tries to buffer the entire message in memory.
Server has 20 concurrent large message senders = more than 200 MB of buffers.
Memory exhaustion. OOM. Server crashes.

Fix:

// Set message size limits in WebSocketConfig
@Override
public void configureWebSocketTransport(WebSocketTransportRegistration registration) {
    registration
        .setMessageSizeLimit(64 * 1024)        // 64 KB max per message
        .setSendBufferSizeLimit(512 * 1024)     // 512 KB send buffer per connection
        .setSendTimeLimit(20 * 1000);           // 20 second send timeout
}
 
// For large content (files, images): use presigned S3 URLs instead of WebSocket
// Pattern:
// 1. Client requests presigned S3 upload URL via REST API
// 2. Client uploads file directly to S3
// 3. Client sends WebSocket message with S3 key reference
// 4. Server broadcasts the S3 URL to room members
// This keeps WebSocket messages small and fast.

2. Anti-Patterns to Avoid

Anti-Pattern 1: Using WebSocket for Everything

The mistake: Team adopts WebSocket for ALL communication because
"it's faster and more modern."

Reality:
- WebSocket messages cannot be cached by a CDN or browser.
- Content delivered only over WebSocket is not indexed by search engines.
- WebSocket does not work well for file downloads.
- Load balancing is harder and more expensive.
- REST APIs are simpler, cacheable, stateless.

Rule: Use WebSocket ONLY for communication that genuinely needs:
1. Sub-second real-time push from server
2. Bidirectional communication
3. High-frequency updates

Use REST/HTTP for: CRUD operations, file uploads, authentication, search.

Anti-Pattern 2: Storing Application State Only in WebSocket Sessions

The mistake:
// Storing user's shopping cart in WebSocket session attributes
headerAccessor.getSessionAttributes().put("cart", userCart);

Problem: WebSocket sessions are volatile.
- Server restart: all sessions lost, all carts lost.
- Network drop: session lost.
- User refreshes browser: new session, cart gone.
- Horizontal scaling: session is on Server 1, user reconnects to Server 2.

Rule: WebSocket session attributes should hold ONLY transient state
(current room subscriptions, last seen timestamps).
Business state (cart, preferences) must live in DB or distributed cache.

Anti-Pattern 3: One Topic Per User for Broadcast

The mistake: Creating individual user topics for "broadcasts":

// Broadcasting to 10,000 users
for (String userId : allUserIds) {
    messagingTemplate.convertAndSend("/topic/user." + userId, event);
}

Problems:
- Creates 10,000 topic entries.
- Iterating in a loop blocks the calling thread.
- Adds 10,000 Redis pub/sub publishes for a single event.

Correct approach: Use a shared topic that all users subscribe to.
// Server broadcasts once:
messagingTemplate.convertAndSend("/topic/announcements", event);

// All clients subscribed to /topic/announcements receive it.
// This is one Redis publish regardless of user count.

Anti-Pattern 4: No Rate Limiting

The mistake: Any connected user can send unlimited messages.

Attack:
1. Attacker connects 100 WebSocket sessions.
2. Each session sends 1000 messages per second.
3. 100,000 messages/second floods the server.
4. Legitimate users experience timeouts.

Fix: Rate limit by userId AND by sessionId.
A user with 3 tabs open should have a shared rate limit, not 3x the limit.
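
One way to get a shared limit is a token bucket keyed by user ID. The sketch below is a hypothetical in-memory version (UserRateLimiter and its parameters are illustrative names, not part of Spring). In a cluster the counters would have to live in a shared store such as Redis so that all instances see the same budget.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: one token bucket per key, e.g. per user ID. */
class UserRateLimiter {
    private static final class Bucket {
        double tokens;
        long lastRefillMillis;

        Bucket(double tokens, long nowMillis) {
            this.tokens = tokens;
            this.lastRefillMillis = nowMillis;
        }
    }

    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();
    private final double capacity;        // maximum burst size
    private final double refillPerSecond; // sustained messages per second

    UserRateLimiter(double capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
    }

    /** Returns true if the message may be processed, false if it should be rejected. */
    boolean tryAcquire(String key, long nowMillis) {
        Bucket bucket = buckets.computeIfAbsent(key, k -> new Bucket(capacity, nowMillis));
        synchronized (bucket) {
            // Refill in proportion to elapsed time, never above capacity
            double elapsedSeconds = Math.max(0, nowMillis - bucket.lastRefillMillis) / 1000.0;
            bucket.tokens = Math.min(capacity, bucket.tokens + elapsedSeconds * refillPerSecond);
            bucket.lastRefillMillis = nowMillis;
            if (bucket.tokens >= 1.0) {
                bucket.tokens -= 1.0;
                return true;
            }
            return false;
        }
    }
}
```

Keying by user ID makes all of a user's tabs draw from the same bucket. A second instance keyed by session ID, with its own limits, can cap any single connection.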

Anti-Pattern 5: Sending Entire Domain Objects Over WebSocket

The mistake:
// Sending the entire User entity to avoid building a DTO
messagingTemplate.convertAndSend("/topic/users", userEntity);

Problems:
- Exposes sensitive fields (password hash, internal IDs, DB version fields).
- Sends more data than needed, wasting bandwidth.
- Tight coupling: changing the DB schema breaks the WebSocket API.
- Large objects increase serialization time.

Fix: Always use dedicated DTOs for WebSocket messages.
Control exactly what fields are exposed.

Anti-Pattern 6: Using In-Memory Simple Broker in Production Cluster

The mistake:
registry.enableSimpleBroker("/topic", "/queue");

The simple broker is in-memory. It only knows about clients connected
to THIS server instance.

In a cluster of 3 servers:
- User A on Server 1 sends to /topic/room.123
- Simple broker on Server 1 delivers to Server 1 clients only
- Users on Server 2 and Server 3 NEVER receive the message

Fix: Add Redis relay (Part 4) or use RabbitMQ/ActiveMQ as a full broker.

Anti-Pattern 7: Ignoring the WebSocket Connection as a Security Boundary

The mistake: Validating identity only at connection time, never again.

Timeline:
T+0:   User A authenticates. JWT valid. WebSocket connection opens.
T+1h:  JWT expires.
T+1h+: User A's account is banned (fraud detection).
       But WebSocket connection is still open.
       User A can still send messages indefinitely.

Fix: Validate token and authorization on every STOMP SEND frame.
Implement token refresh over WebSocket or disconnect on expiry.
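
The per-frame check can be reduced to a small decision function. The following is a sketch under stated assumptions: FrameAuthorizer, its Decision enum, and the in-memory ban list are hypothetical, and the token expiry is assumed to have been extracted from the JWT at connect time. In Spring, a check like this would typically be called from a ChannelInterceptor registered on the client inbound channel.

```java
import java.time.Instant;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: re-validates a session on every inbound frame. */
class FrameAuthorizer {
    enum Decision { ALLOW, TOKEN_EXPIRED, USER_BANNED }

    // Stand-in for a shared store (DB or cache) of banned accounts
    private final Set<String> bannedUserIds = ConcurrentHashMap.newKeySet();

    void ban(String userId) {
        bannedUserIds.add(userId);
    }

    /** Decides whether a frame from this user may be processed right now. */
    Decision check(String userId, Instant tokenExpiry, Instant now) {
        if (bannedUserIds.contains(userId)) {
            return Decision.USER_BANNED;
        }
        if (!now.isBefore(tokenExpiry)) {
            return Decision.TOKEN_EXPIRED;
        }
        return Decision.ALLOW;
    }
}
```

Anything other than ALLOW would lead to rejecting the frame and closing the session, or to prompting the client to refresh its token.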

3. Real Production Challenges and Solutions

Challenge 1: Reconnection Storms After Deployment

Problem: Blue/green deployment in ECS tears down old tasks. All connected clients disconnect and reconnect in a wave. New tasks get slammed.

Solution Strategy:

1. Pre-warm new tasks: Start new tasks and wait for them to be healthy
   (health check passing) before deregistering old tasks.

2. Deregister old tasks gradually:
   - Remove 1 old task from ALB target group
   - Wait 60 seconds (clients reconnect to remaining tasks)
   - Remove next old task
   - Repeat

3. Send server shutdown notice before deregistering:
   messagingTemplate.convertAndSend("/topic/server.events",
       new ShutdownNotice("Reconnect in 10-30 seconds"));

4. Client uses jittered exponential backoff (see Pitfall 2).

5. Result: Reconnection spread over 30-60 seconds instead of 1 second.

Challenge 2: Message Ordering Guarantees

Problem: User sends messages M1, M2, M3. Due to async processing, M3 arrives at Client B before M1.

Solution:

// Strategy 1: Sequence numbers
// Each message gets a monotonically increasing sequence number per room.
// Client displays messages in sequence order, not arrival order.
// Uses Redis INCR for atomic sequence generation.
 
long sequence = redisTemplate.opsForValue().increment("seq:room:" + roomId);
message.setSequence(sequence);
 
// Client buffers out-of-order messages and renders in sequence order.
 
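// Strategy 2: Timestamp-based ordering
// Use server timestamp (not client) for ordering.
// Client sorts received messages by server timestamp.
// Caveat: clocks differ between server instances and timestamps can tie,
// so prefer sequence numbers when strict order matters.
 
// Strategy 3: Serialized processing per room
// Use a single-threaded executor per room so messages
// for a room are always processed in submission order on this instance.
// Caveat: this creates one thread per active room and only orders messages
// handled by the same server; shut down executors for rooms that go idle.
Map<String, ExecutorService> roomExecutors = new ConcurrentHashMap<>();
ExecutorService executor = roomExecutors.computeIfAbsent(
    roomId, k -> Executors.newSingleThreadExecutor()
);
executor.submit(() -> processAndBroadcast(userId, roomId, request));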
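
The client-side buffering mentioned in Strategy 1 could look roughly like this. It is a sketch, not a complete client: SequenceReorderBuffer is a hypothetical name, payloads are plain strings, and sequence numbers are assumed to start at 1 (the first value Redis INCR returns for a new key). It is shown in Java for consistency with the server examples; the same logic ports directly to a JavaScript client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical sketch: releases messages strictly in sequence order. */
class SequenceReorderBuffer {
    private final TreeMap<Long, String> pending = new TreeMap<>();
    private long nextExpected;

    SequenceReorderBuffer(long firstSequence) {
        this.nextExpected = firstSequence;
    }

    /** Accepts one message and returns every message that is now deliverable, in order. */
    List<String> accept(long sequence, String payload) {
        List<String> ready = new ArrayList<>();
        if (sequence < nextExpected) {
            return ready; // duplicate of something already delivered
        }
        pending.putIfAbsent(sequence, payload);
        // Drain the buffer while the next expected sequence number is present
        while (!pending.isEmpty() && pending.firstKey() == nextExpected) {
            ready.add(pending.pollFirstEntry().getValue());
            nextExpected++;
        }
        return ready;
    }
}
```

A real client also needs a timeout that gives up on a missing sequence number (for example by refetching history over REST), otherwise one lost message stalls the room indefinitely.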

Challenge 3: Handling Massive Fan-Out

Problem: A system event needs to be delivered to 1 million connected users simultaneously (e.g., "system maintenance in 5 minutes").

Solution:

Naive approach:
  for (String userId : allMillionUsers) {
      messagingTemplate.convertAndSendToUser(userId, "/queue/announce", event);
  }
  // This loops 1 million times. Takes minutes. Blocks everything.

Correct approach:
  // All users subscribe to a shared topic
  messagingTemplate.convertAndSend("/topic/announcements", event);
  // One Redis publish. All servers deliver to their local clients.
  // Publish cost grows with server count, not user count;
  // each server still writes to each of its own local sockets.

For user-segmented broadcasts (e.g., all users in "premium" tier):
  // Maintain separate topic per segment
  messagingTemplate.convertAndSend("/topic/tier.premium", event);
  // Premium users subscribe to /topic/tier.premium at login

Challenge 4: Database Connection Exhaustion

Problem: 10,000 WebSocket connections, each processing messages concurrently, all hit the DB simultaneously. DB connection pool (max 20) is exhausted. Messages queue up. Timeouts cascade.

Solution:

1. Separate the WebSocket thread pool from the DB processing pool.
   WebSocket threads should return quickly (queue work, don't process inline).

2. Use a bounded work queue:
   ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
   executor.setCorePoolSize(10);
   executor.setMaxPoolSize(20);       // Matches DB connection pool size
   executor.setQueueCapacity(1000);   // Queue up to 1000 pending DB operations
   executor.setRejectedExecutionHandler(
       new ThreadPoolExecutor.CallerRunsPolicy()  // Backpressure to WebSocket layer
   );

3. Consider batching: collect N messages, write in one DB transaction.
   Reduces DB transactions from 10,000/s to 100/s (batches of 100).

4. Use connection pool monitoring:
   management.endpoints.web.exposure.include=health,metrics
   // HikariCP pool gauges are then available under /actuator/metrics,
   // e.g. hikaricp.connections.active and hikaricp.connections.pending
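
The batching idea in step 3 can be sketched as a small buffer that hands full batches to a writer callback. MessageBatcher and the batchWriter callback are hypothetical names; in practice the callback would perform one multi-row insert inside a single transaction, and a timer would call flush() so that a quiet period does not hold messages indefinitely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch: groups items so that N single writes become one batch write. */
class MessageBatcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> batchWriter;
    private List<T> buffer = new ArrayList<>();

    MessageBatcher(int batchSize, Consumer<List<T>> batchWriter) {
        this.batchSize = batchSize;
        this.batchWriter = batchWriter;
    }

    /** Buffers one item and writes the batch once it is full. */
    synchronized void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Writes whatever is buffered; intended to be called by a timer as well. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<T> batch = buffer;
        buffer = new ArrayList<>();
        // Simplification: the lock is held during the write, which serializes writers
        batchWriter.accept(batch);
    }
}
```

With a batch size of 100, 10,000 incoming messages per second produce 100 writer calls per second. The trade-off is latency: if messages are broadcast only after the DB commit (Pitfall 4), each one waits until its batch is flushed.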

Challenge 5: WebSocket Behind Nginx or Corporate Proxy

Problem: Some corporate firewalls and proxies do not forward WebSocket upgrade headers. The handshake then fails (clients typically see an HTTP 400 response) and the WebSocket connection never opens.

Solution:

1. Use SockJS as fallback (Spring WebSocket supports this natively).
   SockJS detects WebSocket support and falls back to:
   - HTTP long-polling
   - HTTP streaming
   These work through almost all proxies.

2. Configure Nginx to pass WebSocket upgrade:
   location /ws {
       proxy_pass http://backend;
       proxy_http_version 1.1;
       proxy_set_header Upgrade $http_upgrade;
       proxy_set_header Connection "upgrade";
       proxy_set_header Host $host;
       proxy_read_timeout 3600s;    # Keep connection alive
       proxy_send_timeout 3600s;
   }

3. Use port 443 (WSS) - TLS hides the upgrade handshake from intermediaries,
   and almost no firewall blocks HTTPS port 443.
   Plain WS on port 80 often breaks when a proxy strips the Upgrade header.

4. Trade-Off Analysis

WebSocket vs SSE - Detailed Trade-offs

Factor	WebSocket	SSE
Direction	Bidirectional	Server to client only
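Protocol	WebSocket (RFC 6455), started by an HTTP/1.1 Upgrade	Plain HTTP/1.1 or HTTP/2
Browser reconnect	Must implement in code	Built-in automatic
Load balancing	Long-lived connection pinned to one server (sticky sessions needed for SockJS fallback)	Long-lived response, but any server can handle a reconnect
HTTP/2 multiplexing	Rarely - usually a separate TCP connection (WebSocket over HTTP/2, RFC 8441, has limited server support)	Yes - multiple SSE streams on one HTTP/2 connection
Proxy/firewall compatibility	Sometimes blocked (need SockJS)	Usually works (it's HTTP), though buffering proxies can delay events
Server implementation complexity	High	Low
Client implementation complexity	Medium	Low (native EventSource API)
Throughput	Very high	High
Latency	Very low (small per-frame overhead)	Low; comparable for server push once the stream is open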
Text vs Binary	Both	Text only (binary needs base64)
CDN compatibility	No	Partial

WebSocket vs Message Queue (Kafka/SQS)

WebSocket is NOT a replacement for a message queue. They solve different problems.

Message Queue (Kafka, SQS):
- Durable message storage
- Consumer groups and replay
- Decoupling of producers and consumers
- Works even when consumers are offline
- Used for: async processing, event sourcing, microservice communication

WebSocket:
- Ephemeral, real-time delivery
- Connected clients only
- No persistence built in
- Used for: real-time UI updates, live notifications

Real-world integration pattern:
  1. Order service publishes to Kafka: "ORDER_SHIPPED" event
  2. Notification service consumes from Kafka
  3. Notification service pushes via WebSocket to connected clients
  4. Notification service saves to DB for offline users

In-Memory Broker vs Redis vs RabbitMQ

Aspect	Simple (In-Memory) Broker	Redis Pub/Sub Relay	RabbitMQ / ActiveMQ
Setup complexity	None	Low	High
Multi-instance support	No - single server only	Yes	Yes
Message persistence	No	No (pub/sub is ephemeral)	Yes (queues are durable)
Throughput	Very high (no network)	High (network hop to Redis)	Medium (STOMP overhead)
Features	Basic	Basic	Full (acks, dead letter, routing)
Operational overhead	None	Low	High
Use case	Development, single instance	Production multi-instance	Enterprise, durability needed

5. Architectural Decision Framework

Decision 1: Do You Need WebSocket?

START HERE: What is the update frequency?

Less than once per minute  --> Simple polling (REST GET every 30s)
Once per minute to 1/s     --> Long polling or SSE
Multiple times per second  --> WebSocket or SSE

Does the client need to SEND data back in real time?
YES --> WebSocket
NO  --> SSE (simpler, stateless, easier to scale)

Is this a chat / gaming / collaborative tool?
YES --> WebSocket
NO  --> Evaluate SSE first

Decision 2: Single Server vs Clustered

Expected peak concurrent connections?

< 1,000 connections  --> Single server with in-memory broker
                         Simplest. No Redis needed.
                         Spring Boot handles 10,000+ connections on one instance
                         with proper JVM tuning.

1,000 - 100,000      --> Multi-instance with Redis Pub/Sub
                         2-10 ECS tasks + ElastiCache Redis
                         This covers most production applications.

> 100,000            --> Consider dedicated WebSocket infrastructure:
                         - AWS API Gateway WebSocket API (fully managed)
                         - Purpose-built WS servers (Netty-based)
                         - Event-driven architecture (no long-lived connections)

Decision 3: AWS API Gateway WebSocket vs Self-Managed

AWS API Gateway WebSocket API:
  Pros:
  - Fully managed - no servers to maintain
  - Scales automatically to millions of connections
  - Pay per connection/message
  - Built-in routing and integration with Lambda, HTTP backends
  Cons:
  - Higher cost at moderate scale (vs ECS)
  - Limited STOMP/SockJS support (uses custom routing keys)
  - Cold start latency with Lambda
  - Complex fan-out (need to store connectionIds and iterate)

Self-managed (ECS + Spring Boot + Redis):
  Pros:
  - Full control over protocol and behavior
  - STOMP/SockJS support
  - Predictable cost at scale
  - Existing Spring expertise applies
  Cons:
  - Operational overhead (ECS, Redis, ALB)
  - Must implement scaling, health checks, graceful shutdown

Decision:
  - Startups / moderate scale / Spring expertise --> ECS + Spring Boot + Redis
  - Very large scale / serverless preference --> API Gateway WebSocket + Lambda
  - Enterprise with RabbitMQ/ActiveMQ already running --> StompBrokerRelay

6. Load Testing WebSocket Systems

Tools

Tool	Language	Best For
Gatling	Scala/Java	Large-scale WS load tests, CI integration
Artillery.io	Node.js	WebSocket and Socket.IO load testing with YAML scenarios
JMeter (WS plugin)	Java	Existing JMeter users
k6	JavaScript	Modern, cloud-native load tests
Locust	Python	Custom scenarios, easy scripting

Artillery Load Test Configuration

# artillery-websocket-test.yml
 
config:
  target: "wss://your-server.com"
  phases:
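    # Note: arrivalRate is NEW virtual users per second, not concurrent users.
    # Each user in this scenario stays connected for about 5 minutes,
    # so concurrency is far higher than the arrival rate.
    # Warm-up: 2 new users per second for 60 seconds
    - duration: 60
      arrivalRate: 2
      name: "Warm up"
    # Sustained load: 100 new users per second for 5 minutes
    - duration: 300
      arrivalRate: 100
      name: "Sustained load"
    # Spike test: 500 new users per second for 1 minute
    - duration: 60
      arrivalRate: 500
      name: "Spike test"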
 
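  engines:
    socketio:
      # Socket.IO engine. Socket.IO is not wire-compatible with SockJS or STOMP,
      # so this scenario applies only to a Socket.IO endpoint; adapt it before
      # pointing it at the STOMP endpoint built in this series.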
 
  defaults:
    headers:
      Authorization: "Bearer {{ $processEnvironment.TEST_TOKEN }}"
 
scenarios:
  - name: "Chat user simulation"
    engine: socketio
    flow:
      - emit:
          channel: "connect"
          data:
            token: "{{ $processEnvironment.TEST_TOKEN }}"
      - think: 2
      - emit:
          channel: "message"
          data:
            roomId: "load-test-room"
            content: "Load test message {{ $uuid }}"
      - think: 5
      - loop:
          - emit:
              channel: "message"
              data:
                roomId: "load-test-room"
                content: "Periodic message {{ $loopCount }}"
          - think: 10
        count: 30

Key Metrics to Monitor During Load Test

During load test, monitor:

Server metrics:
  - Active WebSocket connections (should scale linearly with users)
  - Message throughput (messages/second sent and received)
  - Message processing latency (P50, P95, P99)
  - Thread pool utilization (should not hit 100%)
  - Heap memory usage (watch for leaks)
  - GC pause times (G1GC should keep pauses < 200ms)

Database metrics:
  - DB connection pool utilization
  - Query latency
  - Active connections

Redis metrics:
  - Pub/sub message rate
  - Memory usage
  - Latency (should be sub 1ms)

Network metrics:
  - ALB active connections
  - ALB processed bytes
  - Target response time

7. Debugging Guide

Debugging Connection Issues

# Enable Spring WebSocket debug logging
# In application.yml:
logging:
  level:
    org.springframework.web.socket: DEBUG
    org.springframework.messaging: DEBUG
    org.springframework.web.socket.handler: TRACE
 
# This logs every frame, subscription, and connection event.
# WARNING: Very verbose. Enable only for debugging, never in production.

Debugging Redis Pub/Sub

# Monitor all Redis activity in real time
redis-cli MONITOR
 
# Check all subscriptions currently active
redis-cli PUBSUB CHANNELS "*"
# Expected output: ws:* channels for each active topic
 
# Check subscriber count per channel
redis-cli PUBSUB NUMSUB ws:/topic/room.room1
 
# Publish a test message manually
redis-cli PUBLISH ws:/topic/test "Hello"

Debugging Stale Connections

-- Check MySQL for signs of WebSocket-related issues
 
-- Messages stuck in SENT status (not delivered) might indicate
-- connection issues
SELECT status, COUNT(*) as count
FROM chat_messages
WHERE created_at > DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY status;
 
-- Users active recently but with large number of unread notifications
-- might indicate delivery failures
SELECT user_id, COUNT(*) as unread_count
FROM notifications
WHERE is_read = 0
  AND created_at > DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY user_id
ORDER BY unread_count DESC
LIMIT 20;

Reading WebSocket Frames in Browser DevTools

Chrome DevTools > Network > Filter: WS > Click your WebSocket connection
> Messages tab

You will see:
- Green arrows (outgoing from browser to server)
- Red/White arrows (incoming from server to browser)

Each message shows:
- Timestamp
- Frame type (text/binary/ping/pong/close)
- Payload length
- Payload content

For STOMP messages, the format is:
SEND
destination:/app/chat.send/room1
content-type:application/json

{"content":"Hello"}
^@  (null byte frame terminator)

Look for:
- ERROR frames from server (configuration issues, auth failures)
- RECEIPT frames if your client sends frames with a receipt header
- Unusual disconnect patterns (CLOSE frames with non-1000 status codes)

8. Industry Practices Summary

What Companies Actually Do in Production

Practice	Industry Standard
Protocol	STOMP over WSS for most applications. Raw WebSocket for gaming.
Fallback	SockJS for enterprise deployments with proxy concerns
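Authentication	JWT in a cookie, query param, or STOMP CONNECT header. Not in custom HTTP handshake headers (the browser WebSocket API cannot set them). Query-param tokens can leak into access logs, so keep them short-lived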
Scaling	Redis Pub/Sub for most. RabbitMQ for durability. API Gateway for serverless
Persistence	Always save messages to DB before broadcasting
Presence	Redis SET with TTL. Not in-memory (multi-instance)
Rate limiting	Per user, per connection, combined
Reconnection	Exponential backoff with jitter on client side
Monitoring	Active connections, message rate, error rate, latency P99
Security	Origin check, token validation, rate limiting, message size limit
Heartbeat	25-30 second intervals (standard)
Load balancer	ALB with sticky sessions + Redis for cross-server delivery
Message delivery	At-least-once via Redis sorted set buffer + client-side deduplication

What Netflix / Slack / Discord Do

Slack:
- Uses a custom protocol built on WebSocket
- Separate connection infrastructure (not part of main app servers)
- "Presence" is a dedicated microservice
- Message queue (Kafka) feeds WebSocket delivery service
- Offline message buffering in Redis

Discord:
- Elixir-based gateway servers on the Erlang VM (not JVM) for performance at scale
- ETF (Erlang Term Format) encoding for lower bandwidth
- Heartbeat: interval is set by the server in its hello message (about 41.25 seconds), with client-side jitter on the first beat
- Presence is "lazy" - not propagated in real time but polled

Netflix:
- WebSocket for watch party and social features
- Server-Sent Events for most notification use cases
- Zuul, Netflix's own edge gateway, handles WebSocket and SSE connections at the edge
- Strong preference for SSE where bidirectionality not needed

Lessons:
1. Separate WebSocket servers from application servers at scale.
2. Never mix WebSocket traffic and REST traffic on the same instance at scale.
3. Presence is hard. Invest in it separately.
4. At true scale, Spring Boot WebSocket is often replaced by custom solutions.
   A well-tuned Spring Boot instance can still hold tens of thousands of connections;
   the exact ceiling depends on message rate and payload size, so load test it (section 6).

Next: Part 6 - Interview Questions
