WebSockets Demystified - Part 5: Pitfalls, Trade-offs, and Anti-Patterns
Table of Contents
- Critical Production Pitfalls
- Anti-Patterns to Avoid
- Real Production Challenges and Solutions
- Trade-Off Analysis
- Architectural Decision Framework
- Load Testing WebSocket Systems
- Debugging Guide
- Industry Practices Summary
1. Critical Production Pitfalls
Pitfall 1: The ALB Idle Timeout Trap
What happens:
Symptom: WebSocket connections silently drop after exactly 60 seconds of inactivity.
Users report "chat disconnects randomly" or "dashboard stops updating."
Root cause: AWS ALB default idle timeout is 60 seconds.
The ALB closes any connection on which no bytes flow in either direction for that long.
If heartbeats are disabled, misconfigured, or sent less often than the idle timeout, a quiet connection is dropped.
Fix:
Keep the heartbeat interval well below the idle timeout (25 seconds against 60 works),
and raise the ALB idle timeout to 3600 seconds (1 hour) for headroom.
The ALB considers a connection idle only when NO TCP data flows.
WebSocket ping/pong frames and STOMP heartbeats both count as data and reset the idle timer.
AWS Console: EC2 > Load Balancers > your-alb > Attributes > Idle timeout
Terraform: idle_timeout = 3600 in aws_lb resource.
Pitfall 2: The Thundering Herd on Reconnect
What happens:
Scenario: A server instance is restarted (deployment, crash, or scale-down).
All 5,000 clients connected to it simultaneously disconnect.
All 5,000 clients have reconnection logic: "reconnect after 1 second."
All 5,000 clients reconnect at exactly the same time.
Your remaining servers receive 5,000 connection requests at T+1 second.
This spike overwhelms servers. They become slow/unresponsive.
More connections fail. More reconnects. Cascade failure.
Fix: Exponential Backoff with Jitter
// JavaScript client reconnection
function reconnect(attemptNumber) {
  const baseDelay = 1000;
  const maxDelay = 60000;
  // Exponential backoff: 1s, 2s, 4s, 8s, 16s, ... capped at 60s
  const exponentialDelay = Math.min(baseDelay * Math.pow(2, attemptNumber), maxDelay);
  // Add random jitter: spread reconnects over a window
  // Without jitter: 5000 clients all reconnect at exactly T+1s
  // With jitter: 5000 clients reconnect uniformly over T+1s to T+2s
  const jitter = Math.random() * 1000;
  const totalDelay = exponentialDelay + jitter;
  setTimeout(() => connect(), totalDelay);
}
Pitfall 3: Memory Leak from Zombie Sessions
What happens:
Scenario: A mobile user's network drops suddenly (enters subway, phone dies).
No TCP FIN is sent (ungraceful disconnect).
Server's TCP stack keeps the connection open (TCP half-open).
Server's WebSocket session map still holds the session.
Over hours, thousands of zombie sessions accumulate.
Server memory grows until OOM crash.
Fix: Heartbeat + Session Cleanup
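// In WebSocketConfig.java - configure heartbeat
// The simple broker only sends heartbeats when a TaskScheduler is set.
// heartbeatScheduler is assumed to be a TaskScheduler bean defined elsewhere.
registry.enableSimpleBroker("/topic", "/queue")
    .setHeartbeatValue(new long[]{25000, 25000}) // {server sends every, server expects from client every}, in ms
    .setTaskScheduler(heartbeatScheduler);
// Heartbeat behavior:
// 1. Server sends a heartbeat to each client every 25 seconds.
// 2. Heartbeats are not request/response. If nothing at all arrives from a client
//    for a multiple of the 25-second interval, the broker closes the session.
//    (Spring's simple broker allows a 3x error margin, about 75 seconds; verify for your version.)
// 3. The SessionDisconnectEvent fires.
// 4. PresenceService.markOffline() cleans up Redis.
// 5. Memory is freed.
// Result: Zombie sessions are detected and cleaned up within a minute or two instead of accumulating for hours.
Pitfall 4: Not Persisting Messages Before Broadcasting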
What happens:
Scenario:
1. User A sends a message.
2. Service broadcasts it to /topic/room.123 via WebSocket.
3. Service tries to save to MySQL - DB is temporarily down.
4. Exception thrown. Save fails.
5. The message was shown to everyone in the room but never stored.
This leaves clients and the database inconsistent: connected clients saw the message, the DB doesn't have it.
On page reload, the message is gone. Users are confused.
Fix: DB first, then broadcast
@Transactional
public void processAndBroadcast(String userId, String roomId, ChatMessageRequest request) {
    // Step 1: Save to DB FIRST (inside the transaction)
    ChatMessage saved = messageRepository.save(buildMessage(userId, roomId, request));
    // Step 2: After successful save, broadcast
    // Use @TransactionalEventListener to ensure DB is committed BEFORE broadcasting
    eventPublisher.publishEvent(new MessageSavedEvent(this, saved));
}
@TransactionalEventListener(phase = TransactionPhase.AFTER_COMMIT)
public void onMessageSaved(MessageSavedEvent event) {
    // This fires ONLY AFTER the DB transaction committed successfully
    // If DB save failed, this never fires, and no broadcast happens
    ChatMessageResponse response = buildResponse(event.getMessage());
    messagingTemplate.convertAndSend("/topic/room." + event.getMessage().getRoomId(), response);
}
Pitfall 5: Wildcard Origin - A Security Hole
What happens:
Anti-pattern:
registry.addEndpoint("/ws")
.setAllowedOrigins("*") // NEVER DO THIS IN PRODUCTION
This allows ANY website to open a WebSocket connection to your server
with a logged-in user's credentials (cookies) attached. The attack is known as
Cross-Site WebSocket Hijacking (CSWSH), the WebSocket equivalent of CSRF.
Example attack:
1. Evil website at evil.com has: new WebSocket("wss://bank.com/ws")
2. Browser sends cookies (session, token) with the request.
3. Connection is accepted because origins are not checked.
4. Attacker has full WebSocket access to bank.com on behalf of the victim.
Fix:
// Always specify exact origins
registry.addEndpoint("/ws")
    .setAllowedOrigins(
        "https://app.example.com",
        "https://www.example.com"
    );
// For local development only:
// registry.addEndpoint("/ws").setAllowedOriginPatterns("http://localhost:*")
Pitfall 6: Blocking the WebSocket Thread with Slow DB Operations
What happens:
@MessageMapping("/chat.send/{roomId}")
public void sendMessage(@Payload ChatMessageRequest request, Principal principal) {
    // BAD: This runs on the WebSocket thread pool
    // If this DB call takes 500ms, the thread is blocked
    // With 20 threads in the pool, only 20 simultaneous messages can be processed
    // 21st message waits. 100th message times out.
    chatMessageRepository.save(buildMessage(request, principal.getName()));
    broadcastToRoom(request.getRoomId(), request);
}
Fix: Use Async Processing
@MessageMapping("/chat.send/{roomId}")
public void sendMessage(
        @Payload ChatMessageRequest request,
        Principal principal) {
    // Queue the work to be processed asynchronously
    // The WebSocket thread returns immediately
    chatService.processAsync(principal.getName(), request.getRoomId(), request);
}
// In ChatService (requires @EnableAsync on a configuration class):
@Async("websocketTaskExecutor")
public void processAsync(String userId, String roomId, ChatMessageRequest request) {
    // This runs on a separate thread pool, not the WebSocket thread
    ChatMessage saved = messageRepository.save(buildMessage(userId, roomId, request));
    broadcastToRoom(roomId, saved);
}
// Thread pool config:
@Bean("websocketTaskExecutor")
public Executor websocketTaskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(10);
    executor.setMaxPoolSize(50);
    executor.setQueueCapacity(500);
    executor.setThreadNamePrefix("ws-async-");
    // When pool and queue are both full, the task runs on the calling (WebSocket) thread as backpressure
    executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
    executor.initialize();
    return executor;
}
Pitfall 7: Not Handling Large Messages
What happens:
A client sends a 10 MB image as a base64-encoded string in a WebSocket message.
Spring tries to buffer the entire 10 MB message in memory.
Server has 20 concurrent large message senders = 200 MB of buffers.
Memory exhaustion. OOM. Server crashes.
Fix:
// Set message size limits in WebSocketConfig
@Override
public void configureWebSocketTransport(WebSocketTransportRegistration registration) {
    registration
        .setMessageSizeLimit(64 * 1024)      // 64 KB max per message
        .setSendBufferSizeLimit(512 * 1024)  // 512 KB send buffer per connection
        .setSendTimeLimit(20 * 1000);        // 20 second send timeout
}
// For large content (files, images): use presigned S3 URLs instead of WebSocket
// Pattern:
// 1. Client requests presigned S3 upload URL via REST API
// 2. Client uploads file directly to S3
// 3. Client sends WebSocket message with S3 key reference
// 4. Server broadcasts the S3 URL to room members
// This keeps WebSocket messages small and fast.
2. Anti-Patterns to Avoid
Anti-Pattern 1: Using WebSocket for Everything
The mistake: Team adopts WebSocket for ALL communication because
"it's faster and more modern."
Reality:
- WebSocket cannot be cached by CDN or browser.
- WebSocket responses cannot be indexed by search engines.
- WebSocket does not work well for file downloads.
- Load balancing is harder and more expensive.
- REST APIs are simpler, cacheable, stateless.
Rule: Use WebSocket ONLY for communication that genuinely needs:
1. Sub-second real-time push from server
2. Bidirectional communication
3. High-frequency updates
Use REST/HTTP for: CRUD operations, file uploads, authentication, search.
Anti-Pattern 2: Storing Application State Only in WebSocket Sessions
The mistake:
// Storing user's shopping cart in WebSocket session attributes
headerAccessor.getSessionAttributes().put("cart", userCart);
Problem: WebSocket sessions are volatile.
- Server restart: all sessions lost, all carts lost.
- Network drop: session lost.
- User refreshes browser: new session, cart gone.
- Horizontal scaling: session is on Server 1, user reconnects to Server 2.
Rule: WebSocket session attributes should hold ONLY transient state
(current room subscriptions, last seen timestamps).
Business state (cart, preferences) must live in DB or distributed cache.
Anti-Pattern 3: One Topic Per User for Broadcast
The mistake: Creating individual user topics for "broadcasts":
// Broadcasting to 10,000 users
for (String userId : allUserIds) {
    messagingTemplate.convertAndSend("/topic/user." + userId, event);
}
Problems:
- Creates 10,000 topic entries.
- Iterating in a loop blocks the calling thread.
- Adds 10,000 Redis pub/sub publishes for a single event.
Correct approach: Use a shared topic that all users subscribe to.
// Server broadcasts once:
messagingTemplate.convertAndSend("/topic/announcements", event);
// All clients subscribed to /topic/announcements receive it.
// This is one Redis publish regardless of user count.
Anti-Pattern 4: No Rate Limiting
The mistake: Any connected user can send unlimited messages.
Attack:
1. Attacker connects 100 WebSocket sessions.
2. Each session sends 1000 messages per second.
3. 100,000 messages/second floods the server.
4. Legitimate users experience timeouts.
Fix: Rate limit by userId AND by sessionId.
A user with 3 tabs open should have a shared rate limit, not 3x the limit.
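A shared per-user limit can be sketched with a token bucket keyed by userId, so every tab or session of the same user draws from one budget. This is a minimal single-instance sketch, not the article's implementation: the class name `UserRateLimiter` and its parameters are hypothetical, and a clustered deployment would need the counters in a shared store such as Redis. A second instance keyed by sessionId can be layered on top for the per-session limit.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Token-bucket limiter keyed by userId (hypothetical helper, in-memory only). */
final class UserRateLimiter {
    private static final class Bucket {
        double tokens;
        long lastRefillNanos;

        Bucket(double tokens, long nowNanos) {
            this.tokens = tokens;
            this.lastRefillNanos = nowNanos;
        }
    }

    private final double capacity;        // maximum burst size
    private final double refillPerSecond; // sustained messages per second
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    UserRateLimiter(double capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
    }

    /** Returns true if the message may be processed. The clock is passed in for testability. */
    boolean tryAcquire(String userId, long nowNanos) {
        Bucket bucket = buckets.computeIfAbsent(userId, id -> new Bucket(capacity, nowNanos));
        synchronized (bucket) {
            // Refill in proportion to elapsed time, never above capacity.
            double elapsedSeconds = (nowNanos - bucket.lastRefillNanos) / 1_000_000_000.0;
            bucket.tokens = Math.min(capacity, bucket.tokens + elapsedSeconds * refillPerSecond);
            bucket.lastRefillNanos = nowNanos;
            if (bucket.tokens >= 1.0) {
                bucket.tokens -= 1.0;
                return true;
            }
            return false;
        }
    }
}
```

A message handler would call `tryAcquire(principal.getName(), System.nanoTime())` before doing any work and drop or reject the message when it returns false.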
Anti-Pattern 5: Sending Entire Domain Objects Over WebSocket
The mistake:
// Sending the entire User entity to avoid building a DTO
messagingTemplate.convertAndSend("/topic/users", userEntity);
Problems:
- Exposes sensitive fields (password hash, internal IDs, DB version fields).
- Sends more data than needed, wasting bandwidth.
- Tight coupling: changing the DB schema breaks the WebSocket API.
- Large objects increase serialization time.
Fix: Always use dedicated DTOs for WebSocket messages.
Control exactly what fields are exposed.
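As an illustration of the DTO rule, the sketch below maps an entity to a small wire object that carries only what the client needs. The classes `UserEntity` and `UserPresenceDto` and all of their fields are hypothetical stand-ins, not types from this series.

```java
/** Stand-in for a persistence entity; every field here is hypothetical. */
final class UserEntity {
    final long id;              // internal primary key, should not leave the server
    final String publicId;      // identifier that is safe to expose
    final String displayName;
    final String passwordHash;  // sensitive
    final long version;         // optimistic-locking column

    UserEntity(long id, String publicId, String displayName, String passwordHash, long version) {
        this.id = id;
        this.publicId = publicId;
        this.displayName = displayName;
        this.passwordHash = passwordHash;
        this.version = version;
    }
}

/** The only shape that goes over the WebSocket. */
final class UserPresenceDto {
    final String userId;
    final String displayName;
    final boolean online;

    private UserPresenceDto(String userId, String displayName, boolean online) {
        this.userId = userId;
        this.displayName = displayName;
        this.online = online;
    }

    /** Copies only the whitelisted fields, so a schema change does not leak onto the wire. */
    static UserPresenceDto from(UserEntity user, boolean online) {
        return new UserPresenceDto(user.publicId, user.displayName, online);
    }
}
```

The broadcast then becomes `messagingTemplate.convertAndSend("/topic/users", UserPresenceDto.from(userEntity, true))` rather than sending the entity itself.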
Anti-Pattern 6: Using In-Memory Simple Broker in Production Cluster
The mistake:
registry.enableSimpleBroker("/topic", "/queue");
The simple broker is in-memory. It only knows about clients connected
to THIS server instance.
In a cluster of 3 servers:
- User A on Server 1 sends to /topic/room.123
- Simple broker on Server 1 delivers to Server 1 clients only
- Users on Server 2 and Server 3 NEVER receive the message
Fix: Add Redis relay (Part 4) or use RabbitMQ/ActiveMQ as a full broker.
Anti-Pattern 7: Ignoring the WebSocket Connection as a Security Boundary
The mistake: Validating identity only at connection time, never again.
Timeline:
T+0: User A authenticates. JWT valid. WebSocket connection opens.
T+1h: JWT expires.
T+1h+: User A's account is banned (fraud detection).
But WebSocket connection is still open.
User A can still send messages indefinitely.
Fix: Validate token and authorization on every STOMP SEND frame.
Implement token refresh over WebSocket or disconnect on expiry.
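One way to picture the per-frame check is a small guard that remembers each session's token expiry and consults a ban list on every send. This is a sketch under assumptions: `SessionAuthGuard` and its method names are hypothetical, times are plain epoch milliseconds, and in a Spring application the `maySend` call would typically sit in a `ChannelInterceptor.preSend` registered on the client inbound channel.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Re-checks authorization on every inbound frame (hypothetical helper). */
final class SessionAuthGuard {
    private static final class SessionAuth {
        final String userId;
        final long tokenExpiryEpochMillis;

        SessionAuth(String userId, long tokenExpiryEpochMillis) {
            this.userId = userId;
            this.tokenExpiryEpochMillis = tokenExpiryEpochMillis;
        }
    }

    private final Map<String, SessionAuth> sessions = new ConcurrentHashMap<>();
    private final Set<String> bannedUsers = ConcurrentHashMap.newKeySet();

    void onConnect(String sessionId, String userId, long tokenExpiryEpochMillis) {
        sessions.put(sessionId, new SessionAuth(userId, tokenExpiryEpochMillis));
    }

    /** Called when the client presents a fresh token over the open connection. */
    void onTokenRefresh(String sessionId, long newExpiryEpochMillis) {
        sessions.computeIfPresent(sessionId,
                (id, old) -> new SessionAuth(old.userId, newExpiryEpochMillis));
    }

    void onDisconnect(String sessionId) {
        sessions.remove(sessionId);
    }

    void ban(String userId) {
        bannedUsers.add(userId);
    }

    /** True only for a known session whose token is unexpired and whose user is not banned. */
    boolean maySend(String sessionId, long nowEpochMillis) {
        SessionAuth auth = sessions.get(sessionId);
        return auth != null
                && nowEpochMillis < auth.tokenExpiryEpochMillis
                && !bannedUsers.contains(auth.userId);
    }
}
```

When `maySend` returns false the server can either reject the frame with an ERROR or close the session, which forces the client back through authentication.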
3. Real Production Challenges and Solutions
Challenge 1: Reconnection Storms After Deployment
Problem: Blue/green deployment in ECS tears down old tasks. All connected clients disconnect and reconnect in a wave. New tasks get slammed.
Solution Strategy:
1. Pre-warm new tasks: Start new tasks and wait for them to be healthy
(health check passing) before deregistering old tasks.
2. Deregister old tasks gradually:
- Remove 1 old task from ALB target group
- Wait 60 seconds (clients reconnect to remaining tasks)
- Remove next old task
- Repeat
3. Send server shutdown notice before deregistering:
messagingTemplate.convertAndSend("/topic/server.events",
new ShutdownNotice("Reconnect in 10-30 seconds"));
4. Client uses jittered exponential backoff (see Pitfall 2).
5. Result: Reconnection spread over 30-60 seconds instead of 1 second.
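The effect of the shutdown notice in step 3 depends on each client choosing its own moment inside the announced window. The sketch below shows that choice for a Java client; `DrainReconnect` and `pickDelayMillis` are hypothetical names, and the 10-30 second window comes from the notice text above.

```java
import java.util.Random;

/** Picks a per-client reconnect delay inside an announced drain window (hypothetical helper). */
final class DrainReconnect {
    /**
     * Returns a delay in milliseconds, uniformly distributed in
     * [minSeconds * 1000, maxSeconds * 1000).
     */
    static long pickDelayMillis(int minSeconds, int maxSeconds, Random random) {
        if (minSeconds < 0 || maxSeconds < minSeconds) {
            throw new IllegalArgumentException("invalid window");
        }
        long spanMillis = (maxSeconds - minSeconds) * 1000L;
        return minSeconds * 1000L + (long) (random.nextDouble() * spanMillis);
    }
}
```

With 5,000 clients each calling `pickDelayMillis(10, 30, new Random())`, reconnects arrive at roughly 250 per second over 20 seconds instead of all at once.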
Challenge 2: Message Ordering Guarantees
Problem: User sends messages M1, M2, M3. Due to async processing, M3 arrives at Client B before M1.
Solution:
// Strategy 1: Sequence numbers
// Each message gets a monotonically increasing sequence number per room.
// Client displays messages in sequence order, not arrival order.
// Uses Redis INCR for atomic sequence generation.
long sequence = redisTemplate.opsForValue().increment("seq:room:" + roomId);
message.setSequence(sequence);
// Client buffers out-of-order messages and renders in sequence order.
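The client-side buffering in Strategy 1 can be sketched as a small reorder buffer. This is an illustrative sketch, not the series' client code: `SequenceReorderBuffer` is a hypothetical name, payloads are plain strings, and a real client would also need a gap timeout (for example, fetching missing messages over REST) in case a sequence number never arrives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Releases messages strictly in sequence order and holds back anything that arrives early. */
final class SequenceReorderBuffer {
    private final TreeMap<Long, String> pending = new TreeMap<>();
    private long nextExpected;

    /** firstSequence is the first number the server will assign (Redis INCR on a new key yields 1). */
    SequenceReorderBuffer(long firstSequence) {
        this.nextExpected = firstSequence;
    }

    /** Accepts one message and returns every message that is now deliverable, in order. */
    List<String> accept(long sequence, String payload) {
        List<String> ready = new ArrayList<>();
        if (sequence < nextExpected) {
            return ready; // duplicate of something already rendered
        }
        pending.putIfAbsent(sequence, payload);
        // Drain the contiguous run starting at nextExpected.
        while (!pending.isEmpty() && pending.firstKey() == nextExpected) {
            ready.add(pending.pollFirstEntry().getValue());
            nextExpected++;
        }
        return ready;
    }
}
```

If M3 arrives first it is held; once M1 and then M2 arrive, the buffer releases M1, then M2 and M3 together.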
// Strategy 2: Timestamp-based ordering
// Use server timestamp (not client) for ordering.
// Client sorts received messages by server timestamp.
// Strategy 3: Total order per room (serial processing)
// Use a single-threaded executor per room so messages
// for a room are always processed in order.
// Note: this creates one thread per active room. Shut down and remove
// the executor when a room goes idle, or threads accumulate.
Map<String, ExecutorService> roomExecutors = new ConcurrentHashMap<>();
ExecutorService executor = roomExecutors.computeIfAbsent(
    roomId, k -> Executors.newSingleThreadExecutor()
);
executor.submit(() -> processAndBroadcast(userId, roomId, request));
Challenge 3: Handling Massive Fan-Out
Problem: A system event needs to be delivered to 1 million connected users simultaneously (e.g., "system maintenance in 5 minutes").
Solution:
Naive approach:
for (String userId : allMillionUsers) {
    messagingTemplate.convertAndSendToUser(userId, "/queue/announce", event);
}
// This loops 1 million times. Takes minutes. Blocks everything.
Correct approach:
// All users subscribe to a shared topic
messagingTemplate.convertAndSend("/topic/announcements", event);
// One Redis publish. All servers deliver to their local clients.
// Publish cost stays constant as users grow; delivery work is spread across servers.
For user-segmented broadcasts (e.g., all users in "premium" tier):
// Maintain separate topic per segment
messagingTemplate.convertAndSend("/topic/tier.premium", event);
// Premium users subscribe to /topic/tier.premium at login
Challenge 4: Database Connection Exhaustion
Problem: 10,000 WebSocket connections, each processing messages concurrently, all hit the DB simultaneously. DB connection pool (max 20) is exhausted. Messages queue up. Timeouts cascade.
Solution:
1. Separate the WebSocket thread pool from the DB processing pool.
WebSocket threads should return quickly (queue work, don't process inline).
2. Use a bounded work queue:
ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
executor.setCorePoolSize(10);
executor.setMaxPoolSize(20); // Matches DB connection pool size
executor.setQueueCapacity(1000); // Queue up to 1000 pending DB operations
executor.setRejectedExecutionHandler(
new ThreadPoolExecutor.CallerRunsPolicy() // Backpressure to WebSocket layer
);
3. Consider batching: collect N messages, write in one DB transaction.
Reduces DB transactions from 10,000/s to 100/s (batches of 100).
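4. Use connection pool monitoring:
management.endpoint.health.show-details=always
management.endpoints.web.exposure.include=health,metrics
# /actuator/health reports database reachability.
# HikariCP pool gauges (hikaricp.connections.active, hikaricp.connections.pending) are under /actuator/metrics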
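The batching idea in step 3 can be sketched as a buffer that hands a full batch to a single writer call. The class name `MessageBatcher` is hypothetical and the writer is just a callback, assumed here to wrap one DB transaction. A real deployment would also call `flush()` on a timer so a partial batch does not wait indefinitely, and should weigh the added latency against the "DB first, then broadcast" rule from Pitfall 4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Collects items and writes them in fixed-size batches (hypothetical helper). */
final class MessageBatcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> writer; // assumed to perform one DB transaction per call
    private final List<T> buffer = new ArrayList<>();

    MessageBatcher(int batchSize, Consumer<List<T>> writer) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.batchSize = batchSize;
        this.writer = writer;
    }

    /** Adds one item and writes the batch as soon as it is full. */
    synchronized void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Writes whatever is buffered; safe to call from a scheduled task. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(buffer);
        buffer.clear();
        writer.accept(batch);
    }
}
```

With a batch size of 100, 10,000 incoming messages per second turn into about 100 writer calls per second, which is the reduction described in step 3.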
Challenge 5: WebSocket Behind Nginx or Corporate Proxy
Problem: Some corporate firewalls and proxies do not forward WebSocket upgrade headers. The handshake then fails, often with an HTTP 400 or a dropped connection, and the client never connects.
Solution:
1. Use SockJS as fallback (Spring WebSocket supports this natively).
SockJS detects WebSocket support and falls back to:
- HTTP long-polling
- HTTP streaming
These work through almost all proxies.
2. Configure Nginx to pass WebSocket upgrade:
location /ws {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_read_timeout 3600s;  # Keep connection alive
    proxy_send_timeout 3600s;
}
3. Use port 443 (WSS) - almost no firewall blocks HTTPS port 443.
Plain WS on port 80 is sometimes stripped by proxies.
4. Trade-Off Analysis
WebSocket vs SSE - Detailed Trade-offs
| Factor | WebSocket | SSE |
|---|---|---|
| Direction | Bidirectional | Server to client only |
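| Protocol | WebSocket protocol (RFC 6455), started by an HTTP upgrade | HTTP/1.1 or HTTP/2 |
| Browser reconnect | Must implement in code | Built-in automatic |
| Load balancing | Long-lived connection pinned to one server (sticky sessions needed for SockJS fallbacks) | Also long-lived, but plain HTTP; a reconnect can land on any server |
| HTTP/2 multiplexing | Usually no - separate TCP connection (RFC 8441 support is not widespread) | Yes - multiple SSE streams on one HTTP/2 connection |
| Proxy/firewall compatibility | Sometimes blocked (need SockJS) | Usually works (it's HTTP), though buffering proxies can delay events |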
| Server implementation complexity | High | Low |
| Client implementation complexity | Medium | Low (native EventSource API) |
| Throughput | Very high | High |
| Latency | Low once connected (no per-message HTTP overhead) | Similar for server push; client-to-server messages need a separate HTTP request |
| Text vs Binary | Both | Text only (binary needs base64) |
| CDN compatibility | No | Partial |
WebSocket vs Message Queue (Kafka/SQS)
WebSocket is NOT a replacement for a message queue. They solve different problems.
Message Queue (Kafka, SQS):
- Durable message storage
- Consumer groups and replay
- Decoupling of producers and consumers
- Works even when consumers are offline
- Used for: async processing, event sourcing, microservice communication
WebSocket:
- Ephemeral, real-time delivery
- Connected clients only
- No persistence built in
- Used for: real-time UI updates, live notifications
Real-world integration pattern:
1. Order service publishes to Kafka: "ORDER_SHIPPED" event
2. Notification service consumes from Kafka
3. Notification service pushes via WebSocket to connected clients
4. Notification service saves to DB for offline users
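Steps 3 and 4 of this pattern boil down to one routing decision per consumed event. The sketch below shows that decision with the transport and storage abstracted behind two small interfaces; `NotificationRouter`, `Pusher`, and `Store` are hypothetical names, and the Kafka consumer that would call `handle` is left out.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Routes each consumed event to a live connection or to storage (hypothetical helper). */
final class NotificationRouter {
    /** Stand-in for a WebSocket send, e.g. a call to convertAndSendToUser. */
    interface Pusher {
        void push(String userId, String payload);
    }

    /** Stand-in for a DB insert that the user reads after coming back online. */
    interface Store {
        void save(String userId, String payload);
    }

    private final Set<String> connectedUsers = ConcurrentHashMap.newKeySet();
    private final Pusher pusher;
    private final Store store;

    NotificationRouter(Pusher pusher, Store store) {
        this.pusher = pusher;
        this.store = store;
    }

    void onConnect(String userId) {
        connectedUsers.add(userId);
    }

    void onDisconnect(String userId) {
        connectedUsers.remove(userId);
    }

    /** Called once per event consumed from the queue. */
    void handle(String userId, String payload) {
        if (connectedUsers.contains(userId)) {
            pusher.push(userId, payload);   // step 3: user is online
        } else {
            store.save(userId, payload);    // step 4: user is offline
        }
    }
}
```

In a multi-instance deployment the connected-user set would live in a shared store rather than in process memory, since the consumer and the user's connection may be on different servers.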
In-Memory Broker vs Redis vs RabbitMQ
| Aspect | Simple (In-Memory) Broker | Redis Pub/Sub Relay | RabbitMQ / ActiveMQ |
|---|---|---|---|
| Setup complexity | None | Low | High |
| Multi-instance support | No - single server only | Yes | Yes |
| Message persistence | No | No (pub/sub is ephemeral) | Yes (queues are durable) |
| Throughput | Very high (no network) | High (network hop to Redis) | Medium (STOMP overhead) |
| Features | Basic | Basic | Full (acks, dead letter, routing) |
| Operational overhead | None | Low | High |
| Use case | Development, single instance | Production multi-instance | Enterprise, durability needed |
5. Architectural Decision Framework
Decision 1: Do You Need WebSocket?
START HERE: What is the update frequency?
Less than once per minute --> Simple polling (REST GET every 30s)
Once per minute to 1/s --> Long polling or SSE
Multiple times per second --> WebSocket or SSE
Does the client need to SEND data back in real time?
YES --> WebSocket
NO --> SSE (simpler, stateless, easier to scale)
Is this a chat / gaming / collaborative tool?
YES --> WebSocket
NO --> Evaluate SSE first
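Read as code, the questions above have a clear precedence: the need for real-time client-to-server traffic decides first, and update frequency only matters afterwards. The sketch below encodes that reading; `TransportAdvisor` is a hypothetical name and the once-per-minute threshold is taken from the first branch of the tree.

```java
/** Encodes the decision tree above as a function (hypothetical helper). */
final class TransportAdvisor {
    enum Transport { POLLING, SSE, WEBSOCKET }

    /**
     * @param updatesPerSecond expected server-to-client update rate
     * @param clientSendsInRealTime whether the client must send data back in real time
     */
    static Transport recommend(double updatesPerSecond, boolean clientSendsInRealTime) {
        if (clientSendsInRealTime) {
            return Transport.WEBSOCKET;   // bidirectional traffic decides first
        }
        if (updatesPerSecond < 1.0 / 60.0) {
            return Transport.POLLING;     // less than once per minute
        }
        return Transport.SSE;             // server push only
    }
}
```

For example, a dashboard refreshed twice a second with no client input maps to SSE, while a chat client maps to WebSocket regardless of message rate.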
Decision 2: Single Server vs Clustered
Expected peak concurrent connections?
< 1,000 connections --> Single server with in-memory broker
Simplest. No Redis needed.
Spring Boot handles 10,000+ connections on one instance
with proper JVM tuning.
1,000 - 100,000 --> Multi-instance with Redis Pub/Sub
2-10 ECS tasks + ElastiCache Redis
This covers most production applications.
> 100,000 --> Consider dedicated WebSocket infrastructure:
- AWS API Gateway WebSocket API (fully managed)
- Purpose-built WS servers (Netty-based)
- Event-driven architecture (no long-lived connections)
Decision 3: AWS API Gateway WebSocket vs Self-Managed
AWS API Gateway WebSocket API:
Pros:
- Fully managed - no servers to maintain
- Scales automatically to millions of connections
- Pay per connection/message
- Built-in routing and integration with Lambda, HTTP backends
Cons:
- Higher cost at moderate scale (vs ECS)
- Limited STOMP/SockJS support (uses custom routing keys)
- Cold start latency with Lambda
- Complex fan-out (need to store connectionIds and iterate)
Self-managed (ECS + Spring Boot + Redis):
Pros:
- Full control over protocol and behavior
- STOMP/SockJS support
- Predictable cost at scale
- Existing Spring expertise applies
Cons:
- Operational overhead (ECS, Redis, ALB)
- Must implement scaling, health checks, graceful shutdown
Decision:
- Startups / moderate scale / Spring expertise --> ECS + Spring Boot + Redis
- Very large scale / serverless preference --> API Gateway WebSocket + Lambda
- Enterprise with RabbitMQ/ActiveMQ already running --> StompBrokerRelay
6. Load Testing WebSocket Systems
Tools
| Tool | Language | Best For |
|---|---|---|
| Gatling | Scala/Java | Large-scale WS load tests, CI integration |
| Artillery | Node.js | YAML-defined WebSocket and Socket.IO scenarios |
| JMeter (WS plugin) | Java | Existing JMeter users |
| k6 | JavaScript | Modern, cloud-native load tests |
| Locust | Python | Custom scenarios, easy scripting |
Artillery Load Test Configuration
# artillery-websocket-test.yml
# Note: the socketio engine speaks the Socket.IO protocol, which is not
# wire-compatible with SockJS or STOMP. Use this file as-is only against a
# Socket.IO server; a Spring STOMP endpoint needs the ws engine with STOMP frames.
config:
  target: "wss://your-server.com"
  phases:
    # Ramp up: 2 new users per second for 60 seconds
    - duration: 60
      arrivalRate: 2
      name: "Ramp up"
    # Sustained load: 100 new users per second for 5 minutes
    - duration: 300
      arrivalRate: 100
      name: "Sustained load"
    # Spike test: 500 new users per second for 1 minute
    - duration: 60
      arrivalRate: 500
      name: "Spike test"
  engines:
    socketio: {}
  defaults:
    headers:
      Authorization: "Bearer {{ $processEnvironment.TEST_TOKEN }}"
scenarios:
  - name: "Chat user simulation"
    engine: socketio
    flow:
      - emit:
          channel: "connect"
          data:
            token: "{{ $processEnvironment.TEST_TOKEN }}"
      - think: 2
      - emit:
          channel: "message"
          data:
            roomId: "load-test-room"
            content: "Load test message {{ $uuid }}"
      - think: 5
      - loop:
          - emit:
              channel: "message"
              data:
                roomId: "load-test-room"
                content: "Periodic message {{ $loopCount }}"
          - think: 10
        count: 30
Key Metrics to Monitor During Load Test
During load test, monitor:
Server metrics:
- Active WebSocket connections (should scale linearly with users)
- Message throughput (messages/second sent and received)
- Message processing latency (P50, P95, P99)
- Thread pool utilization (should not hit 100%)
- Heap memory usage (watch for leaks)
- GC pause times (G1GC should keep pauses < 200ms)
Database metrics:
- DB connection pool utilization
- Query latency
- Active connections
Redis metrics:
- Pub/sub message rate
- Memory usage
- Latency (should be sub 1ms)
Network metrics:
- ALB active connections
- ALB processed bytes
- Target response time
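The P50, P95, and P99 figures in the list above are percentiles over the recorded latencies. As a reference for how such a number is derived, here is a minimal sketch using the nearest-rank method; `LatencyStats` is a hypothetical helper, and monitoring tools may use interpolation or histograms and report slightly different values.

```java
import java.util.Arrays;

/** Nearest-rank percentile over a set of latency samples (hypothetical helper). */
final class LatencyStats {
    /**
     * Returns the smallest sample such that at least p percent of all samples
     * are less than or equal to it. p must be in (0, 100].
     */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }
}
```

A single slow outlier barely moves P50 but dominates P99, which is why the tail percentiles are the ones to watch during a spike phase.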
7. Debugging Guide
Debugging Connection Issues
# Enable Spring WebSocket debug logging
# In application.yml:
logging:
  level:
    org.springframework.web.socket: DEBUG
    org.springframework.messaging: DEBUG
    org.springframework.web.socket.handler: TRACE
# This logs every frame, subscription, and connection event.
# WARNING: Very verbose. Enable only for debugging, never in production.
Debugging Redis Pub/Sub
# Monitor all Redis activity in real time
redis-cli MONITOR
# Check all subscriptions currently active
redis-cli PUBSUB CHANNELS "*"
# Expected output: ws:* channels for each active topic
# Check subscriber count per channel
redis-cli PUBSUB NUMSUB ws:/topic/room.room1
# Publish a test message manually
redis-cli PUBLISH ws:/topic/test "Hello"
Debugging Stale Connections
-- Check MySQL for signs of WebSocket-related issues
-- Messages stuck in SENT status (not delivered) might indicate
-- connection issues
SELECT status, COUNT(*) as count
FROM chat_messages
WHERE created_at > DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY status;
-- Users active recently but with large number of unread notifications
-- might indicate delivery failures
SELECT user_id, COUNT(*) as unread_count
FROM notifications
WHERE is_read = 0
AND created_at > DATE_SUB(NOW(), INTERVAL 1 HOUR)
GROUP BY user_id
ORDER BY unread_count DESC
LIMIT 20;
Reading WebSocket Frames in Browser DevTools
Chrome DevTools > Network > Filter: WS > Click your WebSocket connection
> Messages tab
You will see:
- Green arrows (outgoing from browser to server)
- Red/White arrows (incoming from server to browser)
Each message shows:
- Timestamp
- Frame type (text/binary/ping/pong/close)
- Payload length
- Payload content
For STOMP messages, the format is:
SEND
destination:/app/chat.send/room1
content-type:application/json

{"content":"Hello"}
^@ (null byte frame terminator)
Look for:
- ERROR frames from server (configuration issues, auth failures)
- RECEIPT frames if your client requests them with a receipt header
- Unusual disconnect patterns (CLOSE frames with non-1000 status codes)
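When reading frames in DevTools it helps to know exactly how a STOMP frame is laid out: a command line, header lines, one blank line, the body, and a NUL terminator. The sketch below builds and picks apart such a frame. `StompFrames` is a hypothetical helper that assumes well-formed input and skips the header-value escaping that STOMP 1.2 requires; it is meant for inspection and tests, not as a replacement for a STOMP library.

```java
import java.util.Map;

/** Minimal STOMP frame builder and reader for well-formed frames (hypothetical helper). */
final class StompFrames {
    /** Layout: COMMAND, header lines, blank line, body, NUL byte. */
    static String encode(String command, Map<String, String> headers, String body) {
        StringBuilder frame = new StringBuilder(command).append('\n');
        for (Map.Entry<String, String> header : headers.entrySet()) {
            frame.append(header.getKey()).append(':').append(header.getValue()).append('\n');
        }
        return frame.append('\n').append(body).append('\0').toString();
    }

    /** The command is everything before the first newline. */
    static String commandOf(String frame) {
        return frame.substring(0, frame.indexOf('\n'));
    }

    /** The body sits between the first blank line and the NUL terminator. */
    static String bodyOf(String frame) {
        int bodyStart = frame.indexOf("\n\n") + 2;
        return frame.substring(bodyStart, frame.indexOf('\0', bodyStart));
    }
}
```

Encoding the SEND example above yields the same text shown in the Messages tab, with the trailing NUL that DevTools usually does not display.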
8. Industry Practices Summary
What Companies Actually Do in Production
| Practice | Industry Standard |
|---|---|
| Protocol | STOMP over WSS for most applications. Raw WebSocket for gaming. |
| Fallback | SockJS for enterprise deployments with proxy concerns |
| Authentication | JWT in query param, cookie, or STOMP CONNECT header. Not in custom handshake headers (the browser WebSocket API can't set them) |
| Scaling | Redis Pub/Sub for most. RabbitMQ for durability. API Gateway for serverless |
| Persistence | Always save messages to DB before broadcasting |
| Presence | Redis SET with TTL. Not in-memory (multi-instance) |
| Rate limiting | Per user, per connection, combined |
| Reconnection | Exponential backoff with jitter on client side |
| Monitoring | Active connections, message rate, error rate, latency P99 |
| Security | Origin check, token validation, rate limiting, message size limit |
| Heartbeat | 25-30 second intervals (standard) |
| Load balancer | ALB with sticky sessions + Redis for cross-server delivery |
| Message delivery | At-least-once via Redis sorted set buffer + client-side deduplication |
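The client-side deduplication named in the last row is what makes at-least-once delivery safe: a redelivered message must be recognized and ignored. A minimal sketch, assuming each message carries a unique id, is a bounded set of recently seen ids. `RecentMessageIds` is a hypothetical name and the capacity is a tuning choice, since an id that has been evicted will be treated as new again.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Remembers the most recent message ids so redeliveries can be dropped (hypothetical helper). */
final class RecentMessageIds {
    private final Map<String, Boolean> seen;

    RecentMessageIds(final int capacity) {
        // Insertion-ordered map that evicts its oldest entry once capacity is exceeded.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns true the first time an id is seen, false for a redelivery. */
    synchronized boolean firstTime(String messageId) {
        return seen.put(messageId, Boolean.TRUE) == null;
    }
}
```

A client would render a message only when `firstTime(message.getId())` returns true, where `getId()` stands for whatever unique id the server assigns.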
What Netflix / Slack / Discord Do
Slack:
- Uses a custom protocol built on WebSocket
- Separate connection infrastructure (not part of main app servers)
- "Presence" is a dedicated microservice
- Message queue (Kafka) feeds WebSocket delivery service
- Offline message buffering in Redis
Discord:
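- Elixir-based gateway servers running on the Erlang VM (not JVM) for performance at scale
- ETF (Erlang Term Format) encoding for lower bandwidth
- Heartbeat: the interval is supplied by the server on connect (commonly about 41.25 seconds), and the client adds jitter to its first beat
- Presence is "lazy" - in large servers it is loaded for the members a client is viewing rather than pushed to everyone in real time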
Netflix:
- WebSocket for watch party and social features
- Server-Sent Events for most notification use cases
- Zuul (their own) handles WebSocket at edge
- Strong preference for SSE where bidirectionality not needed
Lessons:
1. Separate WebSocket servers from application servers at scale.
2. Never mix WebSocket traffic and REST traffic on the same instance at scale.
3. Presence is hard. Invest in it separately.
4. At true scale, Spring Boot WebSocket is often replaced by custom solutions.
But a well-tuned Spring Boot instance can hold tens of thousands of connections.
The real ceiling depends on message rate, payload size, and heap, so load test before relying on a number.