Rate Limiting - Supplement 2: Real Production Challenges and Solutions
Twenty real production scenarios where rate limiting goes wrong.
Each includes: what happened, warning signs, root cause, diagnosis steps,
and the concrete fix applied in production systems.
Table of Contents
- Redis Memory Explosion at Scale
- Rate Limiter Becomes the Bottleneck
- Split-Brain During Redis Failover
- IP Rotation Attack Bypasses Per-IP Limits
- Carrier Grade NAT Blocking Thousands of Legitimate Users
- Clock Skew Causing Window Boundary Gaps
- Thundering Herd After Maintenance Window
- Marketing Campaign Triggers False Positive Rate Limiting
- Blue-Green Deployment Resets Rate Limit State
- Database Connection Pool Exhaustion from Rate Limit Metadata Queries
- Internal Service Fan-Out DoS
- Background Job Bug Exhausts Entire Daily Quota
- Multi-Tenant Quota Bleeding
- Lua Script Timeout Under Load
- Missing Retry-After Causes Self-Inflicted Retry Storm
- Hot Key Problem in Redis Cluster
- Rate Limit Bypass via HTTP Method Variation
- False Positive During Feature Flag Rollout
- WebSocket Orphan Connection Accumulation
- Multi-Region Request Duplication Exhausting Limits
Challenge 1: Redis Memory Explosion at Scale
What Happened
A production API with 500,000 active users deployed sliding window log rate limiting
(using Redis Sorted Sets). Everything worked fine in testing (1,000 users).
When the real user base onboarded over 3 months, Redis memory climbed from 2GB to 18GB
and OOM-killed the Redis process at 3 AM.
Warning Signs
- Redis memory usage growing linearly with user count
- Redis INFO memory showing used_memory_human climbing every day
- No corresponding drop in memory even when traffic drops overnight
- DBSIZE command showing millions of keys
Root Cause
Sliding window log stores a timestamp per request. With limit=500 requests/minute and
500,000 users each making 200 requests/hour:
500,000 users x 200 requests/hour x ~20 bytes per ZADD entry = 2 GB/hour
Keys are meant to live for 2x the window = 2 minutes
So at steady state: 500,000 users x 200/60 requests/minute x 20 bytes x 2 = 66 MB
But! ZADD members are not evicted until the next request from that user arrives.
Without a TTL on the key, users who stopped making requests keep their last-minute worth of entries forever.
Active users: 500,000. But total unique users who have EVER made a request: 5,000,000.
Worst case: 5,000,000 users x 500 entries x 20 bytes = 50 GB. OOM.
Diagnosis Steps
# Check Redis memory breakdown
redis-cli INFO memory
# Sample rate limit keys (first SCAN page only)
redis-cli SCAN 0 MATCH "rl:swl:*" COUNT 10000 | head -20
# Check key count with pattern
redis-cli --scan --pattern "rl:swl:*" | wc -l
# Check memory usage of one key
redis-cli MEMORY USAGE "rl:swl:user123"
# Find largest keys (use carefully in production - SCAN is O(n))
redis-cli --bigkeys
# Check TTL of a key (should be 2x window, not -1 which means no TTL)
redis-cli TTL "rl:swl:user123"
The Fix
Immediate (stop the bleeding):
# Set maxmemory and eviction policy
redis-cli CONFIG SET maxmemory 8gb
redis-cli CONFIG SET maxmemory-policy allkeys-lru
# LRU eviction: least recently used keys are evicted first, whatever they hold
# Rate limit keys for inactive users go early, but application data on the same instance can be evicted too
Short-term (reduce memory usage):
# Switch from Sliding Window Log to Sliding Window Counter (O(1) memory per user)
# Old: O(limit) = O(500) per user
# New: O(1) - two integers per user
# Lua script: sliding window counter (2 integer keys per user, not 500 sorted set entries)
COUNTER_SCRIPT = """
local base = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local win_id = math.floor(now / window)
local curr_key = base .. ':c:' .. win_id
local prev_key = base .. ':c:' .. (win_id - 1)
local elapsed_frac = (now % window) / window
local curr = tonumber(redis.call('GET', curr_key) or '0')
local prev = tonumber(redis.call('GET', prev_key) or '0')
local estimate = prev * (1 - elapsed_frac) + curr
if estimate < limit then
    redis.call('INCR', curr_key)
    redis.call('EXPIRE', curr_key, window * 2)
    return {1, math.floor(estimate) + 1}
end
return {0, math.floor(estimate)}
"""
# Memory: 500,000 users x 2 keys x 30 bytes = 30 MB (vs 50 GB). Roughly 1,667x reduction.
Long-term (capacity planning):
Memory per algorithm at 500K users with limit=500/min:
Sliding Window Log: 50 GB once 5M ever-seen users keep their entries (avoid at scale)
Sliding Window Counter: 30 MB (use this)
Token Bucket: 20 MB (also fine)
Fixed Window Counter: 15 MB (also fine)
Rule: Always calculate memory at target scale BEFORE choosing algorithm.
Memory = users x entries_per_user x bytes_per_entry
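The rule above can be turned into a small estimator. This is a minimal sketch: the function name is made up for illustration, and the result covers payload bytes only, ignoring Redis per-key overhead, so real usage will be higher.

```python
def estimate_memory_bytes(users: int, entries_per_user: int, bytes_per_entry: int) -> int:
    """Payload-only estimate; per-key overhead in Redis is deliberately ignored."""
    return users * entries_per_user * bytes_per_entry

# Figures from this challenge: 5M ever-seen users, up to 500 log entries, ~20 bytes each
swl = estimate_memory_bytes(5_000_000, 500, 20)
# Sliding window counter: 500K active users, 2 keys, ~30 bytes each
swc = estimate_memory_bytes(500_000, 2, 30)

print(f"sliding window log:     {swl / 1e9:.0f} GB")
print(f"sliding window counter: {swc / 1e6:.0f} MB")
```

Running the estimate with the target user count, not the test user count, is the point of the exercise.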
Challenge 2: Rate Limiter Becomes the Bottleneck
What Happened
A payment processing API with strict latency SLAs (p99 < 100ms) deployed Redis-based
rate limiting. Under normal load everything was fine. During peak (holiday season,
10x traffic), p99 latency jumped from 80ms to 350ms. Investigation revealed that
Redis rate limit checks were taking 40-90ms at p99 due to Redis CPU saturation.
Warning Signs
- API p99 latency spikes correlate exactly with rate limit check timing
- Redis CPU usage consistently above 70%
- Redis SLOWLOG shows rate limit commands taking >10ms
- Rate limit checks showing in distributed traces as the slowest span
Root Cause
Normal load: 1,000 RPS x 1 Redis round trip = 1,000 Redis ops/sec (Redis handles fine)
Peak load: 10,000 RPS x 1 Redis round trip = 10,000 Redis ops/sec (Redis at 80% CPU)
Plus: Each rate limit check requires a pipeline of 2-3 commands
Plus: Lua script execution
Plus: Network RTT (2ms average, 45ms p99 under load due to queuing)
Diagnosis
# Check Redis operations per second
redis-cli INFO stats | grep instantaneous_ops_per_sec
# Check Redis CPU
redis-cli INFO cpu
# Check slow commands
redis-cli SLOWLOG GET 25
# Check command latency
redis-cli --latency-history -i 1
# Find most frequent rate limit operations
redis-cli MONITOR | grep "rl:" | head -100
# (Use MONITOR only briefly - it impacts Redis performance)
The Fix
Option 1: Pipeline batching (quick win)
# Before: 2 Redis round trips per request (INCR + EXPIRE)
count = r.incr(key)
r.expire(key, 60)
# After: 1 Redis round trip (both commands are sent together; without MULTI/EXEC they are not atomic)
pipe = r.pipeline(transaction=False)  # transaction=False = no MULTI/EXEC overhead
pipe.incr(key)
pipe.expire(key, 60)
count, _ = pipe.execute()
# Reduces network round trips by 50%.
Option 2: Local cache for non-critical checks
from cachetools import TTLCache
# Cache rate limit decisions for 500ms
# Each user then triggers at most 2 Redis calls/second, however many requests they send
_local_cache = TTLCache(maxsize=50_000, ttl=0.5)
def is_allowed_cached(user_id: str) -> bool:
    cached = _local_cache.get(user_id)
    if cached is not None:
        return cached  # Use cached decision without Redis call
    # Cache miss: go to Redis
    result = redis_rate_limiter.is_allowed(user_id)
    _local_cache[user_id] = result["allowed"]
    return result["allowed"]
# Trade-off: requests served from the cache are not counted, so a user who hits the limit
# can keep getting a cached "allowed" for up to 500ms
# Acceptable for general APIs. Not acceptable for payment APIs.
Option 3: Async rate limiting for non-critical endpoints
import asyncio
# For endpoints where slight over-limit is acceptable:
# Check rate limit asynchronously, don't block the request
async def rate_limit_async(user_id: str, handler):
    # Start handler immediately (no blocking)
    handler_task = asyncio.create_task(handler())
    # Check rate limit concurrently
    rl_task = asyncio.create_task(redis_check_async(user_id))
    result = await handler_task
    rl_result = await rl_task
    if not rl_result["allowed"]:
        # Request already processed, but log the violation
        # Could add to a "debt" bucket to reduce future allowance
        metrics.increment("rate_limit.post_hoc_denial")
    return result
Option 4: Rate limiting at a higher layer
Move rate limiting from application code to Nginx or API Gateway.
Nginx limit_req module performs rate limiting in C, without Python GIL,
without Redis round trips (local in Nginx shared memory), sub-millisecond.
For user-level limits that still need Redis: shard Redis by user ID hash.
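Sharding by user ID hash could look like the sketch below. The shard addresses and function names are hypothetical, and simple modulo placement is an assumption chosen for brevity: it remaps most users whenever the shard count changes, which resets their counters.

```python
import hashlib

# Hypothetical shard addresses; substitute the real Redis endpoints
SHARDS = ["redis-rl-0:6379", "redis-rl-1:6379", "redis-rl-2:6379", "redis-rl-3:6379"]

def shard_index(user_id: str, num_shards: int) -> int:
    """Stable hash of the user ID; Python's built-in hash() is salted per process."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def shard_for(user_id: str) -> str:
    """Every app server maps the same user to the same shard."""
    return SHARDS[shard_index(user_id, len(SHARDS))]

print(shard_for("user123"))
```

Because each user's counter lives on exactly one shard, the per-user limit stays exact while the command load is spread across instances.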
Challenge 3: Split-Brain During Redis Failover
What Happened
A Redis Sentinel setup with 1 primary and 2 replicas experienced a network partition.
Sentinel promoted replica-A to primary because it could no longer see the original primary.
For 12 seconds, both old-primary and replica-A accepted writes (split-brain).
Rate limit counters diverged. After partition healed, users reported they had been
allowed 2x their rate limit during the incident window.
Technical Breakdown
Before partition:
Primary: counter = 45 (45 requests processed)
Replica-A: counter = 44 (1 write behind due to async replication lag)
During partition (12 seconds):
Primary (isolated): receives 30 more requests -> counter = 75
Replica-A (promoted): receives 30 requests -> counter = 74 (starts from 44)
Total allowed = 60 requests, although only 55 remained under limit = 100 (neither counter shows a breach, but 105 requests were served)
After partition heals:
Replica-A wins (it's the new primary)
Old primary demoted to replica, syncs from new primary
Old primary loses its 30 new writes
Counter = 74 (not 75)
Those 30 requests from old primary are "forgotten"
Users who made requests to old primary get "extra" allowance
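A toy model of the breakdown above, assuming each side of the partition enforces the limit against its own counter only. The function and field names are invented for illustration; no Redis is involved.

```python
LIMIT = 100

def simulate_partition(primary_count: int, replica_count: int,
                       requests_per_side: int, limit: int = LIMIT) -> dict:
    """Each side admits requests until its own counter reaches the limit."""
    allowed_primary = min(requests_per_side, max(0, limit - primary_count))
    allowed_replica = min(requests_per_side, max(0, limit - replica_count))
    return {
        "remaining_before_partition": limit - primary_count,
        "old_primary_counter": primary_count + allowed_primary,
        "new_primary_counter": replica_count + allowed_replica,
        "allowed_during_partition": allowed_primary + allowed_replica,
        # After healing, the promoted replica's value is the only one that survives
        "counter_after_heal": replica_count + allowed_replica,
    }

result = simulate_partition(primary_count=45, replica_count=44, requests_per_side=30)
for name, value in result.items():
    print(f"{name}: {value}")
```

The gap between `allowed_during_partition` and `remaining_before_partition` is the over-admission caused by the split.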
The Fix
Structural fix: Refuse writes when the primary has no reachable replica
Set min-replicas-to-write 1 on the primary (the setting is part of Redis replication, so it also applies to Sentinel-managed setups):
# Redis config
min-replicas-to-write 1
min-replicas-max-lag 10
# With this config: primary REFUSES writes once fewer than 1 replica has acknowledged within the last 10 seconds
# During partition: the isolated primary stops accepting rate limit INCRs after that lag window expires
# Users get errors (not double allowance)
# Trade-off: reduced availability during partition vs correctness
Accept the inconsistency with monitoring:
For most APIs, 2x limit for 12 seconds during a rare Redis failover is acceptable.
The right response is:
import time
class RateLimiterWithIncidentAwareness:
    def __init__(self, redis_limiter):
        self.redis_limiter = redis_limiter
        self._failover_detected_at = None  # Set by the Sentinel pub/sub listener
    def is_allowed(self, identifier: str) -> dict:
        result = self.redis_limiter.is_allowed(identifier)
        result["data_source"] = "primary"
        # If we know we are in a post-failover state, note it
        if self._is_post_failover_window():
            result["note"] = "post_failover_approximate"
            metrics.increment("rate_limit.post_failover_check")
        return result
    def _is_post_failover_window(self) -> bool:
        # Sentinel notifies app of failover via pub/sub
        # App sets a flag for 60 seconds after failover event
        return (self._failover_detected_at is not None
                and time.time() - self._failover_detected_at < 60)
Challenge 4: IP Rotation Attack Bypasses Per-IP Limits
What Happened
A public API enforcing 100 requests/minute per IP was targeted by a scraper with access
to a /14 IPv4 block (262,144 IP addresses). The attacker sent ~5 requests per IP per
minute, staying under each IP's individual limit. Effective throughput: ~22,000 RPS.
The system ran out of database connections within 3 minutes.
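The throughput figure follows directly from the block size. A minimal sketch using the standard `ipaddress` module; `10.0.0.0/14` is only a placeholder, since any /14 holds the same number of addresses.

```python
import ipaddress

def rotation_throughput_rps(cidr: str, requests_per_ip_per_minute: float) -> float:
    """Aggregate requests/second when every address in the block stays under its own limit."""
    network = ipaddress.ip_network(cidr, strict=False)
    return network.num_addresses * requests_per_ip_per_minute / 60

rps = rotation_throughput_rps("10.0.0.0/14", 5)
print(round(rps))
```

Each address sends only 5% of its 100 requests/minute allowance, which is why a per-IP limit alone never triggers.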
Warning Signs
- Many unique IPs each making exactly 1-5 requests, then never seen again
- All source IPs from the same ASN (Autonomous System Number) or IP range
- Requests arrive at a suspiciously uniform rate (for example, exactly one request per IP every 12 seconds = scripted)
- Content being scraped shows clear pattern (all product listings, all prices)
Diagnosis
# Find top client IPs by request count in nginx logs
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -20
# If top IPs are from same /24 block, it's likely a rotating attack
# Check with whois
whois 203.0.113.0 | grep -iE "netname|country|org"
# Count unique IPs per 10-minute bucket (first 16 characters of the timestamp)
awk '{print substr($4, 2, 16), $1}' access.log | sort -u | cut -d' ' -f1 | uniq -c
The Fix
Layer 1: ASN-level blocking at CDN
Cloudflare: Security -> WAF -> Tools -> IP Access Rules
Action: Block
Value: AS12345 (the ASN of the attacker)
Notes: Only do this if the entire ASN is malicious. ISPs share ASNs with many customers.
Layer 2: Behavioral fingerprinting
import hashlib
class BehavioralRateLimiter:
    """
    Detects IP rotation attacks by tracking behavioral signals across requests.
    """
    def get_request_fingerprint(self, request) -> str:
        """
        Combine signals that are hard to rotate:
        - TLS fingerprint (JA3 hash): hard to change per request
        - HTTP/2 settings fingerprint: browser-specific, hard to fake
        - Accept-Language, Accept-Encoding: usually consistent per client
        - User-Agent: easy to fake but often consistent for bot libraries
        """
        # X-JA3-Fingerprint is assumed to be injected by the TLS-terminating proxy
        ja3 = request.headers.get("X-JA3-Fingerprint", "unknown")
        ua = request.headers.get("User-Agent", "")[:50]
        accept = request.headers.get("Accept", "")[:20]
        lang = request.headers.get("Accept-Language", "")[:10]
        fingerprint_raw = f"{ja3}:{ua}:{accept}:{lang}"
        return hashlib.sha256(fingerprint_raw.encode()).hexdigest()[:16]
    def is_allowed_with_fingerprint(self, request) -> bool:
        ip = get_real_ip(request)
        fingerprint = self.get_request_fingerprint(request)
        # Rate limit by fingerprint regardless of IP
        fp_key = f"rl:fp:{fingerprint}"
        fp_count = redis.incr(fp_key)
        if fp_count == 1:
            redis.expire(fp_key, 60)  # Set TTL once so steady traffic cannot keep extending the window
        if fp_count > 100:
            return False  # Same browser/tool, rotating IPs -> blocked
        # Also check per-IP
        ip_key = f"rl:ip:{ip}"
        ip_count = redis.incr(ip_key)
        if ip_count == 1:
            redis.expire(ip_key, 60)
        if ip_count > 20:
            return False
        return True
Layer 3: Rate limit by subnet
import ipaddress
def get_subnet_key(ip: str, prefix_len_v4: int = 24, prefix_len_v6: int = 48) -> str:
    ip_obj = ipaddress.ip_address(ip)
    prefix_len = prefix_len_v4 if ip_obj.version == 4 else prefix_len_v6
    network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
    return f"rl:subnet:{network.network_address}"
# Rate limit by /24 subnet (all 256 IPv4 addresses in the block share a limit)
# Even if attacker rotates within /24, they share the counter
subnet_key = get_subnet_key(client_ip)
subnet_count = redis.incr(subnet_key)
if subnet_count == 1:
    redis.expire(subnet_key, 60)  # Set TTL once per window
Challenge 5: Carrier Grade NAT Blocking Thousands of Legitimate Users
What Happened
A mobile API using IP-based rate limiting started receiving customer complaints that
"the API is broken" for users on one major mobile carrier. Investigation revealed that
the carrier deployed Carrier Grade NAT (CGN), putting 50,000 mobile users behind 64
shared IP addresses. When any one user hit the rate limit, everyone sharing that address (about 780 other users) was blocked with them.
Technical Details
CGN: 50,000 mobile users share 64 IP addresses
Per-IP limit: 100 requests/minute
User A sends 100 requests -> IP 203.0.113.1 counter = 100 -> limit hit
User A gets 429.
But also User B, C, D... through User N (all sharing 203.0.113.1) get 429.
A single user affects roughly 780 other users on the same IP (50,000 / 64 = 781 users per IP).
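The shared-counter effect is easy to reproduce. The sketch below uses an in-memory `Counter` as a stand-in for the per-IP Redis counters; the names are illustrative only.

```python
from collections import Counter

PER_IP_LIMIT = 100
counters = Counter()  # in-memory stand-in for the per-IP Redis counters

def is_allowed_by_ip(ip: str) -> bool:
    """Per-IP limiter: every user behind the same address shares one counter."""
    counters[ip] += 1
    return counters[ip] <= PER_IP_LIMIT

shared_ip = "203.0.113.1"
user_a = [is_allowed_by_ip(shared_ip) for _ in range(100)]  # user A uses the whole allowance
user_b_first = is_allowed_by_ip(shared_ip)                  # user B's very first request

users_per_ip = 50_000 // 64
print(all(user_a), user_b_first, users_per_ip)
```

User B is denied without having sent a single earlier request, which is exactly what the affected customers reported.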
The Fix
Primary fix: Move to user-based rate limiting
This is the definitive solution. Require authentication and rate limit by user ID.
def get_rate_limit_identifier(request) -> tuple[str, str]:
    """
    Returns (identifier, identifier_type).
    Prefer specific identifiers over generic ones.
    """
    # 1. Best: Authenticated user ID (from JWT or session)
    user_id = extract_user_id_from_auth(request)
    if user_id:
        return f"user:{user_id}", "user"
    # 2. Good: API key
    api_key = request.headers.get("X-API-Key")
    if api_key:
        return f"apikey:{hash_key(api_key)}", "apikey"
    # 3. Fallback: IP (but with higher limit to account for NAT)
    ip = get_real_ip(request)
    return f"ip:{ip}", "ip"
IP_BASED_LIMIT = 500  # Higher limit for IP (accounts for NAT sharing)
USER_BASED_LIMIT = 100  # Tighter limit for identified users
def get_limit_for_identifier(identifier_type: str) -> int:
    return IP_BASED_LIMIT if identifier_type == "ip" else USER_BASED_LIMIT
Detect known CGN ranges and apply higher limits:
import ipaddress
# Known CGN address ranges (RFC 6598: 100.64.0.0/10)
# This range is the carrier-internal side of the NAT; a public server normally sees
# the carrier's public egress addresses, so the carrier list below matters most
CGN_RANGES = ["100.64.0.0/10"]
MOBILE_CARRIER_RANGES = [
    # Load from a database of known shared NAT ranges
    # These change frequently - use a service like MaxMind or IP2Location
]
def is_cgn_address(ip: str) -> bool:
    ip_obj = ipaddress.ip_address(ip)
    return any(
        ip_obj in ipaddress.ip_network(r)
        for r in CGN_RANGES + MOBILE_CARRIER_RANGES
    )
def get_ip_limit(ip: str) -> int:
    if is_cgn_address(ip):
        return 5_000  # Shared NAT: 50x higher limit
    return 100  # Regular IP
Challenge 6: Clock Skew Causing Window Boundary Gaps
What Happened
A fixed-window rate limiter showed a strange pattern: users could make slightly more than
their 100-request/minute limit when the requests were sent near the 60-second boundary.
Forensic analysis of logs showed that 3 of 10 application servers had clocks drifted by
800ms-1200ms from the others.
Root Cause
Server A clock: 12:00:00.000 (on time)
Server B clock: 12:00:00.900 (900ms ahead)
At wall clock time 11:59:59.500:
Server A says: "Still in window 11:59:00-12:00:00" -> increments window A counter
Server B says: "Already in window 12:00:00-12:01:00" -> increments window B counter
User sends 60 requests in the last 500ms of the minute, routed between A and B:
30 requests -> Server A -> counted in window 11:59 (correct)
30 requests -> Server B -> counted in window 12:00 (one window early)
User sends 60 requests in first 500ms of new minute:
Both servers now agree on window 12:00, whose counter goes from 30 to 90
Window 11:59 recorded only 30 of the 60 requests actually sent in its last 500ms.
A user who has exhausted the 11:59 quota is still admitted by Server B, which charges the empty 12:00 window.
These extra requests at the boundary come from the clock disagreement.
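The disagreement can be reproduced with plain arithmetic. A minimal sketch, assuming window IDs are derived from each server's local clock; `window_id` and the offsets are illustrative.

```python
WINDOW = 60  # seconds

def window_id(wall_clock: float, clock_offset: float, window: int = WINDOW) -> int:
    """Window a server assigns a request to, given its local clock offset in seconds."""
    return int((wall_clock + clock_offset) // window)

true_time = 12 * 3600 - 0.5           # 11:59:59.500 as seconds since midnight
server_a = window_id(true_time, 0.0)  # on time
server_b = window_id(true_time, 0.9)  # 900 ms ahead

print(server_a, server_b)
```

The two servers pick different window IDs for the same instant, so they increment different counters for the same user.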
The Fix
Use Redis server time in Lua scripts:
-- BROKEN: Uses client-supplied time (from a potentially drifted server)
local now = tonumber(ARGV[1]) -- application server time (may be drifted)
local window_id = math.floor(now / 60)
-- CORRECT: Use Redis server time (single source of truth)
local time_result = redis.call('TIME')
local now = tonumber(time_result[1]) + tonumber(time_result[2]) / 1e6
local window_id = math.floor(now / 60)
-- All rate limit checks on this Redis instance use identical time
-- Clock skew between application servers no longer matters
Important caveat: redis.call('TIME') makes the Lua script non-deterministic,
which can cause issues with Redis replication in some setups (mainly versions before 5.0, which replicate scripts verbatim instead of replicating their effects). The recommended approach:
- For Redis Standalone: Use TIME in Lua
- For Redis Cluster/Sentinel: Pass time from client but ensure NTP synchronization
- Monitor clock drift: chronyc tracking | grep "System time"
Challenge 7: Thundering Herd After Maintenance Window
What Happened
A major API platform performed scheduled maintenance from 2:00 AM to 2:30 AM. The
maintenance page returned 503. All clients backed off. At exactly 2:30 AM when service
resumed, 200,000 clients simultaneously sent their pent-up requests. The database
connection pool exhausted in 4 seconds. Service fell over again immediately.
The Timeline
02:00 AM: Maintenance starts. Service returns 503.
02:00-02:30: Clients back off. Most are sleeping with Retry-After=1800 (30 minutes).
02:30 AM: Service restores. Returns 200 OK.
02:30:00: 200,000 clients wake up and hammer the API simultaneously.
02:30:04: DB connection pool exhausted (1000 connections, 200,000 requests = 200x capacity)
02:30:05: Service falls over again.
02:30:05-02:45: Cascading restart loop, escalating outage.
The Fix
Gradual ramp-up with a "recovery" rate limit:
import time
class RecoveryAwareRateLimiter:
    """
    During recovery from an outage, temporarily reduce effective limits
    to give the system time to warm up.
    """
    def __init__(self, normal_limiter, recovery_duration: int = 300):
        self.normal_limiter = normal_limiter
        self.recovery_duration = recovery_duration  # 5-minute ramp
        self._recovery_start = None
    def signal_recovery_start(self):
        self._recovery_start = time.time()
    def get_recovery_multiplier(self) -> float:
        if self._recovery_start is None:
            return 1.0
        elapsed = time.time() - self._recovery_start
        if elapsed >= self.recovery_duration:
            self._recovery_start = None
            return 1.0
        # Linear ramp: 10% at recovery start, 100% after 5 minutes
        progress = elapsed / self.recovery_duration
        return 0.1 + (0.9 * progress)
    def is_allowed(self, identifier: str, base_limit: int) -> bool:
        multiplier = self.get_recovery_multiplier()
        effective_limit = max(1, int(base_limit * multiplier))
        return self.normal_limiter.is_allowed(identifier, limit=effective_limit)
Staggered Retry-After during maintenance:
import hashlib
import time
MAINTENANCE_END = 1735698600  # 2025-01-01T02:30:00Z as a Unix timestamp
def maintenance_response(request) -> Response:
    # Give clients DIFFERENT retry times (jittered) so they don't all come back at once
    user_id = extract_user_id(request) or request.remote_addr
    # Deterministic jitter: same user always gets same offset (consistent experience)
    user_hash = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    # Seconds until maintenance ends: 1800 at the start, less for clients that arrive later
    base_retry = max(0, int(MAINTENANCE_END - time.time()))
    jitter = user_hash % 600  # Up to 10 minutes of jitter
    retry_after = base_retry + jitter
    return Response(
        status=503,
        headers={
            "Retry-After": str(retry_after),
            "X-Maintenance-End": "2025-01-01T02:30:00Z"
        }
    )
Challenge 8: Marketing Campaign Triggers False Positive Rate Limiting
What Happened
A company launched a product on Hacker News (HN front page). Within 20 minutes, 50,000
unique users visited the site simultaneously. 80% of those came from a small set of IPs
(HN link aggregator bots, VPN services, corporate proxies). The IP-based rate limiter
blocked most of the traffic, killing the viral moment. The CEO was furious.
The Fix
Shadow mode before launch:
# During the 48 hours before a campaign launch, run in shadow mode:
class ShadowRateLimiter:
    def __init__(self, limiter):
        self.limiter = limiter
        self.shadow_log = []
    def is_allowed_shadow(self, identifier: str) -> bool:
        result = self.limiter.is_allowed(identifier)
        if not result["allowed"]:
            # Would have been denied - log it, but allow anyway
            self.shadow_log.append({
                "time": time.time(),
                "identifier": identifier,
                "count": result["count"],
                "limit": result["limit"]
            })
            return True  # Allow in shadow mode
        return True
Tier the response instead of hard block:
def handle_request_gracefully(request) -> Response:
    identifier, id_type = get_rate_limit_identifier(request)
    result = rate_limiter.check(identifier)
    if not result["allowed"] and id_type == "ip":
        # Under heavy load: serve cached/degraded response instead of 429
        cached_response = cache.get_stale(request.path)
        if cached_response:
            return Response(
                body=cached_response,
                status=200,
                headers={
                    "X-Cache": "STALE",
                    "X-RateLimit-Warning": "Near limit, serving cached response"
                }
            )
    return normal_handler(request)
Challenge 9: Blue-Green Deployment Resets Rate Limit State
What Happened
A team used in-memory rate limiting (not Redis). Their deployment process was:
- Deploy new version (green) alongside old version (blue)
- Route 100% traffic to green
- Terminate blue
When green started, all in-memory rate limit counters were zero. Users who had just hit
their limit on blue were suddenly able to make fresh requests on green. Sophisticated
users discovered they could reset their rate limits by sending a request that triggered
a deployment or by waiting for the next deployment cycle.
The Fix
# WRONG: In-memory state resets on deployment
class InMemoryRateLimiter:
    def __init__(self):
        self._counters = {}  # LOST on every deployment
# CORRECT: All state in Redis (survives any number of deployments)
class RedisRateLimiter:
    def __init__(self, redis_client):
        self.r = redis_client
# State lives in Redis, not in the application process.
# Blue-green, rolling, canary deployments all work transparently.
# Even if ALL instances are restarted simultaneously, state survives.
For the transition period (migrating from in-memory to Redis):
# Dual-write during migration: write to both, read from Redis
class MigratingRateLimiter:
    def __init__(self, local_limiter, redis_limiter):
        self.local = local_limiter
        self.redis = redis_limiter
    def is_allowed(self, identifier: str) -> bool:
        local_result = self.local.is_allowed(identifier)
        redis_result = self.redis.is_allowed(identifier)
        # During migration: use Redis result (more accurate)
        # Log when they disagree
        if local_result != redis_result["allowed"]:
            logger.info("migration_disagreement", identifier=identifier,
                        local=local_result, redis=redis_result["allowed"])
        return redis_result["allowed"]
Challenge 10: Database Connection Pool Exhaustion from Rate Limit Metadata Queries
What Happened
The rate limiter needed to look up each user's subscription tier to determine their limit.
This tier lookup queried the main PostgreSQL database on every request. At 5,000 RPS,
this generated 5,000 DB queries/second just for rate limit tier lookups. The database
connection pool (50 connections) was shared with the application. Rate limit lookups
consumed 30+ connections, leaving insufficient connections for actual business logic.
P99 latency for the entire API went from 50ms to 800ms.
Root Cause
5,000 RPS
x 1 tier lookup per request
x 5ms avg DB query time
= 25 connection-seconds of DB work per second, i.e. ~25 connections busy at all times just for rate limits
+ pool of 50 DB connections shared with the application
= DB connection pool exhausted as soon as latency or traffic rises
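The product of query rate and query time gives the average number of connections held at once (Little's law). A minimal sketch with the figures above; the function name is illustrative.

```python
def busy_connections(queries_per_second: float, avg_query_seconds: float) -> float:
    """Little's law: average connections in use = arrival rate x time each query holds one."""
    return queries_per_second * avg_query_seconds

POOL_SIZE = 50
tier_lookups = busy_connections(5_000, 0.005)  # rate limit tier lookups alone
left_for_app = POOL_SIZE - tier_lookups

print(f"held by tier lookups: {tier_lookups:.0f}, left for the application: {left_for_app:.0f}")
```

This is an average; any rise in query latency raises the number of connections held proportionally.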
The Fix
Cache tier data in Redis, not in the DB hot path:
import redis
TIER_CACHE_TTL = 300  # Cache tier for 5 minutes
def get_user_tier_cached(user_id: str, r: redis.Redis, db) -> str:
    # Assumes r was created with decode_responses=True so GET returns str, not bytes
    cache_key = f"tier:{user_id}"
    # Check Redis cache first (typically sub-millisecond, and no DB connection is used)
    cached = r.get(cache_key)
    if cached:
        return cached
    # Cache miss: hit DB (happens at most once per 5 minutes per user)
    tier = db.query("SELECT tier FROM users WHERE id = %s", user_id)
    r.setex(cache_key, TIER_CACHE_TTL, tier)
    return tier
# At 5,000 RPS with 100,000 unique users and 5-minute TTL:
# Requests per TTL period: 5,000 req/s x 300s = 1,500,000
# Cache misses per TTL period: at most 100,000 (one per user) = ~6.7% miss rate, ~93% hit rate
# DB queries for tier lookup: ~333/second (not 5,000/second)
# DB connections busy with tier lookups: ~2 (not 30+)
Even better: Pre-warm tier cache on login/token refresh:
def generate_token(user_id: str) -> str:
    tier = db.get_user_tier(user_id)
    token = create_jwt({"user_id": user_id, "tier": tier})
    # Cache tier at login time - invalidate on plan change
    r.setex(f"tier:{user_id}", 3600, tier)
    return token
# Now tier is always in cache when user makes API calls (they just logged in)
# DB is only hit when token is generated, not on every API call
Challenge 11: Internal Service Fan-Out DoS
What Happened
Service A (user-facing API) called Service B (data enrichment service) for every user
request. Service B was designed assuming 100 RPS from Service A. One day, Service A
launched a new feature that made Service A call Service B 10 times per user request
(fetching 10 data fields instead of 1). Service A received 200 RPS from users.
Service B now received 2,000 RPS instead of 200 RPS. Service B fell over.
Service A started timing out. Users saw 500 errors for 40 minutes until Service B was
scaled up.
The Fix
Outbound rate limiting at the caller:
@Service
public class EnrichmentClient {
    // Outbound rate limiter: limit calls TO Service B
    private final Bucket outboundBucket = Bucket4j.builder()
        .addLimit(Bandwidth.classic(500, Refill.greedy(500, Duration.ofSeconds(1))))
        .build();
    public List<EnrichmentData> enrich(List<String> ids) {
        // Batch: instead of N calls, send 1 call with N IDs
        // (Requires Service B to have a bulk endpoint)
        return batchFetch(ids);
    }
    public EnrichmentData enrichOne(String id) {
        // Rate-limited outbound call
        if (!outboundBucket.tryConsume(1)) {
            // Outbound limit hit: return cached/default data instead of erroring
            return getDefaultEnrichment(id);
        }
        return serviceB.get(id);
    }
}
Contract testing between services:
# Service B publishes its rate limit contract
# service-b-contract.yaml
rate_limit:
  max_rps: 500
  max_concurrent_requests: 50
  burst_size: 100
# Service A MUST configure its outbound limiter based on Service B's contract
# Contract testing: if Service B changes its limit, Service A's tests catch it
Challenge 12: Background Job Bug Exhausts Entire Daily Quota
What Happened
A data synchronization job had a bug: it had an infinite retry loop when it encountered
a specific error condition. The bug was not caught in testing because the error condition
only occurred with production data. The job started at midnight and by 1 AM had consumed
the entire month's API quota for a paid third-party service. The company received a $12,000
overage bill.
The Fix
Separate quotas for automated vs human callers:
class SegregatedQuotaManager:
    QUOTA_SEGMENTS = {
        "human_interactive": {"daily": 100_000, "burn_rate_alert": 5_000},  # per hour
        "batch_jobs": {"daily": 500_000, "burn_rate_alert": 50_000},  # per hour
        "sync_service": {"daily": 200_000, "burn_rate_alert": 10_000},  # per hour
    }
    def check_quota(self, segment: str) -> bool:
        daily_key = f"quota:{segment}:{self._today()}"
        count = self.r.incr(daily_key)
        if count == 1:
            self.r.expire(daily_key, 86400 * 2)
        limit = self.QUOTA_SEGMENTS[segment]["daily"]
        if count > limit:
            self.alert(f"Quota exhausted for {segment}")
            return False
        # Burn rate alert: count this request in the hourly bucket too
        # (the hourly key must be incremented, otherwise the alert can never fire)
        hourly_key = f"quota_hr:{segment}:{self._current_hour()}"
        hourly = self.r.incr(hourly_key)
        if hourly == 1:
            self.r.expire(hourly_key, 3600 * 2)
        alert_threshold = self.QUOTA_SEGMENTS[segment]["burn_rate_alert"]
        if hourly > alert_threshold:
            self.alert(f"High burn rate for {segment}: {hourly}/hour (alert at {alert_threshold})")
        return True
Circuit breaker on infinite-loop detection:
import time
class JobRateLimiter:
    """
    Detects runaway jobs by tracking per-job-instance request rate.
    """
    def __init__(self, r, max_rpm_per_job: int = 60):
        self.r = r
        self.max_rpm = max_rpm_per_job
    def check_job(self, job_id: str) -> bool:
        # Check first whether this job has already been killed
        if self.r.exists(f"job_killed:{job_id}"):
            return False
        key = f"job_rl:{job_id}:{int(time.time()) // 60}"
        count = self.r.incr(key)
        if count == 1:
            self.r.expire(key, 120)
        if count > self.max_rpm:
            # This job instance is making >60 requests/minute - likely a bug
            self.r.set(f"job_killed:{job_id}", 1, ex=3600)
            alert_ops(f"Job {job_id} killed: {count} RPM (max {self.max_rpm})")
            return False
        return True
Challenge 13: Multi-Tenant Quota Bleeding
What Happened
A SaaS platform had 100 customers sharing infrastructure, including a shared "API quota"
for a third-party AI service. One large customer used up the shared quota, and the other customers (paying $500/month each) could no longer use the AI features for the rest of the day.
The support queue received 200 tickets. Churn rate doubled that week.
The Fix
Hard per-tenant resource isolation:
import time
class TenantQuotaManager:
    """
    Each tenant has completely isolated quotas.
    One tenant cannot consume another tenant's resources.
    """
    def __init__(self, r, plan_quotas: dict):
        self.r = r  # Assumes a client created with decode_responses=True (keys come back as str)
        self.plan_quotas = plan_quotas
    def check_tenant_quota(self, tenant_id: str, resource: str) -> dict:
        plan = self.get_tenant_plan(tenant_id)
        quota = self.plan_quotas[plan][resource]
        today = time.strftime("%Y-%m-%d")
        key = f"quota:tenant:{tenant_id}:{resource}:{today}"
        count = self.r.incr(key)
        if count == 1:
            self.r.expire(key, 86400 * 2)
        return {
            "allowed": count <= quota,
            "used": count,
            "limit": quota,
            "remaining": max(0, quota - count),
            "tenant": tenant_id  # Each tenant has their OWN counter
        }
    def get_tenant_quota_usage(self, tenant_id: str) -> dict:
        today = time.strftime("%Y-%m-%d")
        pattern = f"quota:tenant:{tenant_id}:*:{today}"
        usage = {}
        for key in self.r.scan_iter(pattern):
            resource = key.split(":")[3]  # quota:tenant:<tenant_id>:<resource>:<date>
            usage[resource] = int(self.r.get(key) or 0)
        return usage
Challenge 14: Lua Script Timeout Under Load
What Happened
A rate limiter used a Lua script to implement GCRA (Generic Cell Rate Algorithm). The
script worked perfectly for months. During a traffic spike (5x normal), Redis CPU
hit 100% and scripts began running into the lua-time-limit (5 seconds), after which Redis answers other clients with BUSY errors.
All rate limit checks started returning errors. The fail-open policy kicked in. For
8 minutes, there were no rate limits at all.
Root Cause
Normal: 1,000 Lua executions/second -> Redis CPU 20%
Spike: 5,000 Lua executions/second -> Redis CPU 100%
Each Lua execution blocks Redis for 50-200 microseconds
At 100% CPU: executions queue up -> execution time grows -> timeout hit
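The CPU figures are consistent with a single command thread spending about 200 microseconds per script, the upper end of the range quoted above. A minimal sketch of that arithmetic; the function name is illustrative.

```python
def redis_utilization(executions_per_second: float, micros_per_execution: float) -> float:
    """Fraction of each second the single command thread spends inside scripts."""
    return executions_per_second * micros_per_execution / 1_000_000

normal = redis_utilization(1_000, 200)
spike = redis_utilization(5_000, 200)

print(f"normal: {normal:.0%}, spike: {spike:.0%}")
```

Once utilization reaches 1.0 there is no spare capacity, so any further arrivals only lengthen the queue.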
The Fix
Profile and simplify the Lua script:
# Redis slow log for Lua
redis-cli CONFIG SET slowlog-log-slower-than 500 # 0.5ms threshold
redis-cli SLOWLOG GET 25
# Watch round-trip latency while the load is running
redis-cli --latency-history -i 1
redis-cli DEBUG SLEEP 0 # No-op round trip for a baseline (DEBUG may be disabled on newer Redis versions)
# Run a specific Lua script by hand
redis-cli EVAL "return redis.call('SET', KEYS[1], ARGV[1])" 1 testkey testval
Replace complex GCRA Lua with simpler sliding window counter:
-- BEFORE: Complex GCRA with floating point math (slower)
-- ARGV: [1] now, [2] emission interval, [3] burst offset, [4] key TTL in seconds
local tat = tonumber(redis.call('GET', KEYS[1]) or '0')
local now = tonumber(ARGV[1])
local emission_interval = tonumber(ARGV[2])
local burst_offset = tonumber(ARGV[3])
local new_tat = math.max(now, tat) + emission_interval
local allowed_at = new_tat - burst_offset
if allowed_at <= now then
    redis.call('SETEX', KEYS[1], tonumber(ARGV[4]), tostring(new_tat))
    return {1, new_tat - now}
end
return {0, allowed_at - now}
-- Execution time: ~150 microseconds
-- AFTER: Simple sliding window counter (faster)
-- ARGV: [1] limit, [2] window length in seconds, [3] now
local k1 = KEYS[1] .. ':' .. math.floor(tonumber(ARGV[3]) / tonumber(ARGV[2]))
local k2 = KEYS[1] .. ':' .. (math.floor(tonumber(ARGV[3]) / tonumber(ARGV[2])) - 1)
local c1 = tonumber(redis.call('GET', k1) or '0')
local c2 = tonumber(redis.call('GET', k2) or '0')
-- w: fraction of the current window that has already elapsed
local w = (tonumber(ARGV[3]) % tonumber(ARGV[2])) / tonumber(ARGV[2])
if c1 + c2 * (1 - w) < tonumber(ARGV[1]) then
    redis.call('INCR', k1)
    redis.call('EXPIRE', k1, tonumber(ARGV[2]) * 2)
    return 1
end
return 0
-- Execution time: ~50 microseconds. 3x faster.
Challenge 15: Missing Retry-After Causes Self-Inflicted Retry Storm
What Happened
An internal service received 429 responses from a dependency service that had just
deployed a new rate limiter. The new rate limiter returned 429 but did NOT include the
Retry-After header. The calling service had this logic:
if response.status_code == 429:
    time.sleep(1)  # Default: retry after 1 second
    return self.retry(request)
The rate limit window was 60 seconds. Clients retried every 1 second, generating 60
retries per blocked request. The dependency service now received 60x more requests from
the retry storm than from original traffic. Its rate limiter blocked even more traffic.
A feedback loop: more rate limiting -> more retries -> more rate limiting.
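The amplification factor follows directly from the window length and the retry interval. A minimal sketch of that arithmetic, assuming the worst case where a client is blocked at the very start of the window and retries at a fixed interval until it resets (the function name is illustrative):

```python
import math

def retries_per_blocked_request(window_seconds: float, retry_interval_seconds: float) -> int:
    """Worst-case number of retries one blocked request generates before the window resets."""
    return math.floor(window_seconds / retry_interval_seconds)

fixed_sleep = retries_per_blocked_request(60, 1)     # client sleeps 1 second between retries
honors_reset = retries_per_blocked_request(60, 60)   # client waits for the full window

print(fixed_sleep, honors_reset)  # 60 1
```

A client that waits for the window to reset sends one retry instead of 60, which is why telling the client when to come back removes most of the extra load.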
The Fix
Always include Retry-After:
import random
import time

def handle_rate_limited_response(identifier: str, reset_at: int) -> Response:
    # Response stands in for the web framework's response class
    now = int(time.time())
    retry_after = max(1, reset_at - now)
    # Optional: Add jitter to stagger retries
    retry_after_jittered = retry_after + random.randint(0, 10)
    return Response(
        status=429,
        headers={
            "Retry-After": str(retry_after_jittered),
            "X-RateLimit-Reset": str(reset_at),
            "X-RateLimit-Remaining": "0"
        },
        body={
            "error": "rate_limit_exceeded",
            "retry_after": retry_after_jittered
        }
    )
Calling service: respect Retry-After, use backoff:
import random
import time

class MaxRetriesExceeded(Exception):
    """Raised when every attempt was answered with 429."""

def call_with_retry(func, max_retries=5):
    for attempt in range(max_retries):
        response = func()
        if response.status_code != 429:
            return response
        # Use server's Retry-After, with fallback to exponential backoff
        retry_after = response.headers.get("Retry-After")
        if retry_after:
            wait = float(retry_after)  # assumes the delay-seconds form, not an HTTP-date
        else:
            wait = min(300, (2 ** attempt) + random.uniform(0, 5))  # jitter
        if attempt < max_retries - 1:
            time.sleep(wait)
    raise MaxRetriesExceeded()
Challenge 16: Hot Key Problem in Redis Cluster
What Happened
On a public API, a viral post drove 1,000x normal traffic to one specific content creator.
All requests for that creator's content were rate-limited by the same Redis key:
rl:user:{creator_id}. That key was on shard 3 of a 10-node Redis Cluster. Shard 3
was at 100% CPU while all other shards were at 8%. p99 latency for all users of shard 3
(not just this creator) was 300ms instead of the normal 2ms.
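Why one key pins one shard: Redis Cluster assigns every key to one of 16384 hash slots using CRC16 of the key (or of its `{hash tag}`, if present), and each slot is served by exactly one primary. The sketch below reproduces that mapping locally with the standard library; the creator ID and key names are hypothetical.

```python
import binascii

def cluster_slot(key: str) -> int:
    """Hash slot for a key: CRC16 (XMODEM) of the key, or of its non-empty {tag}, mod 16384."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]  # only the hash tag is hashed
    return binascii.crc_hqx(data, 0) % 16384

hot_key = "rl:user:creator42"  # hypothetical creator ID
print(cluster_slot(hot_key))   # one key -> one slot -> one shard, regardless of traffic

# Adding a suffix produces different keys, which can land in different slots
suffixed = [f"{hot_key}:shard:{i}" for i in range(10)]
print(sorted({cluster_slot(k) for k in suffixed}))
```

Note that wrapping the ID in literal braces (for example `rl:user:{creator42}:shard:3`) would make it a hash tag and send every suffixed key back to the same slot, so any key-splitting scheme has to avoid that.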
The Fix
Local sharding for hot keys:
import random

NUM_VIRTUAL_SHARDS = 10

def get_sharded_key(base_key: str, shards: int = NUM_VIRTUAL_SHARDS) -> str:
    """
    For hot keys: distribute across N virtual shards to spread load.
    The actual count requires summing all shards (approximate).
    """
    shard_id = random.randint(0, shards - 1)
    return f"{base_key}:shard:{shard_id}"

def get_total_count_sharded(base_key: str, r, shards: int = NUM_VIRTUAL_SHARDS) -> int:
    """Get approximate total count across all shards."""
    pipe = r.pipeline()
    for i in range(shards):
        pipe.get(f"{base_key}:shard:{i}")
    results = pipe.execute()
    return sum(int(v or 0) for v in results)

# For the hot creator: 10 Redis keys instead of 1
# Spread across up to 10 shards -> up to 10x load distribution
# Trade-off: reading total count requires 10 GET operations (batched in pipeline)
Per-endpoint limits for specific hot users:
def get_limit_for_hot_user(user_id: str) -> int:
    # Detect hot users: if their per-minute count exceeds normal by 100x
    # Automatically route them to a higher limit (so their requests don't
    # clog the rate limit check for other users)
    if is_hot_user(user_id):
        return get_enterprise_limit(user_id)  # higher limit, separate key space
    return get_standard_limit(user_id)
Challenge 17: Rate Limit Bypass via HTTP Method Variation
What Happened
An API limited GET /api/users to 100 RPM. A developer discovered that the rate limiter
checked HTTP method + path together. POST /api/users was used for user creation and had
a different (looser) rate limit. The developer sent POST /api/users with a search body
to do pagination reads, bypassing the GET limit.
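The bypass is easy to reproduce with an in-memory stand-in for the limiter. This is a hypothetical reconstruction, not the original code: the key layout and the POST limit of 1000 are assumptions chosen only to show that each (method, path) pair gets its own counter.

```python
from collections import Counter

counters = Counter()
# Hypothetical limits: GET matches the scenario, POST is an assumed "looser" value
LIMITS = {("GET", "/api/users"): 100, ("POST", "/api/users"): 1000}

def allow(user_id: str, method: str, path: str) -> bool:
    """Method-scoped limiter: the HTTP method is part of the counter key."""
    key = f"rl:{user_id}:{method}:{path}"
    counters[key] += 1
    return counters[key] <= LIMITS[(method, path)]

# Exhaust the GET budget: the 101st call is rejected
get_results = [allow("u1", "GET", "/api/users") for _ in range(101)]
# The same read, sent as POST, is tracked under a different key and still passes
post_result = allow("u1", "POST", "/api/users")

print(get_results[-1], post_result)  # False True
```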
The Fix
Rate limit by normalized resource path, not HTTP method:
RESOURCE_LIMITS = {
    "/api/users": 100,  # All methods share this limit
    "/api/products": 500,
    "/api/search": 30,
    "/api/auth/login": 5,
}

def get_resource_key(request) -> str:
    # Normalize: strip method, normalize path params
    # (normalize_path() is not shown; it collapses path parameters)
    path = normalize_path(request.path)
    # Assumes the request object exposes the authenticated user's ID
    user_id = request.user_id
    # All methods to the same resource share one counter
    return f"rl:resource:{user_id}:{path}"

# GET /api/users and POST /api/users both decrement the same "users" counter
# Method-specific limits still possible via a separate higher-cost counter:
WRITE_ADDITIONAL_COST = {
    "POST": 2,  # Writes cost 2x reads
    "PUT": 2,
    "DELETE": 3,
    "PATCH": 1,
    "GET": 1,
    "HEAD": 0,  # No cost
}
Challenge 18: False Positive During Feature Flag Rollout
What Happened
A new "smart refresh" feature for mobile clients was rolled out via feature flags to 1%
of users. The feature caused the mobile app to poll an API 10x more frequently to keep
content fresh. That 1% of users immediately hit their rate limits. The team interpreted
this as a bug in the mobile app, not a rate limit issue, and spent 4 hours debugging
mobile code before realizing the feature simply needed higher rate limits.
The Fix
Pre-flight rate limit impact analysis for feature flags:
# Assumed per-user hourly limit; 500 is the value consistent with the 120% figure below
USER_HOURLY_LIMIT = 500

class FeatureFlagRateLimitAnalyzer:
    """
    Before enabling a feature flag, estimate its rate limit impact.
    """
    def estimate_impact(
        self,
        feature_name: str,
        api_calls_per_user_per_hour: int,
        rollout_percentage: float,  # fraction of users, e.g. 0.01 for 1%
        total_active_users: int
    ) -> dict:
        affected_users = int(total_active_users * rollout_percentage)
        additional_rps = (affected_users * api_calls_per_user_per_hour) / 3600
        user_usage_pct = api_calls_per_user_per_hour / USER_HOURLY_LIMIT * 100
        return {
            "feature": feature_name,
            "affected_users": affected_users,
            "additional_rps": additional_rps,
            "user_rate_limit_usage_pct": user_usage_pct,
            "will_hit_user_limit": user_usage_pct > 80,
            "recommendation": (
                "Increase user limit before rollout"
                if user_usage_pct > 80 else "Safe to roll out"
            )
        }

# Before rolling out "smart refresh":
analyzer = FeatureFlagRateLimitAnalyzer()
analysis = analyzer.estimate_impact(
    feature_name="smart_refresh",
    api_calls_per_user_per_hour=600,  # 10x polling, 60 polls/hour * 10 = 600
    rollout_percentage=0.01,
    total_active_users=500_000
)
# Would have shown: user_rate_limit_usage_pct = 120% -> will_hit_user_limit = True
# Team would have adjusted limits before rollout
Challenge 19: WebSocket Orphan Connection Accumulation
What Happened
A real-time messaging service had a limit of 5 concurrent WebSocket connections per user.
Mobile clients that lost network connectivity did not send a CLOSE frame. The server-side
connections lingered for hours (no timeout configured). Users on unstable mobile connections
hit their 5-connection limit after switching between WiFi and cellular a few times. They
could no longer connect to the messaging service. Support tickets: "App is broken."
The Fix
class WebSocketConnectionManager:
    """
    Track and enforce concurrent WebSocket limits with TTL-based cleanup.
    """
    MAX_CONNECTIONS_PER_USER = 5
    CONNECTION_TTL = 300  # 5-minute heartbeat window

    def __init__(self, r):
        self.r = r  # Redis client; assumed to use decode_responses=True

    def add_connection(self, user_id: str, connection_id: str) -> bool:
        key = f"ws:conns:{user_id}"
        conn_ttl_key = f"ws:conn:{connection_id}:ttl"
        # Cleanup stale connections first
        self._cleanup_stale(user_id)
        # Check current count
        count = int(self.r.scard(key) or 0)
        if count >= self.MAX_CONNECTIONS_PER_USER:
            return False
        # Add connection
        self.r.sadd(key, connection_id)
        self.r.expire(key, self.CONNECTION_TTL * 2)
        self.r.setex(conn_ttl_key, self.CONNECTION_TTL, "alive")
        return True

    def heartbeat(self, user_id: str, connection_id: str) -> None:
        """Client sends heartbeat every 60 seconds to keep connection alive."""
        conn_ttl_key = f"ws:conn:{connection_id}:ttl"
        self.r.expire(conn_ttl_key, self.CONNECTION_TTL)
        # Also refresh the per-user set; otherwise it expires while
        # long-lived connections are still open and the limit stops applying
        self.r.expire(f"ws:conns:{user_id}", self.CONNECTION_TTL * 2)

    def remove_connection(self, user_id: str, connection_id: str) -> None:
        key = f"ws:conns:{user_id}"
        self.r.srem(key, connection_id)
        self.r.delete(f"ws:conn:{connection_id}:ttl")

    def _cleanup_stale(self, user_id: str) -> None:
        """Remove connections whose heartbeat TTL has expired."""
        key = f"ws:conns:{user_id}"
        all_connections = self.r.smembers(key)
        for conn_id in all_connections:
            ttl_key = f"ws:conn:{conn_id}:ttl"
            if not self.r.exists(ttl_key):
                # TTL expired = heartbeat not received = stale connection
                self.r.srem(key, conn_id)
Challenge 20: Multi-Region Request Duplication Exhausting Limits
What Happened
A global API with US-EAST and EU-WEST regions used cross-region active-active replication.
A mobile client on a flaky connection sometimes had requests processed in BOTH regions
(the first region processed the request but the response was lost; the client retried and
hit the second region). The user's rate limit counter was incremented TWICE for one logical
request. In the worst case, where every request is duplicated, a user on a flaky mobile
network would hit their 100 requests/minute limit after only 50 logical requests.
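The effect scales with how often requests are duplicated. A small sketch of the arithmetic, assuming each duplicated request is counted exactly twice (the function name and the 20% figure are illustrative):

```python
def effective_logical_limit(limit: int, duplicate_fraction: float) -> int:
    """Logical requests a user can make before the counter reaches `limit`
    when `duplicate_fraction` of requests are counted twice."""
    return int(limit / (1 + duplicate_fraction))

worst_case = effective_logical_limit(100, 1.0)   # every request duplicated
occasional = effective_logical_limit(100, 0.2)   # one in five duplicated
no_dupes = effective_logical_limit(100, 0.0)

print(worst_case, occasional, no_dupes)  # 50 83 100
```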
The Fix
Idempotency key-based counting:
import json
from typing import Optional

class IdempotencyAwareRateLimiter:
    """
    Uses idempotency keys to ensure retries of the same request
    do not consume additional rate limit tokens.
    """
    IDEMPOTENCY_TTL = 300  # 5 minutes

    def __init__(self, r, rate_limiter):
        self.r = r                        # Redis client
        self.rate_limiter = rate_limiter  # underlying limiter exposing check(user_id, cost=...)

    def check_and_consume(
        self,
        user_id: str,
        idempotency_key: Optional[str] = None,
        cost: int = 1
    ) -> dict:
        if idempotency_key:
            # Check if we've already processed this idempotency key
            idem_redis_key = f"idem:{user_id}:{idempotency_key}"
            existing = self.r.get(idem_redis_key)
            if existing:
                # Already counted - return same result, no new charge
                result = json.loads(existing)
                result["idempotent_replay"] = True
                return result
        # New request: check rate limit
        result = self.rate_limiter.check(user_id, cost=cost)
        if idempotency_key and result["allowed"]:
            # Store result for idempotency replay
            self.r.setex(
                f"idem:{user_id}:{idempotency_key}",
                self.IDEMPOTENCY_TTL,
                json.dumps(result)
            )
        return result
Client-side: generate idempotency key once per logical request:
import random
import time
import uuid

import requests

class MaxRetriesError(Exception):
    """Raised when every attempt was answered with 429."""

class ResilientAPIClient:
    def __init__(self, base_url: str, token: str):
        self.base_url = base_url
        self.token = token

    def make_request(self, method: str, path: str, body: dict = None) -> dict:
        # Generate idempotency key ONCE for this logical request
        idempotency_key = str(uuid.uuid4())
        for attempt in range(5):
            response = requests.request(
                method, f"{self.base_url}{path}",
                json=body,
                headers={
                    "X-Idempotency-Key": idempotency_key,  # Same key on every retry
                    "Authorization": f"Bearer {self.token}"
                }
            )
            if response.status_code != 429:
                return response.json()
            retry_after = float(response.headers.get("Retry-After", 2 ** attempt))
            time.sleep(retry_after + random.uniform(0, 1))
        raise MaxRetriesError()
Production Challenge Quick Reference
| Challenge | Primary Cause | Detection Signal | Key Fix |
|---|---|---|---|
| Memory explosion | Wrong algorithm at scale | Redis memory climbing | Switch to O(1) algorithm |
| Rate limiter bottleneck | Too many Redis ops | p99 latency spike | Pipeline + local cache |
| Split-brain failover | Redis replication lag | 2x traffic during failover | Accept or use Cluster quorum |
| IP rotation attack | ASN-wide attack | Many IPs, same ASN, low per-IP | Fingerprinting + ASN blocking |
| CGN blocking | Shared IP for thousands | One carrier's users all blocked | User-based limiting |
| Clock skew | NTP drift | Boundary-case over-limit | Redis TIME in Lua |
| Thundering herd | Synchronized retry | Spike at maintenance end | Jittered Retry-After |
| False positive campaign | IP-based + viral traffic | Support spike + social media | Behavioral signals + soft limits |
| Deployment reset | In-memory state | Limits reset on deploy | Redis-backed state |
| DB pool exhaustion | Tier lookup in hot path | DB connections spike | Cache tier in Redis |
| Internal fan-out DoS | 1 request -> N internal calls | Service B overwhelmed by A | Outbound limiting at caller |
| Job bug quota drain | Infinite loop | Quota burned in minutes | Separate job quotas + burn alert |
| Multi-tenant bleeding | Shared quota | One tenant blocks others | Per-tenant isolation |
| Lua timeout | Complex script | Redis BUSY errors | Simplify Lua |
| Retry storm | Missing Retry-After | 429 storm in logs | Always include Retry-After |
| Hot key CPU spike | Viral user | One Redis shard at 100% | Virtual key sharding |
| Method bypass | Method-specific limits | Unusual POST patterns | Resource-based (not method) limits |
| Feature flag impact | No pre-flight analysis | Sudden rate limit spike | Rate limit impact analysis |
| WebSocket orphan | No heartbeat timeout | Connection count climbs | TTL-based cleanup |
| Multi-region duplication | No idempotency | Flaky mobile double-counted | Idempotency keys in rate limits |
Next Supplement: Supplement 3 - Trade-Offs and Decision Guide