6/6/2026 | Admin Post


Rate Limiting - Supplement 2: Real Production Challenges and Solutions

Series Navigation:
Main Index |
Supplement 1 - Anti-Patterns Extended |
Supplement 3 - Trade-Offs and Decision Guide |
Supplement 4 - Architecture Patterns

Twenty real production scenarios where rate limiting goes wrong.
Each includes: what happened, warning signs, root cause, diagnosis steps,
and the concrete fix applied in production systems.


Table of Contents

  1. Redis Memory Explosion at Scale
  2. Rate Limiter Becomes the Bottleneck
  3. Split-Brain During Redis Failover
  4. IP Rotation Attack Bypasses Per-IP Limits
  5. Carrier Grade NAT Blocking Thousands of Legitimate Users
  6. Clock Skew Causing Window Boundary Gaps
  7. Thundering Herd After Maintenance Window
  8. Marketing Campaign Triggers False Positive Rate Limiting
  9. Blue-Green Deployment Resets Rate Limit State
  10. Database Connection Pool Exhaustion from Rate Limit Metadata Queries
  11. Internal Service Fan-Out DoS
  12. Background Job Bug Exhausts Entire Daily Quota
  13. Multi-Tenant Quota Bleeding
  14. Lua Script Timeout Under Load
  15. Missing Retry-After Causes Self-Inflicted Retry Storm
  16. Hot Key Problem in Redis Cluster
  17. Rate Limit Bypass via HTTP Method Variation
  18. False Positive During Feature Flag Rollout
  19. WebSocket Orphan Connection Accumulation
  20. Multi-Region Request Duplication Exhausting Limits

Challenge 1: Redis Memory Explosion at Scale

What Happened

A production API with 500,000 active users deployed sliding window log rate limiting
(using Redis Sorted Sets). Everything worked fine in testing (1,000 users).
When the real user base onboarded over 3 months, Redis memory climbed from 2GB to 18GB
and OOM-killed the Redis process at 3 AM.

Warning Signs

  • Redis memory usage growing linearly with user count
  • Redis INFO memory showing used_memory_human climbing every day
  • No corresponding drop in memory even when traffic drops overnight
  • DBSIZE command showing millions of keys

Root Cause

Sliding window log stores a timestamp per request. With limit=500 requests/minute and
500,000 users each making 200 requests/hour:

Naive hourly volume: 500,000 users x 200 requests/hour x ~20 bytes per ZADD entry = 2 GB/hour
Keys are kept for 2x the window = 2 minutes
So at steady state: 500,000 users x 200/60 requests/minute x 20 bytes x 2 = ~67 MB
But ZADD members are only trimmed when the next request from that user arrives.
Without a TTL on the key, users who stopped making requests keep their last minute's worth of entries forever.
Active users: 500,000. But total unique users who have EVER made a request: 5,000,000.
Worst case (every key holding a full window): 5,000,000 users x 500 entries x 20 bytes = 50 GB. OOM.

Diagnosis Steps

# Check Redis memory breakdown
redis-cli INFO memory
 
# Sample rate limit keys (first SCAN page only; the --scan form below gives the full count)
redis-cli SCAN 0 MATCH "rl:swl:*" COUNT 10000 | head -20
 
# Check key count with pattern
redis-cli --scan --pattern "rl:swl:*" | wc -l
 
# Check memory usage of one key
redis-cli MEMORY USAGE "rl:swl:user123"
 
# Find largest keys (use carefully in production - SCAN is O(n))
redis-cli --bigkeys
 
# Check TTL of a key (should be 2x window, not -1 which means no TTL)
redis-cli TTL "rl:swl:user123"

The Fix

Immediate (stop the bleeding):

# Set maxmemory and eviction policy
redis-cli CONFIG SET maxmemory 8gb
redis-cli CONFIG SET maxmemory-policy allkeys-lru
# LRU eviction: least recently used keys are evicted first, whatever their type
# Rate limit keys for inactive users are usually the coldest keys, but application data
# stored on the same instance can be evicted too

Short-term (reduce memory usage):

# Switch from Sliding Window Log to Sliding Window Counter (O(1) memory per user)
# Old: O(limit) = O(500) per user
# New: O(1) - two integers per user
 
# Lua script: sliding window counter (2 integer keys per user, not 500 sorted set entries)
COUNTER_SCRIPT = """
local base = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local win_id = math.floor(now / window)
local curr_key = base .. ':c:' .. win_id
local prev_key = base .. ':c:' .. (win_id - 1)
local elapsed_frac = (now % window) / window
local curr = tonumber(redis.call('GET', curr_key) or '0')
local prev = tonumber(redis.call('GET', prev_key) or '0')
local estimate = prev * (1 - elapsed_frac) + curr
if estimate < limit then
    redis.call('INCR', curr_key)
    redis.call('EXPIRE', curr_key, window * 2)
    return {1, math.floor(estimate) + 1}
end
return {0, math.floor(estimate)}
"""
# Memory: 500,000 users x 2 keys x 30 bytes = 30 MB (vs 50 GB). 1666x reduction.
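
To check the estimate logic without a Redis instance, the same arithmetic can be mirrored in plain Python. This is a minimal sketch for unit tests only: the class name SlidingWindowCounter and the in-memory dict are assumptions made for illustration, and nothing here expires old windows the way EXPIRE does in the script.

```python
import math


class SlidingWindowCounter:
    """In-memory mirror of the Lua script above (illustrative, not for production)."""

    def __init__(self, limit: int, window: int):
        self.limit = limit
        self.window = window
        self.counts = {}  # (identifier, window_id) -> request count

    def is_allowed(self, identifier: str, now: float) -> bool:
        win_id = math.floor(now / self.window)
        elapsed_frac = (now % self.window) / self.window
        curr = self.counts.get((identifier, win_id), 0)
        prev = self.counts.get((identifier, win_id - 1), 0)
        # Weight the previous window by the share of it still inside the sliding window
        estimate = prev * (1 - elapsed_frac) + curr
        if estimate < self.limit:
            self.counts[(identifier, win_id)] = curr + 1
            return True
        return False
```

With limit=3 and window=60, three requests at t=0 are allowed and the fourth is denied. At t=90 the previous window counts for half, so two more requests pass before the estimate reaches the limit again.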

Long-term (capacity planning):

Memory per algorithm at 500K users with limit=500/min:
  Sliding Window Log:      50 GB    (avoid at scale)
  Sliding Window Counter:  30 MB    (use this)
  Token Bucket:            20 MB    (also fine)
  Fixed Window Counter:    15 MB    (also fine)

Rule: Always calculate memory at target scale BEFORE choosing algorithm.
Memory = users x entries_per_user x bytes_per_entry
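
The rule is simple enough to script as a pre-launch check. The sketch below plugs in the figures from this scenario; the function names are illustrative, and the per-entry byte counts are the article's rough estimates rather than measured Redis overhead.

```python
def estimate_memory_bytes(users: int, entries_per_user: int, bytes_per_entry: int) -> int:
    """Memory = users x entries_per_user x bytes_per_entry."""
    return users * entries_per_user * bytes_per_entry


def human(n_bytes: int) -> str:
    """Format a byte count using decimal units (1 GB = 10**9 bytes)."""
    for unit, size in (("GB", 10**9), ("MB", 10**6), ("KB", 10**3)):
        if n_bytes >= size:
            return f"{n_bytes / size:.0f} {unit}"
    return f"{n_bytes} B"


# Figures taken from the scenario above
log_worst_case = estimate_memory_bytes(5_000_000, 500, 20)  # every user ever seen, full log
counter = estimate_memory_bytes(500_000, 2, 30)             # two small keys per active user

print(human(log_worst_case))  # 50 GB
print(human(counter))         # 30 MB
```

Running the same function with the expected user count for next year, not today's, is the point of the exercise.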

Challenge 2: Rate Limiter Becomes the Bottleneck

What Happened

A payment processing API with strict latency SLAs (p99 < 100ms) deployed Redis-based
rate limiting. Under normal load everything was fine. During peak (holiday season,
10x traffic), p99 latency jumped from 80ms to 350ms. Investigation revealed that
Redis rate limit checks were taking 40-90ms at p99 due to Redis CPU saturation.

Warning Signs

  • API p99 latency spikes correlate exactly with rate limit check timing
  • Redis CPU usage consistently above 70%
  • Redis SLOWLOG shows rate limit commands taking >10ms
  • Rate limit checks showing in distributed traces as the slowest span

Root Cause

Normal load:   1,000 RPS   x 1 Redis round trip = 1,000 Redis ops/sec (Redis handles fine)
Peak load:    10,000 RPS   x 1 Redis round trip = 10,000 Redis ops/sec (Redis at 80% CPU)

Plus: Each rate limit check requires a pipeline of 2-3 commands
Plus: Lua script execution
Plus: Network RTT (2ms average, 45ms p99 under load due to queuing)

Diagnosis

# Check Redis operations per second
redis-cli INFO stats | grep instantaneous_ops_per_sec
 
# Check Redis CPU
redis-cli INFO cpu
 
# Check slow commands
redis-cli SLOWLOG GET 25
 
# Check command latency
redis-cli --latency-history -i 1
 
# Find most frequent rate limit operations
redis-cli MONITOR | grep "rl:" | head -100
# (Use MONITOR only briefly - it impacts Redis performance)

The Fix

Option 1: Pipeline batching (quick win)

# Before: 2 Redis round trips per request (INCR + EXPIRE)
count = r.incr(key)
r.expire(key, 60)
 
# After: 1 Redis round trip (both commands sent in one batch; not atomic with transaction=False)
pipe = r.pipeline(transaction=False)  # transaction=False = no MULTI/EXEC overhead
pipe.incr(key)
pipe.expire(key, 60)  # Caveat: refreshes the TTL on every request; on Redis 7.0+, expire(key, 60, nx=True) sets it only once
count, _ = pipe.execute()
# Halves the network round trips per check.

Option 2: Local cache for non-critical checks

import time
from cachetools import TTLCache
 
# Cache rate limit decisions for 500ms
# Each process makes at most one Redis call per user per 500ms, however many requests that user sends
_local_cache = TTLCache(maxsize=50_000, ttl=0.5)
 
def is_allowed_cached(user_id: str) -> bool:
    cached = _local_cache.get(user_id)
    if cached is not None:
        return cached  # Use cached decision without Redis call
 
    # Cache miss: go to Redis
    result = redis_rate_limiter.is_allowed(user_id)
    _local_cache[user_id] = result["allowed"]
    return result["allowed"]
 
# Trade-off: a user who hits the limit may get up to 500ms of extra requests (cached "allowed"),
# and requests answered from the cache are not counted in Redis at all.
# Acceptable for general APIs. Not acceptable for payment APIs.

Option 3: Async rate limiting for non-critical endpoints

import asyncio
 
# For endpoints where slight over-limit is acceptable:
# Check rate limit asynchronously, don't block the request
async def rate_limit_async(user_id: str, handler):
    # Start handler immediately (no blocking)
    handler_task = asyncio.create_task(handler())
    # Check rate limit concurrently
    rl_task = asyncio.create_task(redis_check_async(user_id))
 
    result = await handler_task
    rl_result = await rl_task
 
    if not rl_result["allowed"]:
        # Request already processed, but log the violation
        # Could add to a "debt" bucket to reduce future allowance
        metrics.increment("rate_limit.post_hoc_denial")
 
    return result

Option 4: Rate limiting at a higher layer

Move rate limiting from application code to Nginx or API Gateway.
Nginx limit_req module performs rate limiting in C, without Python GIL,
without Redis round trips (local in Nginx shared memory), sub-millisecond.
For user-level limits that still need Redis: shard Redis by user ID hash.
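
Sharding by user ID hash can be sketched as follows. The shard addresses are hypothetical, and plain modulo placement is an assumption chosen for brevity: it remaps most users when the shard count changes, which is usually tolerable for counters that live only a minute or two.

```python
import hashlib


def shard_index(user_id: str, num_shards: int) -> int:
    """Map a user ID to a shard with a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and would route the same user differently
    on each application server.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical shard addresses; each would back its own Redis client
SHARDS = ["redis-rl-0:6379", "redis-rl-1:6379", "redis-rl-2:6379", "redis-rl-3:6379"]


def shard_for(user_id: str) -> str:
    """Return the address of the shard that owns this user's counters."""
    return SHARDS[shard_index(user_id, len(SHARDS))]
```

Every application server computes the same shard for the same user, so each user's counter lives on exactly one Redis instance and the per-instance load drops roughly in proportion to the shard count.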

Challenge 3: Split-Brain During Redis Failover

What Happened

A Redis Sentinel setup with 1 primary and 2 replicas experienced a network partition.
Sentinel promoted replica-A to primary because it could no longer see the original primary.
For 12 seconds, both old-primary and replica-A accepted writes (split-brain).
Rate limit counters diverged. After partition healed, users reported they had been
allowed 2x their rate limit during the incident window.

Technical Breakdown

Before partition:
  Primary: counter = 45 (45 requests processed)
  Replica-A: counter = 44 (1 write behind due to async replication lag)

During partition (12 seconds):
  Primary (isolated): receives 30 more requests -> counter = 75
  Replica-A (promoted): receives 30 requests -> counter = 74 (starts from 44)
  Total allowed during partition = 60 requests, but only 55 remained (limit = 100): 45 + 60 = 105, a breach neither counter shows

After partition heals:
  Replica-A wins (it's the new primary)
  Old primary demoted to replica, syncs from new primary
  Old primary loses its 30 new writes
  Counter = 74 (not 75), although 105 requests were actually served
  Those 30 requests from old primary are "forgotten"
  Users who made requests to old primary get "extra" allowance
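
The bookkeeping in this breakdown can be replayed in a few lines. This is only arithmetic on the figures above, with variable names chosen for illustration.

```python
LIMIT = 100

served_before = 45    # requests counted on the primary before the partition
replica_counter = 44  # replica is one write behind (async replication)

# During the partition each side accepts 30 requests independently
old_primary_counter = served_before + 30    # 75
new_primary_counter = replica_counter + 30  # 74

remaining_before = LIMIT - served_before    # 55 requests were still available
served_during = 30 + 30                     # 60 requests were actually allowed
total_served = served_before + served_during

# After healing, the promoted replica's value wins
surviving_counter = new_primary_counter
undercount = total_served - surviving_counter

print(f"allowed during partition: {served_during} (only {remaining_before} remained)")
print(f"surviving counter: {surviving_counter}, actually served: {total_served}, "
      f"undercount: {undercount}")
```

The surviving counter is 31 requests short of what was really served: 30 writes lost on the old primary plus the 1 write of replication lag.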

The Fix

Structural fix: Refuse writes on a primary that has lost its replicas

This uses the min-replicas-to-write setting, which applies to any replicated primary
(Sentinel-managed or Redis Cluster):

# Redis config
min-replicas-to-write 1
min-replicas-max-lag 10

# With this config: primary REFUSES writes once fewer than 1 replica has
# acknowledged within the last 10 seconds
# During partition: the isolated primary stops accepting rate limit INCRs after ~10 seconds
# Users then get errors (not double allowance); the split-brain window shrinks but does not vanish
# Trade-off: reduced availability during partition vs correctness

Accept the inconsistency with monitoring:

For most APIs, 2x limit for 12 seconds during a rare Redis failover is acceptable.
The right response is:

import time

class RateLimiterWithIncidentAwareness:
    def __init__(self, redis_limiter):
        self.redis_limiter = redis_limiter
        # Set to time.time() by the Sentinel pub/sub listener when a failover is seen
        self._failover_detected_at = None

    def is_allowed(self, identifier: str) -> dict:
        result = self.redis_limiter.is_allowed(identifier)
        result["data_source"] = "primary"

        # If we know we are in a post-failover state, note it
        if self._is_post_failover_window():
            result["note"] = "post_failover_approximate"
            metrics.increment("rate_limit.post_failover_check")

        return result

    def _is_post_failover_window(self) -> bool:
        # Sentinel notifies app of failover via pub/sub
        # App sets a flag for 60 seconds after failover event
        return self._failover_detected_at is not None and \
               time.time() - self._failover_detected_at < 60

Challenge 4: IP Rotation Attack Bypasses Per-IP Limits

What Happened

A public API enforcing 100 requests/minute per IP was targeted by a scraper with access
to a /14 IPv4 block (262,144 IP addresses). The attacker sent ~5 requests per IP per
minute, staying under each IP's individual limit. Effective throughput: ~22,000 RPS.
The system ran out of database connections within 3 minutes.

Warning Signs

  • Many unique IPs each making exactly 1-5 requests, then never seen again
  • All source IPs from the same ASN (Autonomous System Number) or IP range
  • Requests arrive at a suspiciously uniform rate (e.g. exactly one request per IP every 12 seconds = scripted)
  • Content being scraped shows clear pattern (all product listings, all prices)
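
The "same IP range" signal can be checked by grouping distinct client addresses by their enclosing subnet. The sketch below uses only the standard library and a synthetic sample; the function name and the /24 default are assumptions, and a real check would read addresses from the access log.

```python
import ipaddress
from collections import Counter


def count_by_subnet(ips, prefix_len: int = 24) -> Counter:
    """Count distinct IPv4 client addresses per enclosing subnet."""
    counts = Counter()
    for ip in set(ips):  # distinct addresses only
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[str(network)] += 1
    return counts


# Synthetic sample: 200 rotating addresses in one /24 plus two unrelated clients
sample = [f"203.0.113.{i}" for i in range(1, 201)] + ["198.51.100.7", "192.0.2.44"]
top_subnet, distinct_ips = count_by_subnet(sample).most_common(1)[0]
print(top_subnet, distinct_ips)  # 203.0.113.0/24 200
```

A subnet contributing hundreds of distinct addresses, each with only a handful of requests, is the pattern described above.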

Diagnosis

# Find top client IPs by request count in nginx logs
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -20
# If the top IPs cluster in the same range or ASN, it's likely a rotating attack

# Check with whois
whois 203.0.113.0 | grep -iE "netname|country|org"

# Count unique IPs per 10-minute bucket (timestamp truncated to dd/Mon/yyyy:HH:M)
awk '{print substr($4, 2, 16), $1}' access.log | sort -u | awk '{print $1}' | uniq -c

The Fix

Layer 1: ASN-level blocking at CDN

Cloudflare: Security -> WAF -> Tools -> IP Access Rules
Action: Block
Value: AS12345 (the ASN of the attacker)
Notes: Only do this if the entire ASN is malicious. ISPs share ASNs with many customers.

Layer 2: Behavioral fingerprinting

class BehavioralRateLimiter:
    """
    Detects IP rotation attacks by tracking behavioral signals across requests.
    """
 
    def get_request_fingerprint(self, request) -> str:
        """
        Combine signals that are hard to rotate:
        - TLS fingerprint (JA3 hash): hard to change per request
        - HTTP/2 settings fingerprint: browser-specific, hard to fake
        - Accept-Language, Accept-Encoding: usually consistent per client
        - User-Agent: easy to fake but often consistent for bot libraries
        """
        import hashlib
        ja3 = request.headers.get("X-JA3-Fingerprint", "unknown")
        ua = request.headers.get("User-Agent", "")[:50]
        accept = request.headers.get("Accept", "")[:20]
        lang = request.headers.get("Accept-Language", "")[:10]
 
        fingerprint_raw = f"{ja3}:{ua}:{accept}:{lang}"
        return hashlib.sha256(fingerprint_raw.encode()).hexdigest()[:16]
 
    def is_allowed_with_fingerprint(self, request) -> bool:
        ip = get_real_ip(request)
        fingerprint = self.get_request_fingerprint(request)
 
        # Rate limit by fingerprint regardless of IP
        fp_key = f"rl:fp:{fingerprint}"
        fp_count = redis.incr(fp_key)
        if fp_count == 1:
            redis.expire(fp_key, 60)  # Start the 60s window on the first hit only
        if fp_count > 100:
            return False  # Same browser/tool, rotating IPs -> blocked

        # Also check per-IP
        ip_key = f"rl:ip:{ip}"
        ip_count = redis.incr(ip_key)
        if ip_count == 1:
            redis.expire(ip_key, 60)
        if ip_count > 20:
            return False
 
        return True

Layer 3: Rate limit by subnet

import ipaddress
 
def get_subnet_key(ip: str, prefix_len_v4: int = 24, prefix_len_v6: int = 48) -> str:
    ip_obj = ipaddress.ip_address(ip)
    if ip_obj.version == 4:
        network = ipaddress.ip_network(f"{ip}/{prefix_len_v4}", strict=False)
    else:
        network = ipaddress.ip_network(f"{ip}/{prefix_len_v6}", strict=False)
    return f"rl:subnet:{network.network_address}"
 
# Rate limit by /24 subnet (all 256 IPv4 addresses in the block share one counter)
# Even if attacker rotates within /24, they share the counter
subnet_key = get_subnet_key(client_ip)
subnet_count = redis.incr(subnet_key)
if subnet_count == 1:
    redis.expire(subnet_key, 60)  # Start the window once instead of extending it on every request

Challenge 5: Carrier Grade NAT Blocking Thousands of Legitimate Users

What Happened

A mobile API using IP-based rate limiting started receiving customer complaints that
"the API is broken" for users on one major mobile carrier. Investigation revealed that
the carrier deployed Carrier Grade NAT (CGN), putting 50,000 mobile users behind 64
shared IP addresses. When any user on a shared IP hit the rate limit, everyone else
behind that IP was blocked too.

Technical Details

CGN: 50,000 mobile users share 64 IP addresses
Per-IP limit: 100 requests/minute

User A sends 100 requests -> IP 203.0.113.1 counter = 100 -> limit hit
User A gets 429.
But also User B, C, D... through User N (all sharing 203.0.113.1) get 429.
50,000 users / 64 IPs = ~781 users per IP, so a single user blocks roughly 780 others.
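
The same figures also show how little of the per-IP budget each user gets even when nobody misbehaves. A short calculation, using only the numbers above:

```python
users = 50_000
shared_ips = 64
per_ip_limit = 100  # requests/minute

users_per_ip = users / shared_ips
fair_share = per_ip_limit / users_per_ip

print(f"users per shared IP: {users_per_ip:.0f}")             # 781
print(f"fair share per user: {fair_share:.3f} requests/min")  # 0.128
```

At roughly one request every eight minutes per user, ordinary usage alone keeps every shared IP at its limit.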

The Fix

Primary fix: Move to user-based rate limiting
This is the definitive solution. Require authentication and rate limit by user ID.

def get_rate_limit_identifier(request) -> tuple[str, str]:
    """
    Returns (identifier, identifier_type).
    Prefer specific identifiers over generic ones.
    """
    # 1. Best: Authenticated user ID (from JWT or session)
    user_id = extract_user_id_from_auth(request)
    if user_id:
        return f"user:{user_id}", "user"
 
    # 2. Good: API key
    api_key = request.headers.get("X-API-Key")
    if api_key:
        return f"apikey:{hash_key(api_key)}", "apikey"
 
    # 3. Fallback: IP (but with higher limit to account for NAT)
    ip = get_real_ip(request)
    return f"ip:{ip}", "ip"
 
IP_BASED_LIMIT = 500     # Higher limit for IP (accounts for NAT sharing)
USER_BASED_LIMIT = 100   # Tighter limit for identified users
 
def get_limit_for_identifier(identifier_type: str) -> int:
    return IP_BASED_LIMIT if identifier_type == "ip" else USER_BASED_LIMIT

Detect known CGN ranges and apply higher limits:

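import ipaddress

# RFC 6598 shared address space (100.64.0.0/10) is the carrier-internal side of CGN.
# It is not routable on the public internet, so a public API normally sees the
# carrier's public egress IPs instead; the carrier list below does most of the work.
CGN_RANGES = ["100.64.0.0/10"]
MOBILE_CARRIER_RANGES = [
    # Load from a database of known shared NAT ranges
    # These change frequently - use a service like MaxMind or IP2Location
]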
 
def is_cgn_address(ip: str) -> bool:
    ip_obj = ipaddress.ip_address(ip)
    return any(
        ip_obj in ipaddress.ip_network(r)
        for r in CGN_RANGES + MOBILE_CARRIER_RANGES
    )
 
def get_ip_limit(ip: str) -> int:
    if is_cgn_address(ip):
        return 5_000  # Shared NAT: 50x higher limit
    return 100        # Regular IP

Challenge 6: Clock Skew Causing Window Boundary Gaps

What Happened

A fixed-window rate limiter showed a strange pattern: users could make slightly more than
their 100-request/minute limit when the requests were sent near the 60-second boundary.
Forensic analysis of logs showed that 3 of 10 application servers had clocks drifted by
800ms-1200ms from the others.

Root Cause

Server A clock: 12:00:00.000 (on time)
Server B clock: 12:00:00.900 (900ms ahead)

At wall clock time 11:59:59.500:
  Server A says: "Still in window 11:59:00-12:00:00" -> increments window A counter
  Server B says: "Already in window 12:00:00-12:01:00" -> increments window B counter

User sends 60 requests in the last 500ms of the minute, routed between A and B:
  30 requests -> Server A -> counter for window 11:59 += 30
  30 requests -> Server B -> counter for window 12:00 = 30 (charged to the next minute)

User sends 60 requests in first 500ms of new minute:
  Both servers now agree on window 12:00 -> its counter goes from 30 to 90

Of the 60 requests sent before the boundary, only 30 were charged to the minute they
arrived in. A user who has already exhausted the 11:59 window can therefore keep sending
through Server B for up to 900ms before the real boundary, exceeding the limit for that minute.
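
The disagreement is easy to reproduce by computing the window ID the way each server would. In this sketch the clock offsets mirror the example, while the boundary at t = 600 seconds is an arbitrary choice for illustration.

```python
WINDOW = 60  # seconds


def window_id(true_time: float, clock_offset: float = 0.0) -> int:
    """Window a server assigns a request to, given its clock offset in seconds."""
    return int((true_time + clock_offset) // WINDOW)


# True time 500ms before a minute boundary (the boundary is at t = 600s)
t = 599.5
server_a = window_id(t, clock_offset=0.0)  # on time
server_b = window_id(t, clock_offset=0.9)  # 900ms ahead

print(server_a, server_b)  # 9 10
```

Two requests arriving at the same instant land in different counters, which is exactly the gap a client can exploit near the boundary.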

The Fix

Use Redis server time in Lua scripts:

-- BROKEN: Uses client-supplied time (from a potentially drifted server)
local now = tonumber(ARGV[1])  -- application server time (may be drifted)
local window_id = math.floor(now / 60)
 
-- CORRECT: Use Redis server time (single source of truth)
local time_result = redis.call('TIME')
local now = tonumber(time_result[1]) + tonumber(time_result[2]) / 1e6
local window_id = math.floor(now / 60)
-- All rate limit checks on this Redis instance use identical time
-- Clock skew between application servers no longer matters

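Important caveat: redis.call('TIME') makes the Lua script non-deterministic. Redis versions
older than 5.0 replicate scripts verbatim by default and reject a script that writes after
calling TIME unless it first calls redis.replicate_commands(). From Redis 5.0 onward, scripts
replicate their effects by default and the restriction no longer applies. The recommended approach:

  • For Redis 5.0+ (standalone, Sentinel, or Cluster): Use TIME in Lua
  • For older Redis with replicas: Call redis.replicate_commands() first, or pass time from client and ensure NTP synchronization
  • Monitor clock drift: chronyc tracking | grep "System time"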

Challenge 7: Thundering Herd After Maintenance Window

What Happened

A major API platform performed scheduled maintenance from 2:00 AM to 2:30 AM. The
maintenance page returned 503. All clients backed off. At exactly 2:30 AM when service
resumed, 200,000 clients simultaneously sent their pent-up requests. The database
connection pool exhausted in 4 seconds. Service fell over again immediately.

The Timeline

02:00 AM: Maintenance starts. Service returns 503.
02:00-02:30: Clients back off. Most are sleeping with Retry-After=1800 (30 minutes).
02:30 AM: Service restores. Returns 200 OK.
02:30:00: 200,000 clients wake up and hammer the API simultaneously.
02:30:04: DB connection pool exhausted (1000 connections, 200,000 requests = 200x capacity)
02:30:05: Service falls over again.
02:30:05-02:45: Cascading restart loop, escalating outage.

The Fix

Gradual ramp-up with a "recovery" rate limit:

class RecoveryAwareRateLimiter:
    """
    During recovery from an outage, temporarily reduce effective limits
    to give the system time to warm up.
    """
 
    def __init__(self, normal_limiter, recovery_duration: int = 300):
        self.normal_limiter = normal_limiter
        self.recovery_duration = recovery_duration  # 5-minute ramp
        self._recovery_start = None
 
    def signal_recovery_start(self):
        self._recovery_start = time.time()
 
    def get_recovery_multiplier(self) -> float:
        if self._recovery_start is None:
            return 1.0
 
        elapsed = time.time() - self._recovery_start
        if elapsed >= self.recovery_duration:
            self._recovery_start = None
            return 1.0
 
        # Linear ramp: 10% at recovery start, 100% after 5 minutes
        progress = elapsed / self.recovery_duration
        return 0.1 + (0.9 * progress)
 
    def is_allowed(self, identifier: str, base_limit: int) -> bool:
        multiplier = self.get_recovery_multiplier()
        effective_limit = max(1, int(base_limit * multiplier))
        return self.normal_limiter.is_allowed(identifier, limit=effective_limit)

Staggered Retry-After during maintenance:

import hashlib
import time

MAINTENANCE_END_EPOCH = 1735698600  # 2025-01-01T02:30:00Z

def maintenance_response(request) -> Response:
    # Give clients DIFFERENT retry times (jittered) so they don't all come back at once
    user_id = extract_user_id(request) or request.remote_addr
    # Deterministic jitter: same user always gets same offset (consistent experience)
    user_hash = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    # Seconds remaining until maintenance ends (not a fixed 30 minutes for late arrivals)
    base_retry = max(0, int(MAINTENANCE_END_EPOCH - time.time()))
    jitter = user_hash % 600  # Up to 10 minutes of jitter
    retry_after = base_retry + jitter

    return Response(
        status=503,
        headers={
            "Retry-After": str(retry_after),
            "X-Maintenance-End": "2025-01-01T02:30:00Z"
        }
    )
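
To see what the jitter buys, the hash-based offset can be applied to a synthetic population the size of the one in the timeline. The client IDs below are made up for illustration, and the helper name jitter_seconds is an assumption.

```python
import hashlib
from collections import Counter


def jitter_seconds(client_id: str, spread: int = 600) -> int:
    """Deterministic per-client offset in [0, spread)."""
    digest = hashlib.sha256(client_id.encode()).hexdigest()
    return int(digest, 16) % spread


# 200,000 synthetic client IDs, bucketed by the second in which they would retry
retries_per_second = Counter(jitter_seconds(f"client-{i}") for i in range(200_000))

print("distinct retry seconds:", len(retries_per_second))
print("worst single second:", max(retries_per_second.values()))
```

Instead of 200,000 clients returning in the same second, the load is spread across the 10-minute window at a few hundred clients per second, which a 1,000-connection pool has a realistic chance of absorbing.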

Challenge 8: Marketing Campaign Triggers False Positive Rate Limiting

What Happened

A company launched a product on Hacker News (HN front page). Within 20 minutes, 50,000
unique users visited the site simultaneously. 80% of those came from a small set of IPs
(HN link aggregator bots, VPN services, corporate proxies). The IP-based rate limiter
blocked most of the traffic, killing the viral moment. The CEO was furious.

The Fix

Shadow mode before launch:

# During the 48 hours before a campaign launch, run in shadow mode:
class ShadowRateLimiter:
    def __init__(self, limiter):
        self.limiter = limiter
        self.shadow_log = []
 
    def is_allowed_shadow(self, identifier: str) -> bool:
        result = self.limiter.is_allowed(identifier)
        if not result["allowed"]:
            # Would have been denied - log it, but allow anyway
            self.shadow_log.append({
                "time": time.time(),
                "identifier": identifier,
                "count": result["count"],
                "limit": result["limit"]
            })
        return True  # Shadow mode: always allow; denials are only logged
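
Once the shadow run is over, the log is only useful if someone reads it. A minimal sketch of a summary step, assuming entries have the same shape that ShadowRateLimiter appends (the function name is illustrative):

```python
from collections import Counter


def summarize_shadow_log(shadow_log, top_n: int = 10):
    """Return the identifiers that would have been denied most often."""
    would_deny = Counter(entry["identifier"] for entry in shadow_log)
    return would_deny.most_common(top_n)


# Synthetic entries in the same shape ShadowRateLimiter appends
log = [
    {"time": 0.0, "identifier": "ip:198.51.100.7", "count": 101, "limit": 100},
    {"time": 0.1, "identifier": "ip:198.51.100.7", "count": 102, "limit": 100},
    {"time": 0.2, "identifier": "ip:203.0.113.9", "count": 101, "limit": 100},
]
print(summarize_shadow_log(log))
# [('ip:198.51.100.7', 2), ('ip:203.0.113.9', 1)]
```

If the top entries turn out to be corporate proxies or VPN exits rather than abusers, the limit or the identifier needs to change before launch day.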

Tier the response instead of hard block:

def handle_request_gracefully(request) -> Response:
    identifier, id_type = get_rate_limit_identifier(request)
    result = rate_limiter.check(identifier)

    if result["allowed"]:
        return normal_handler(request)

    if id_type == "ip":
        # Over the IP limit: serve cached/degraded response instead of 429
        cached_response = cache.get_stale(request.path)
        if cached_response:
            return Response(
                body=cached_response,
                status=200,
                headers={
                    "X-Cache": "STALE",
                    "X-RateLimit-Warning": "Over limit, serving cached response"
                }
            )

    # No degraded response available: fall back to a hard limit
    # (Retry-After of 60 assumes a 60-second window)
    return Response(status=429, headers={"Retry-After": "60"})

Challenge 9: Blue-Green Deployment Resets Rate Limit State

What Happened

A team used in-memory rate limiting (not Redis). Their deployment process was:

  1. Deploy new version (green) alongside old version (blue)
  2. Route 100% traffic to green
  3. Terminate blue

When green started, all in-memory rate limit counters were zero. Users who had just hit
their limit on blue were suddenly able to make fresh requests on green. Sophisticated
users discovered they could reset their rate limits by sending a request that triggered
a deployment or by waiting for the next deployment cycle.

The Fix

# WRONG: In-memory state resets on deployment
class InMemoryRateLimiter:
    def __init__(self):
        self._counters = {}  # LOST on every deployment
 
# CORRECT: All state in Redis (survives any number of deployments)
class RedisRateLimiter:
    def __init__(self, redis_client):
        self.r = redis_client
    # State lives in Redis, not in the application process.
    # Blue-green, rolling, canary deployments all work transparently.
    # Even if ALL instances are restarted simultaneously, state survives.

For the transition period (migrating from in-memory to Redis):

# Dual-write during migration: write to both, read from Redis
class MigratingRateLimiter:
    def __init__(self, local_limiter, redis_limiter):
        self.local = local_limiter
        self.redis = redis_limiter
 
    def is_allowed(self, identifier: str) -> bool:
        local_result = self.local.is_allowed(identifier)
        redis_result = self.redis.is_allowed(identifier)
        # During migration: use Redis result (more accurate)
        # Log when they disagree
        if local_result != redis_result["allowed"]:
            logger.info("migration_disagreement", identifier=identifier,
                       local=local_result, redis=redis_result["allowed"])
        return redis_result["allowed"]

Challenge 10: Database Connection Pool Exhaustion from Rate Limit Metadata Queries

What Happened

The rate limiter needed to look up each user's subscription tier to determine their limit.
This tier lookup queried the main PostgreSQL database on every request. At 5,000 RPS,
this generated 5,000 DB queries/second just for rate limit tier lookups. The database
connection pool (50 connections) was shared with the application. Rate limit lookups
consumed 30+ connections, leaving insufficient connections for actual business logic.
P99 latency for the entire API went from 50ms to 800ms.

Root Cause

5,000 RPS
x 1 tier lookup per request
x 5ms avg DB query time
= 25 connection-seconds per second, i.e. ~25 connections busy on average just for rate limits
+ 50 DB connections shared with application
= DB connection pool exhausted
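
The multiplication above is Little's law: the average number of connections in use equals the arrival rate times the time each query holds a connection. A small sketch with the figures from this incident (the function name is illustrative):

```python
def busy_connections(requests_per_sec: float, query_ms: float) -> float:
    """Average connections in use = arrival rate x time each query holds a connection."""
    return requests_per_sec * query_ms / 1000


POOL_SIZE = 50

avg_busy = busy_connections(5_000, 5)
print(f"{avg_busy:.0f} of {POOL_SIZE} connections busy on average")  # 25 of 50
```

This is an average. Any burst or slow query pushes usage above it, which is consistent with the 30+ connections observed.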

The Fix

Cache tier data in Redis, not in the DB hot path:

import redis

TIER_CACHE_TTL = 300  # Cache tier for 5 minutes

def get_user_tier_cached(user_id: str, r: redis.Redis, db) -> str:
    cache_key = f"tier:{user_id}"

    # Check Redis cache first (sub-millisecond, versus ~5ms for the DB query)
    cached = r.get(cache_key)
    if cached is not None:
        # redis-py returns bytes unless the client was created with decode_responses=True
        return cached.decode() if isinstance(cached, bytes) else cached

    # Cache miss: hit DB (happens at most once per 5 minutes per user)
    tier = db.query("SELECT tier FROM users WHERE id = %s", user_id)
    r.setex(cache_key, TIER_CACHE_TTL, tier)
    return tier

# At 5,000 RPS with 100,000 unique users and 5-minute TTL:
# Requests per TTL period: 5,000 req/s x 300s = 1,500,000
# Cache misses per TTL period: at most 100,000 (one per user) = ~6.7% miss rate, ~93.3% hit rate
# DB queries for tier lookup: 100,000 / 300s = ~333/second (not 5,000/second)
# DB connections busy with tier lookups: ~333 x 5ms = ~1.7 on average (not 30+)

Even better: Pre-warm tier cache on login/token refresh:

def generate_token(user_id: str) -> str:
    tier = db.get_user_tier(user_id)
    token = create_jwt({"user_id": user_id, "tier": tier})
 
    # Cache tier at login time - invalidate on plan change
    r.setex(f"tier:{user_id}", 3600, tier)
    return token
 
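# The tier is in cache for up to an hour after login, so API calls right after login hit a warm cache
# The tier is also embedded in the JWT, so the limiter can read it from the verified token
# DB is only hit when token is generated, not on every API call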

Challenge 11: Internal Service Fan-Out DoS

What Happened

Service A (user-facing API) called Service B (data enrichment service) for every user
request. Service B was designed assuming 100 RPS from Service A. One day, Service A
launched a feature that called Service B 10 times per user request (fetching 10 data
fields instead of 1). Service A was receiving 200 RPS from users, so Service B
received 2,000 RPS instead of 200 RPS and fell over. Service A started timing out.
Users saw 500 errors for 40 minutes until Service B was scaled up.

The Fix

Outbound rate limiting at the caller:

@Service
public class EnrichmentClient {
 
    // Outbound rate limiter: limit calls TO Service B
    private final Bucket outboundBucket = Bucket4j.builder()
        .addLimit(Bandwidth.classic(500, Refill.greedy(500, Duration.ofSeconds(1))))
        .build();
 
    public List<EnrichmentData> enrich(List<String> ids) {
        // Batch: instead of N calls, send 1 call with N IDs
        // (Requires Service B to have a bulk endpoint)
        return batchFetch(ids);
    }
 
    public EnrichmentData enrichOne(String id) {
        // Rate-limited outbound call
        if (!outboundBucket.tryConsume(1)) {
            // Outbound limit hit: return cached/default data instead of erroring
            return getDefaultEnrichment(id);
        }
        return serviceB.get(id);
    }
}

Contract testing between services:

# Service B publishes its rate limit contract
# service-b-contract.yaml
rate_limit:
  max_rps: 500
  max_concurrent_requests: 50
  burst_size: 100
 
# Service A MUST configure its outbound limiter based on Service B's contract
# Contract testing: if Service B changes its limit, Service A's tests catch it
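
One way such a contract test might look on the caller's side is sketched below. The dictionary mirrors the YAML above (a real test would load the published file), and the outbound settings for Service A are hypothetical values chosen for illustration.

```python
# Values mirroring service-b-contract.yaml above
SERVICE_B_CONTRACT = {"max_rps": 500, "max_concurrent_requests": 50, "burst_size": 100}

# Hypothetical outbound limiter settings for Service A
SERVICE_A_OUTBOUND = {"rps": 500, "max_concurrent_requests": 40}


def contract_violations(outbound: dict, contract: dict) -> list:
    """Return human-readable violations; an empty list means the caller respects the contract."""
    violations = []
    if outbound["rps"] > contract["max_rps"]:
        violations.append(f"rps {outbound['rps']} > max_rps {contract['max_rps']}")
    if outbound["max_concurrent_requests"] > contract["max_concurrent_requests"]:
        violations.append("max_concurrent_requests exceeds contract")
    return violations


def test_outbound_limiter_respects_contract():
    assert contract_violations(SERVICE_A_OUTBOUND, SERVICE_B_CONTRACT) == []
```

If Service B lowers max_rps in a new contract version, this test fails in Service A's pipeline before the change reaches production.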

Challenge 12: Background Job Bug Exhausts Entire Daily Quota

What Happened

A data synchronization job had a bug: it had an infinite retry loop when it encountered
a specific error condition. The bug was not caught in testing because the error condition
only occurred with production data. The job started at midnight and by 1 AM had consumed
the entire month's API quota for a paid third-party service. The company received a $12,000
overage bill.

The Fix

Separate quotas for automated vs human callers:

class SegregatedQuotaManager:
 
    QUOTA_SEGMENTS = {
        "human_interactive": {"daily": 100_000, "burn_rate_alert": 5_000},   # per hour
        "batch_jobs":        {"daily": 500_000, "burn_rate_alert": 50_000},   # per hour
        "sync_service":      {"daily": 200_000, "burn_rate_alert": 10_000},   # per hour
    }
 
    def check_quota(self, segment: str) -> bool:
        daily_key = f"quota:{segment}:{self._today()}"
        count = self.r.incr(daily_key)
        if count == 1:
            self.r.expire(daily_key, 86400 * 2)
 
        limit = self.QUOTA_SEGMENTS[segment]["daily"]
        if count > limit:
            self.alert(f"Quota exhausted for {segment}")
            return False
 
        # Burn rate alert: count this request in the current hour's bucket
        hourly_key = f"quota_hr:{segment}:{self._current_hour()}"
        hourly = self.r.incr(hourly_key)
        if hourly == 1:
            self.r.expire(hourly_key, 3600 * 2)
        alert_threshold = self.QUOTA_SEGMENTS[segment]["burn_rate_alert"]
        if hourly > alert_threshold:
            self.alert(f"High burn rate for {segment}: {hourly}/hour (alert at {alert_threshold})")
 
        return True

Circuit breaker on infinite-loop detection:

class JobRateLimiter:
    """
    Detects runaway jobs by tracking per-job-instance request rate.
    """
    def __init__(self, r, max_rpm_per_job: int = 60):
        self.r = r
        self.max_rpm = max_rpm_per_job
 
    def check_job(self, job_id: str) -> bool:
        # A job that was already killed stays blocked until the kill flag expires
        if self.r.exists(f"job_killed:{job_id}"):
            return False
 
        key = f"job_rl:{job_id}:{int(time.time())//60}"
        count = self.r.incr(key)
        if count == 1:
            self.r.expire(key, 120)
 
        if count > self.max_rpm:
            # This job instance exceeded max_rpm requests in one minute - likely a bug
            self.r.set(f"job_killed:{job_id}", 1, ex=3600)
            alert_ops(f"Job {job_id} killed: {count} RPM (max {self.max_rpm})")
            return False
 
        return True

Challenge 13: Multi-Tenant Quota Bleeding

What Happened

A SaaS platform had 100 customers sharing infrastructure, including a shared "API quota"
for a third-party AI service. One large customer (paying $500/month) ran a bulk batch job
that consumed 90% of the shared AI API quota. All other 99 customers (also paying
$500/month each) could no longer use the AI features for the rest of the day.
The support queue received 200 tickets. Churn rate doubled that week.

The Fix

Hard per-tenant resource isolation:

class TenantQuotaManager:
    """
    Each tenant has completely isolated quotas.
    One tenant cannot consume another tenant's resources.
    """
 
    def __init__(self, r, plan_quotas: dict):
        self.r = r
        self.plan_quotas = plan_quotas
 
    def check_tenant_quota(self, tenant_id: str, resource: str) -> dict:
        plan = self.get_tenant_plan(tenant_id)
        quota = self.plan_quotas[plan][resource]
        today = time.strftime("%Y-%m-%d")
        key = f"quota:tenant:{tenant_id}:{resource}:{today}"
 
        count = self.r.incr(key)
        if count == 1:
            self.r.expire(key, 86400 * 2)
 
        return {
            "allowed": count <= quota,
            "used": count,
            "limit": quota,
            "remaining": max(0, quota - count),
            "tenant": tenant_id  # Each tenant has their OWN counter
        }
 
    def get_tenant_quota_usage(self, tenant_id: str) -> dict:
        today = time.strftime("%Y-%m-%d")
        pattern = f"quota:tenant:{tenant_id}:*:{today}"
        usage = {}
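        # Assumes the client was created with decode_responses=True (str keys)
        # and that resource names contain no ":"
        for key in self.r.scan_iter(match=pattern):
            resource = key.split(":")[3]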
            usage[resource] = int(self.r.get(key) or 0)
        return usage

Challenge 14: Lua Script Timeout Under Load

What Happened

A rate limiter used a Lua script to implement GCRA (Generic Cell Rate Algorithm). The
script worked perfectly for months. During a traffic spike (5x normal), Redis CPU
hit 100% and script executions began exceeding the lua-time-limit (5 seconds). Redis does
not terminate such scripts on its own; it starts answering other clients with BUSY
errors. All rate limit checks started returning errors. The fail-open policy kicked in.
For 8 minutes, there were no rate limits at all.

Root Cause

Normal:    1,000 Lua executions/second -> Redis CPU 20%
Spike:     5,000 Lua executions/second -> Redis CPU 100%
Each Lua execution blocks Redis for 50-200 microseconds
At 100% CPU: executions queue up -> execution time grows -> timeout hit
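
The numbers above can be checked with a short calculation. Redis executes scripts on a single thread, so the share of each second spent inside Lua is roughly executions per second multiplied by time per execution. This is a minimal sketch that assumes the worst-case 200 microseconds from the figures above; the function name is illustrative.

```python
def lua_utilization(executions_per_second: float, script_time_us: float) -> float:
    """Fraction of one second of single-threaded time spent running scripts."""
    return executions_per_second * script_time_us / 1_000_000


normal = lua_utilization(1_000, 200)   # normal traffic
spike = lua_utilization(5_000, 200)    # 5x spike

print(f"normal: {normal:.0%}, spike: {spike:.0%}")
```

This prints 20% for normal traffic and 100% for the spike. Once utilization reaches 1.0, scripts arrive faster than they complete, so the backlog and the observed latency keep growing until load drops.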

The Fix

Profile and simplify the Lua script:

# Redis slow log for Lua
redis-cli CONFIG SET slowlog-log-slower-than 500  # 0.5ms threshold
redis-cli SLOWLOG GET 25
 
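# Sample round-trip latency in 1-second windows while the scripts run under load
redis-cli --latency-history -i 1
# Round trip of a no-op command as a rough baseline (DEBUG may be disabled on Redis 7+)
redis-cli DEBUG SLEEP 0
 
# Run a trivial script to separate EVAL overhead from your script's own work
redis-cli EVAL "return redis.call('SET', KEYS[1], ARGV[1])" 1 testkey testval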

Replace complex GCRA Lua with simpler sliding window counter:

-- BEFORE: Complex GCRA with floating point math (slower)
-- ARGV[1] = now, ARGV[2] = emission interval, ARGV[3] = burst offset, ARGV[4] = key TTL (seconds)
local tat = tonumber(redis.call('GET', KEYS[1]) or '0')
local now = tonumber(ARGV[1])
local emission_interval = tonumber(ARGV[2])
local burst_offset = tonumber(ARGV[3])
local new_tat = math.max(now, tat) + emission_interval
local allowed_at = new_tat - burst_offset
if allowed_at <= now then
    redis.call('SETEX', KEYS[1], tonumber(ARGV[4]), tostring(new_tat))
    return {1, new_tat - now}
end
return {0, allowed_at - now}
-- Execution time: ~150 microseconds
 
-- AFTER: Simple sliding window counter (faster)
-- ARGV[1] = limit, ARGV[2] = window size (seconds), ARGV[3] = current time (seconds)
-- In Redis Cluster, put a hash tag in KEYS[1] so k1 and k2 land in the same slot
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local idx = math.floor(now / window)
local k1 = KEYS[1] .. ':' .. idx          -- current window
local k2 = KEYS[1] .. ':' .. (idx - 1)    -- previous window
local c1 = tonumber(redis.call('GET', k1) or '0')
local c2 = tonumber(redis.call('GET', k2) or '0')
local w = (now % window) / window         -- fraction of the current window elapsed
if c1 + c2 * (1 - w) < limit then
    redis.call('INCR', k1)
    redis.call('EXPIRE', k1, window * 2)
    return 1
end
return 0
-- Execution time: ~50 microseconds. 3x faster.
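
To make the weighting in the AFTER script easier to test, here is an in-memory Python mirror of the same check. It is a sketch, not the production code: the `counters` dict stands in for Redis, and the function name and key layout are illustrative assumptions.

```python
import math


def sliding_window_allows(counters: dict, key: str, limit: int,
                          window: int, now: float) -> bool:
    """In-memory mirror of the sliding window counter check.

    `counters` stands in for Redis: it maps "<key>:<window index>" to a count.
    """
    index = math.floor(now / window)
    current_key = f"{key}:{index}"
    previous_key = f"{key}:{index - 1}"
    current = counters.get(current_key, 0)
    previous = counters.get(previous_key, 0)
    # Fraction of the current window that has already elapsed
    elapsed = (now % window) / window
    # The previous window only counts for the part still inside the sliding window
    if current + previous * (1 - elapsed) < limit:
        counters[current_key] = current + 1
        return True
    return False


counters = {"rl:u1:9": 80}  # 80 requests in the previous window (index 9)
# now=615 with window=60 -> index 10, 25% of the way into the current window
results = [sliding_window_allows(counters, "rl:u1", 100, 60, 615.0) for _ in range(50)]
print(sum(results))
```

With 80 requests in the previous window weighted at 75%, 60 of the 100 slots are already taken, so only 40 of the 50 attempts are allowed.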

Challenge 15: Missing Retry-After Causes Self-Inflicted Retry Storm

What Happened

An internal service received 429 responses from a dependency service that had just
deployed a new rate limiter. The new rate limiter returned 429 but did NOT include the
Retry-After header. The calling service had this logic:

if response.status_code == 429:
    time.sleep(1)  # Default: retry after 1 second
    return self.retry(request)

The rate limit window was 60 seconds. Clients retried every 1 second, generating 60
retries per blocked request. The dependency service now received 60x more requests from
the retry storm than from original traffic. Its rate limiter blocked even more traffic.
A feedback loop: more rate limiting -> more retries -> more rate limiting.
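
The amplification factor follows directly from the window length and the retry interval. A minimal sketch of that arithmetic, with an illustrative function name:

```python
import math


def retries_until_reset(seconds_until_reset: float, retry_interval: float) -> int:
    """Number of retries a blocked client sends before the window resets."""
    return math.ceil(seconds_until_reset / retry_interval)


# Fixed 1-second sleep against a 60-second window (the incident)
fixed = retries_until_reset(60, 1)
# Client that waits for the full reset time instead
respectful = retries_until_reset(60, 60)

print(fixed, respectful)
```

This prints `60 1`: a client that knows when the window resets sends one retry where the fixed-sleep client sends sixty.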

The Fix

Always include Retry-After:

def handle_rate_limited_response(identifier: str, reset_at: int) -> Response:
    now = int(time.time())
    retry_after = max(1, reset_at - now)
 
    # Optional: Add jitter to stagger retries
    retry_after_jittered = retry_after + random.randint(0, 10)
 
    return Response(
        status=429,
        headers={
            "Retry-After": str(retry_after_jittered),
            "X-RateLimit-Reset": str(reset_at),
            "X-RateLimit-Remaining": "0"
        },
        body={
            "error": "rate_limit_exceeded",
            "retry_after": retry_after_jittered
        }
    )

Calling service: respect Retry-After, use backoff:

def call_with_retry(func, max_retries=5):
    for attempt in range(max_retries):
        response = func()
        if response.status_code == 429:
            # Use server's Retry-After (seconds), with fallback to exponential backoff
            retry_after = response.headers.get("Retry-After")
            try:
                wait = float(retry_after)
            except (TypeError, ValueError):
                # Header missing, or sent as an HTTP-date rather than seconds
                wait = min(300, (2 ** attempt) + random.uniform(0, 5))  # jitter
            if attempt < max_retries - 1:
                time.sleep(wait)
        else:
            return response
    raise MaxRetriesExceeded()
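
HTTP allows Retry-After to carry either a number of seconds or an HTTP-date, so a client that only calls `float()` on it can fail on a valid header. The helper below is a hedged sketch of parsing both forms with the standard library; `parse_retry_after` is a hypothetical name, not part of the code above.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Return the wait in seconds, or None if the header is absent or unparseable."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        # delay-seconds form, e.g. "120"
        return float(value)
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry immediately"
    return max(0.0, (target - now).total_seconds())


now = datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc)
print(parse_retry_after("120"))
print(parse_retry_after("Wed, 21 Oct 2015 07:28:00 GMT", now))
print(parse_retry_after("soon"))
```

A caller can fall back to exponential backoff whenever the helper returns None.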

Challenge 16: Hot Key Problem in Redis Cluster

What Happened

On a public API, a viral post drove 1,000x normal traffic to one specific content creator.
All requests for that creator's content were rate-limited by the same Redis key:
rl:user:{creator_id}. That key lived on shard 3 of a 10-node Redis Cluster. Shard 3
ran at 100% CPU while all other shards sat at 8%. p99 latency for every user on shard 3
(not just this creator) rose to 300ms from the normal 2ms.

The Fix

Local sharding for hot keys:

import random
 
NUM_VIRTUAL_SHARDS = 10
 
def get_sharded_key(base_key: str, shards: int = NUM_VIRTUAL_SHARDS) -> str:
    """
    For hot keys: distribute across N virtual shards to spread load.
    The actual count requires summing all shards (approximate).
    """
    shard_id = random.randint(0, shards - 1)
    return f"{base_key}:shard:{shard_id}"
 
def get_total_count_sharded(base_key: str, r, shards: int = NUM_VIRTUAL_SHARDS) -> int:
    """Get approximate total count across all shards."""
    pipe = r.pipeline()
    for i in range(shards):
        pipe.get(f"{base_key}:shard:{i}")
    results = pipe.execute()
    return sum(int(v or 0) for v in results)
 
# For the hot creator: 10 Redis keys instead of 1
# The keys hash to different slots, so load spreads across up to 10 cluster nodes
# Trade-off: reading total count requires 10 GET operations (batched in pipeline)
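
The snippet above shows how keys are chosen and summed but not the write path. The following in-memory sketch shows both sides together; a `Counter` stands in for Redis, and `incr_sharded` and `total_sharded` are illustrative names.

```python
import random
from collections import Counter

NUM_VIRTUAL_SHARDS = 10
store = Counter()  # stands in for Redis counters in this sketch


def incr_sharded(base_key: str, shards: int = NUM_VIRTUAL_SHARDS) -> None:
    """Increment one randomly chosen virtual shard of the counter."""
    shard_id = random.randint(0, shards - 1)
    store[f"{base_key}:shard:{shard_id}"] += 1


def total_sharded(base_key: str, shards: int = NUM_VIRTUAL_SHARDS) -> int:
    """Sum every virtual shard to recover the total count."""
    return sum(store[f"{base_key}:shard:{i}"] for i in range(shards))


for _ in range(1_000):
    incr_sharded("rl:user:creator42")

print(total_sharded("rl:user:creator42"))  # every increment is accounted for
print(len(store))                           # distinct shard keys actually written
```

In memory the total is exact. Against a live Redis the sum is only approximate, because other clients can increment shards between the individual reads.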

Separate, higher limits for detected hot users:

def get_limit_for_hot_user(user_id: str) -> int:
    # Detect hot users: if their per-minute count exceeds normal by 100x
    # Automatically route them to a higher limit (so their requests don't
    # clog the rate limit check for other users)
    if is_hot_user(user_id):
        return get_enterprise_limit(user_id)  # higher limit, separate key space
    return get_standard_limit(user_id)

Challenge 17: Rate Limit Bypass via HTTP Method Variation

What Happened

An API limited GET /api/users to 100 RPM. A developer discovered that the rate limiter
checked HTTP method + path together. POST /api/users was used for user creation and had
a different (looser) rate limit. The developer sent POST /api/users with a search body
to do pagination reads, bypassing the GET limit.

The Fix

Rate limit by normalized resource path, not HTTP method:

RESOURCE_LIMITS = {
    "/api/users":          100,   # All methods share this limit
    "/api/products":       500,
    "/api/search":         30,
    "/api/auth/login":     5,
}
 
def get_resource_key(request, user_id: str) -> str:
    # Normalize: ignore the method, normalize path params
    path = normalize_path(request.path)
    # All methods to the same resource share one counter
    return f"rl:resource:{user_id}:{path}"
 
# GET /api/users and POST /api/users both count against the same "users" counter
# Method-specific limits are still possible by charging a per-method cost:
WRITE_ADDITIONAL_COST = {
    "POST":   2,   # Writes cost 2x reads
    "PUT":    2,
    "DELETE": 3,
    "PATCH":  1,
    "GET":    1,
    "HEAD":   0,   # No cost
}
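
The tables above do not show how the per-method cost is charged against the shared counter. One way to wire it up is sketched below under stated assumptions: a dict stands in for Redis, and `normalize_path`, `allow`, `METHOD_COST`, and `DEFAULT_LIMIT` are hypothetical names for this illustration.

```python
import re
from collections import defaultdict

RESOURCE_LIMITS = {"/api/users": 100}
DEFAULT_LIMIT = 60  # assumed fallback for resources not listed above
METHOD_COST = {"GET": 1, "HEAD": 0, "PATCH": 1, "POST": 2, "PUT": 2, "DELETE": 3}

usage = defaultdict(int)  # stands in for Redis counters in this sketch


def normalize_path(path: str) -> str:
    """Hypothetical normalizer: drop the query string and numeric ID segments."""
    path = path.split("?", 1)[0].rstrip("/")
    return re.sub(r"/\d+(?=/|$)", "", path)


def allow(user_id: str, method: str, path: str) -> bool:
    """Charge the shared per-resource counter, weighted by HTTP method."""
    resource = normalize_path(path)
    key = f"rl:resource:{user_id}:{resource}"
    cost = METHOD_COST.get(method.upper(), 1)
    if usage[key] + cost > RESOURCE_LIMITS.get(resource, DEFAULT_LIMIT):
        return False
    usage[key] += cost
    return True


reads = sum(allow("u1", "GET", "/api/users?page=1") for _ in range(60))  # 1 unit each
writes = sum(allow("u1", "POST", "/api/users") for _ in range(30))        # 2 units each
print(reads, writes, usage["rl:resource:u1:/api/users"])
```

This prints `60 20 100`: the 60 reads leave 40 units, which covers only 20 of the 30 POSTs, so switching methods no longer opens a second budget.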

Challenge 18: False Positive During Feature Flag Rollout

What Happened

A new "smart refresh" feature for mobile clients was rolled out via feature flags to 1%
of users. The feature caused the mobile app to poll an API 10x more frequently to keep
content fresh. That 1% of users immediately hit their rate limits. The team interpreted
this as a bug in the mobile app, not a rate limit issue, and spent 4 hours debugging
mobile code before realizing the feature simply needed higher rate limits.

The Fix

Pre-flight rate limit impact analysis for feature flags:

USER_HOURLY_LIMIT = 500  # per-user hourly limit; 500 matches the 120% figure in the example
 
class FeatureFlagRateLimitAnalyzer:
    """
    Before enabling a feature flag, estimate its rate limit impact.
    """
    def estimate_impact(
        self,
        feature_name: str,
        api_calls_per_user_per_hour: int,
        rollout_percentage: float,
        total_active_users: int
    ) -> dict:
        affected_users = int(total_active_users * rollout_percentage)
        additional_rps = (affected_users * api_calls_per_user_per_hour) / 3600
        user_usage_pct = api_calls_per_user_per_hour / USER_HOURLY_LIMIT * 100
 
        return {
            "feature": feature_name,
            "affected_users": affected_users,
            "additional_rps": additional_rps,
            "user_rate_limit_usage_pct": user_usage_pct,
            "will_hit_user_limit": user_usage_pct > 80,
            "recommendation": (
                "Increase user limit before rollout"
                if user_usage_pct > 80 else "Safe to roll out"
            )
        }
 
# Before rolling out "smart refresh":
analyzer = FeatureFlagRateLimitAnalyzer()
analysis = analyzer.estimate_impact(
    feature_name="smart_refresh",
    api_calls_per_user_per_hour=600,   # 10x polling, 60 polls/hour * 10 = 600
    rollout_percentage=0.01,
    total_active_users=500_000
)
# Would have shown: user_rate_limit_usage_pct = 120% -> will_hit_user_limit = True
# Team would have adjusted limits before rollout

Challenge 19: WebSocket Orphan Connection Accumulation

What Happened

A real-time messaging service had a limit of 5 concurrent WebSocket connections per user.
Mobile clients that lost network connectivity did not send a CLOSE frame. The server-side
connections lingered for hours (no timeout configured). Users on unstable mobile connections
hit their 5-connection limit after switching between WiFi and cellular a few times. They
could no longer connect to the messaging service. Support tickets: "App is broken."

The Fix

class WebSocketConnectionManager:
    """
    Track and enforce concurrent WebSocket limits with TTL-based cleanup.
    """
 
    MAX_CONNECTIONS_PER_USER = 5
    CONNECTION_TTL = 300  # 5-minute heartbeat window
 
    def add_connection(self, user_id: str, connection_id: str) -> bool:
        key = f"ws:conns:{user_id}"
        conn_ttl_key = f"ws:conn:{connection_id}:ttl"
 
        # Cleanup stale connections first
        self._cleanup_stale(user_id)
 
        # Check current count
        count = int(self.r.scard(key) or 0)
        if count >= self.MAX_CONNECTIONS_PER_USER:
            return False
 
        # Add connection
        self.r.sadd(key, connection_id)
        self.r.expire(key, self.CONNECTION_TTL * 2)
        self.r.setex(conn_ttl_key, self.CONNECTION_TTL, "alive")
        return True
 
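    def heartbeat(self, user_id: str, connection_id: str) -> None:
        """Client sends heartbeat every 60 seconds to keep connection alive."""
        conn_ttl_key = f"ws:conn:{connection_id}:ttl"
        self.r.expire(conn_ttl_key, self.CONNECTION_TTL)
        # Also refresh the per-user set, which would otherwise expire after
        # CONNECTION_TTL * 2 even while its connections are still alive
        self.r.expire(f"ws:conns:{user_id}", self.CONNECTION_TTL * 2)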
 
    def remove_connection(self, user_id: str, connection_id: str) -> None:
        key = f"ws:conns:{user_id}"
        self.r.srem(key, connection_id)
        self.r.delete(f"ws:conn:{connection_id}:ttl")
 
    def _cleanup_stale(self, user_id: str) -> None:
        """Remove connections whose heartbeat TTL has expired."""
        key = f"ws:conns:{user_id}"
        # Assumes decode_responses=True so members come back as str, not bytes
        all_connections = self.r.smembers(key)
        for conn_id in all_connections:
            ttl_key = f"ws:conn:{conn_id}:ttl"
            if not self.r.exists(ttl_key):
                # TTL expired = heartbeat not received = stale connection
                self.r.srem(key, conn_id)
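
The cleanup behavior can be exercised without Redis. The class below is an in-memory stand-in with an injectable clock, written as a sketch of the same TTL idea rather than the production manager; all names in it are illustrative.

```python
class InMemoryConnectionTracker:
    """In-memory stand-in for the Redis-backed manager, with an injectable clock."""

    MAX_CONNECTIONS_PER_USER = 5
    CONNECTION_TTL = 300  # seconds without a heartbeat before a connection is stale

    def __init__(self, clock):
        self.clock = clock
        self.last_seen = {}  # user_id -> {connection_id: last heartbeat time}

    def _cleanup_stale(self, user_id):
        conns = self.last_seen.setdefault(user_id, {})
        cutoff = self.clock() - self.CONNECTION_TTL
        for conn_id in [c for c, seen in conns.items() if seen <= cutoff]:
            del conns[conn_id]
        return conns

    def add_connection(self, user_id, connection_id):
        conns = self._cleanup_stale(user_id)
        if len(conns) >= self.MAX_CONNECTIONS_PER_USER:
            return False
        conns[connection_id] = self.clock()
        return True

    def heartbeat(self, user_id, connection_id):
        conns = self.last_seen.get(user_id, {})
        if connection_id in conns:
            conns[connection_id] = self.clock()


now = [0]
tracker = InMemoryConnectionTracker(clock=lambda: now[0])
opened = [tracker.add_connection("u1", f"c{i}") for i in range(6)]  # sixth is refused
now[0] = 200
tracker.heartbeat("u1", "c0")  # only c0 keeps sending heartbeats
now[0] = 301
reconnect = tracker.add_connection("u1", "c6")  # c1..c4 have gone stale by now
print(opened, reconnect, sorted(tracker.last_seen["u1"]))
```

After 301 seconds the four silent connections are dropped, so the reconnect succeeds and only `c0` and `c6` remain tracked.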

Challenge 20: Multi-Region Request Duplication Exhausting Limits

What Happened

A global API with US-EAST and EU-WEST regions used cross-region active-active replication.
A mobile client on a flaky connection sometimes had requests processed in BOTH regions
(the first region processed the request but the response was lost; the client retried and
hit the second region). The user's rate limit counter was incremented TWICE for one logical
request. At 50 requests/minute, a user on a flaky mobile network could hit their 100
request/minute limit after only 50 actual logical requests.

The Fix

Idempotency key-based counting:

class IdempotencyAwareRateLimiter:
    """
    Uses idempotency keys to ensure retries of the same request
    do not consume additional rate limit tokens.
    """
 
    IDEMPOTENCY_TTL = 300  # 5 minutes
 
    def check_and_consume(
        self,
        user_id: str,
        idempotency_key: str = None,
        cost: int = 1
    ) -> dict:
        if idempotency_key:
            # Check if we've already processed this idempotency key
            idem_redis_key = f"idem:{user_id}:{idempotency_key}"
            existing = self.r.get(idem_redis_key)
 
            if existing:
                # Already counted - return same result, no new charge
                result = json.loads(existing)
                result["idempotent_replay"] = True
                return result
 
        # New request: check rate limit
        result = self.rate_limiter.check(user_id, cost=cost)
 
        if idempotency_key and result["allowed"]:
            # Store result for idempotency replay
            self.r.setex(
                f"idem:{user_id}:{idempotency_key}",
                self.IDEMPOTENCY_TTL,
                json.dumps(result)
            )
 
        return result
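
The effect on the counter can be seen in a small in-memory model. This is a sketch under simplifying assumptions: dicts replace Redis, there is no TTL, and both regions are assumed to see the same idempotency store, which replication lag can violate in practice. The class name is illustrative.

```python
class InMemoryIdempotentLimiter:
    """In-memory model of idempotency-aware counting (no TTL, single store)."""

    def __init__(self, limit):
        self.limit = limit
        self.used = {}  # user_id -> tokens consumed
        self.seen = {}  # (user_id, idempotency_key) -> stored result

    def check_and_consume(self, user_id, idempotency_key=None):
        cache_key = (user_id, idempotency_key)
        if idempotency_key and cache_key in self.seen:
            # Retry of a request that was already counted: no new charge
            return {**self.seen[cache_key], "idempotent_replay": True}
        used = self.used.get(user_id, 0)
        allowed = used < self.limit
        if allowed:
            self.used[user_id] = used + 1
        result = {"allowed": allowed, "used": self.used.get(user_id, 0)}
        if idempotency_key and allowed:
            self.seen[cache_key] = result
        return result


with_keys = InMemoryIdempotentLimiter(limit=100)
without_keys = InMemoryIdempotentLimiter(limit=100)
for i in range(50):
    for _ in range(2):  # original request plus one retry that lands elsewhere
        with_keys.check_and_consume("u1", idempotency_key=f"req-{i}")
        without_keys.check_and_consume("u1")

print(with_keys.used["u1"], without_keys.used["u1"])
```

This prints `50 100`: with idempotency keys, 50 logical requests consume 50 tokens, while without them the same traffic exhausts the 100-request limit.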

Client-side: generate idempotency key once per logical request:

import random
import time
import uuid
 
import requests
 
class ResilientAPIClient:
    def make_request(self, method: str, path: str, body: dict = None) -> dict:
        # Generate idempotency key ONCE for this logical request
        idempotency_key = str(uuid.uuid4())
 
        for attempt in range(5):
            response = requests.request(
                method, f"{self.base_url}{path}",
                json=body,
                headers={
                    "X-Idempotency-Key": idempotency_key,  # Same key on every retry
                    "Authorization": f"Bearer {self.token}"
                }
            )
            if response.status_code != 429:
                return response.json()
 
            retry_after = float(response.headers.get("Retry-After", 2 ** attempt))
            time.sleep(retry_after + random.uniform(0, 1))
 
        raise MaxRetriesError()

Production Challenge Quick Reference

| Challenge | Primary Cause | Detection Signal | Key Fix |
|---|---|---|---|
| Memory explosion | Wrong algorithm at scale | Redis memory climbing | Switch to O(1) algorithm |
| Rate limiter bottleneck | Too many Redis ops | p99 latency spike | Pipeline + local cache |
| Split-brain failover | Redis replication lag | 2x traffic during failover | Accept or use Cluster quorum |
| IP rotation attack | ASN-wide attack | Many IPs, same ASN, low per-IP | Fingerprinting + ASN blocking |
| CGN blocking | Shared IP for thousands | One carrier's users all blocked | User-based limiting |
| Clock skew | NTP drift | Boundary-case over-limit | Redis TIME in Lua |
| Thundering herd | Synchronized retry | Spike at maintenance end | Jittered Retry-After |
| False positive campaign | IP-based + viral traffic | Support spike + social media | Behavioral signals + soft limits |
| Deployment reset | In-memory state | Limits reset on deploy | Redis-backed state |
| DB pool exhaustion | Tier lookup in hot path | DB connections spike | Cache tier in Redis |
| Internal fan-out DoS | 1 request -> N internal calls | Service B overwhelmed by A | Outbound limiting at caller |
| Job bug quota drain | Infinite loop | Quota burned in minutes | Separate job quotas + burn alert |
| Multi-tenant bleeding | Shared quota | One tenant blocks others | Per-tenant isolation |
| Lua timeout | Complex script | Redis BUSY errors | Simplify Lua |
| Retry storm | Missing Retry-After | 429 storm in logs | Always include Retry-After |
| Hot key CPU spike | Viral user | One Redis shard at 100% | Virtual key sharding |
| Method bypass | Method-specific limits | Unusual POST patterns | Resource-based (not method) limits |
| Feature flag impact | No pre-flight analysis | Sudden rate limit spike | Rate limit impact analysis |
| WebSocket orphan | No heartbeat timeout | Connection count climbs | TTL-based cleanup |
| Multi-region duplication | No idempotency | Flaky mobile double-counted | Idempotency keys in rate limits |

Next Supplement: Supplement 3 - Trade-Offs and Decision Guide