6/6/2026 | Admin Post

Rate Limiting Demystified - Part 4: Distributed Rate Limiting

Series Navigation:
Index |
Part 1 - Fundamentals |
Part 2 - Algorithms |
Part 3 - Implementation |
Part 5 - Advanced |
Part 6 - Interview Questions


Table of Contents

  1. Why Distributed Rate Limiting Is Hard
  2. Centralized Rate Limiting
  3. Decentralized (Local) Rate Limiting
  4. Race Conditions and Atomicity
  5. Redis Lua Scripts for Atomicity
  6. Redis Transactions vs Lua Scripts
  7. Redis Cluster Considerations
  8. Sticky Sessions Approach
  9. Hybrid: Local Approximate + Global Precise
  10. Rate Limiting in a Service Mesh
  11. Multi-Region Rate Limiting
  12. Handling Redis Failures: Fail-Open vs Fail-Closed

1. Why Distributed Rate Limiting Is Hard

The Single-Server Illusion

On a single server, rate limiting is trivial. A global counter in memory, incremented on
every request, checked against a limit. Done.

Single Server:
[User] --> [Server A (counter=47)] --> Allow or Deny

But almost no production system runs on a single server. You have a cluster:

Distributed:
[User] --> [Load Balancer]  --> [Server A (counter=15)]
                            --> [Server B (counter=12)]
                            --> [Server C (counter=20)]

Each server only sees its fraction of requests. If the limit is 100/minute and you have
3 servers each counting independently, you effectively allow up to 300 requests/minute.
Your rate limit is broken.
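
The multiplication is easy to reproduce in a few lines. The sketch below assumes a round-robin load balancer and a hypothetical `LocalCounter` class (neither comes from a real library); every server enforces the full limit against only the requests it sees.

```python
from itertools import cycle


class LocalCounter:
    """Hypothetical per-server counter for a single window."""

    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0

    def allow(self) -> bool:
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


def simulate(num_servers: int, limit: int, requests: int) -> int:
    """Send `requests` round-robin across servers; return how many were allowed."""
    servers = [LocalCounter(limit) for _ in range(num_servers)]
    balancer = cycle(servers)
    return sum(next(balancer).allow() for _ in range(requests))


allowed_single = simulate(num_servers=1, limit=100, requests=1000)
allowed_cluster = simulate(num_servers=3, limit=100, requests=1000)
print(allowed_single, allowed_cluster)
```

With one server the limit holds; with three independent counters the same traffic gets three times as many requests through.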

The Four Core Problems

Problem 1: No Shared State
Each application instance maintains its own in-memory counter. These counters are not
synchronized. The total rate can be limit * number_of_instances.

Problem 2: Race Conditions
Even with a shared Redis, multiple instances can race to read-then-write:

Time 0: Server A reads count=99, Server B reads count=99 (both see 99 < 100)
Time 1: Server A writes count=100 (increments to 100)
Time 2: Server B writes count=100 (ALSO increments to 100 - both allowed!)
Result: 2 requests allowed when only 1 should be permitted

Problem 3: Clock Skew
Different servers have slightly different system clocks. In fixed-window rate limiting,
the window boundaries are derived from the current time. If each server computes the
window from its own clock and two servers disagree by even a fraction of a second, they
can place simultaneous requests in different windows near a boundary, splitting the count.

Example with 1-minute windows, both clocks read at the same real instant:
Server A clock: 12:00:59.500 (running slow)
Server B clock: 12:01:00.200 (running fast)

At the transition from window 1 (12:00) to window 2 (12:01):
Server A puts its requests in Window 1 (clock says 12:00:59.5)
Server B puts its requests in Window 2 (clock says 12:01:00.2)
Each window appears to have fewer requests than it should.
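
To make the boundary effect concrete, here is a minimal sketch of how each server would derive a fixed-window ID from its own clock. The timestamps and skew values are assumed for illustration only.

```python
def window_id(clock_seconds: float, window_seconds: int = 60) -> int:
    """Fixed-window ID as a server computes it from its own clock."""
    return int(clock_seconds // window_seconds)


# One real instant, 0.2 s before a minute boundary (assumed value).
true_time = 1_735_689_599.8

server_a = window_id(true_time - 0.3)  # clock runs 300 ms slow
server_b = window_id(true_time + 0.4)  # clock runs 400 ms fast

print(server_a, server_b, server_a == server_b)
```

The two servers disagree on the window for the same instant, so requests that arrive together are counted under different keys.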

Problem 4: Network Latency
Every rate limit check requires a round trip to Redis. At p99, this can add 5-20ms of
latency. For high-frequency endpoints, this overhead is significant.


2. Centralized Rate Limiting

Architecture

All rate limit state lives in a single shared store (Redis). Every application instance
queries this store for every request.

[Server A] --+
[Server B] --+--> [Redis Cluster] (single source of truth)
[Server C] --+

Implementation (as seen in Part 3)

The implementations using Redis INCR, ZADD, or Lua scripts in Part 3 ARE centralized
rate limiting. The key insight: by pushing all state to Redis, all instances share the
same counter.

Redis Key Design for Centralized Rate Limiting

# Good: Namespaced, scannable, evictable
key = f"rl:{algorithm}:{entity_type}:{identifier}:{window}"
 
# Examples:
"rl:fw:user:user_123:1735689600"      # Fixed window, user, window ID
"rl:sw:apikey:abc123"                  # Sliding window, API key
"rl:tb:ip:203.0.113.42"               # Token bucket, IP address
"rl:fw:global:system:1735689600"       # Global limit
 
# Key length matters: binary_remote_addr vs remote_addr in Nginx
# IPv4: 4 bytes vs up to 15 characters
# Use hashing for long identifiers
import hashlib
def make_redis_key(prefix: str, identifier: str) -> str:
    if len(identifier) > 64:
        identifier = hashlib.sha256(identifier.encode()).hexdigest()[:16]
    return f"{prefix}:{identifier}"

Pros and Cons of Centralized

Pros                               | Cons
Accurate: single source of truth   | Extra network hop per request (~1-5ms)
Works with any number of instances | Redis is a single point of failure
No complex synchronization needed  | Redis can become a bottleneck at very high QPS
Easy to implement with Redis       | Requires Redis infrastructure

Capacity Planning for Redis Rate Limiting

Calculation:
  Users: 1,000,000
  Keys per user: 3 (per-second, per-minute, per-day)
  Key size: ~50 bytes average
  Value size: 8 bytes (int64 counter)
  TTL overhead: minimal (Redis handles internally)

Memory estimate:
  1,000,000 users x 3 keys x (50 + 8) bytes = ~174 MB

  For sliding window log at 1000 limit:
  1,000,000 users x 1000 entries x 20 bytes = ~20 GB (!)
  -> This is why sliding window log is rarely used at scale
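
The same arithmetic as a small helper, useful for plugging in your own numbers. Function names are illustrative, and the figures cover raw key and value bytes only; Redis adds per-key bookkeeping overhead on top, so treat the result as a lower bound.

```python
def counter_memory_bytes(users: int, keys_per_user: int,
                         key_bytes: int, value_bytes: int) -> int:
    """Raw payload for counter-based limiters (fixed window, token bucket)."""
    return users * keys_per_user * (key_bytes + value_bytes)


def sliding_log_memory_bytes(users: int, entries_per_user: int,
                             entry_bytes: int) -> int:
    """Raw payload for a sliding window log with one entry per request."""
    return users * entries_per_user * entry_bytes


counters = counter_memory_bytes(1_000_000, 3, 50, 8)
log = sliding_log_memory_bytes(1_000_000, 1000, 20)
print(f"counters: {counters / 1e6:.0f} MB, sliding log: {log / 1e9:.0f} GB")
```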

3. Decentralized (Local) Rate Limiting

Architecture

Each server instance maintains its own in-memory rate limiter. No shared state.

[Server A] --> [Local Limiter A: count=15]
[Server B] --> [Local Limiter B: count=12]
[Server C] --> [Local Limiter C: count=20]

When Local Rate Limiting Is Acceptable

Surprisingly, local rate limiting IS acceptable in specific scenarios:

  1. Single-instance deployments: Development environments, small services
  2. Load balanced sticky sessions: User always hits the same server (see section 8)
  3. Approximate global limits: You want to limit to ~N per server, total is ~N*instances
  4. Protecting individual server resources: CPU, memory, connections per server
  5. Rate limiting outbound calls: When YOUR code calls external APIs, local bucket is fine

Java In-Memory Implementation (Bucket4j)

import io.github.bucket4j.Bandwidth;
import io.github.bucket4j.Bucket;
import io.github.bucket4j.Bucket4j;
import io.github.bucket4j.Refill;
import java.time.Duration;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

@Service
public class LocalRateLimitService {
 
    private static final long CAPACITY = 100;

    // ConcurrentHashMap for thread safety
    private final ConcurrentHashMap<String, Bucket> userBuckets = new ConcurrentHashMap<>();

    // Scheduled cleanup to prevent memory leaks
    @Scheduled(fixedDelay = 300_000) // every 5 minutes
    public void cleanupStaleBuckets() {
        // A full bucket carries no rate-limit state, so dropping it is safe
        // In production, use Caffeine cache with expiry instead
        userBuckets.entrySet().removeIf(entry ->
            entry.getValue().getAvailableTokens() >= CAPACITY
        );
    }
 
    public boolean isAllowed(String userId) {
        // Bucket4j.builder() is the older entry point; newer releases use Bucket.builder()
        Bucket bucket = userBuckets.computeIfAbsent(userId, key ->
            Bucket4j.builder()
                .addLimit(Bandwidth.classic(CAPACITY, Refill.greedy(CAPACITY, Duration.ofMinutes(1))))
                .build()
        );
        return bucket.tryConsume(1);
    }
}

Memory Leak Warning

Local in-memory rate limiters MUST have a cleanup strategy:

import threading
import time

from cachetools import TTLCache
 
 
class LocalRateLimiter:
    """
    Local fixed-window rate limiter with automatic cleanup via TTL cache.
    Each (user, window) counter is dropped once its TTL passes.
    """
 
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        # TTLCache drops entries once their TTL passes, and evicts early
        # if maxsize is exceeded (which resets the evicted counters)
        self.cache = TTLCache(maxsize=100_000, ttl=window_seconds)
        self.lock = threading.Lock()
 
    def is_allowed(self, identifier: str) -> bool:
        # Key by window ID: writing to a TTLCache entry restarts its TTL, so a
        # single per-user entry would never reset while the user keeps sending
        key = (identifier, int(time.time() // self.window_seconds))
        with self.lock:
            count = self.cache.get(key, 0)
            if count >= self.limit:
                return False
            self.cache[key] = count + 1
            return True

4. Race Conditions and Atomicity

The Classic Race Condition

Redis key: "rl:user123" = 99
Limit: 100

Thread A: GET "rl:user123" -> 99 (reads 99)
Thread B: GET "rl:user123" -> 99 (reads 99)
Thread A: 99 < 100 -> allowed! SET "rl:user123" 100
Thread B: 99 < 100 -> allowed! SET "rl:user123" 100

Result: Both allowed. But only one should have been.
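
The interleaving can be replayed deterministically without a Redis server. The sketch below uses a hypothetical in-memory `FakeRedis` stand-in (not a real client) and compares read-then-write against an atomic increment.

```python
class FakeRedis:
    """In-memory stand-in for the three commands used here (illustrative only)."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key, 0)

    def set(self, key, value):
        self.data[key] = value

    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]


LIMIT = 100
KEY = "rl:user123"

# Read-then-write: both clients read before either one writes
store = FakeRedis()
store.set(KEY, 99)
seen_a = store.get(KEY)
seen_b = store.get(KEY)
racy_allowed = 0
for seen in (seen_a, seen_b):
    if seen < LIMIT:
        store.set(KEY, seen + 1)
        racy_allowed += 1
racy_final = store.get(KEY)

# Atomic increment: every caller receives a distinct value
store = FakeRedis()
store.set(KEY, 99)
atomic_allowed = sum(store.incr(KEY) <= LIMIT for _ in range(2))

print(racy_allowed, racy_final, atomic_allowed)
```

The racy version admits both requests and still records a count of 100, so the over-admission is invisible afterwards.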

Why INCR Solves the Counter Problem

Redis INCR is atomic - it reads and increments in a single operation. A separate GET
followed by a conditional INCR is still a TOCTOU (Time of Check, Time of Use) bug,
because another client can increment between the check and the write. The fix is to
increment first and compare the value that INCR returns.

# INCORRECT: Non-atomic check-then-increment
count = redis.get(key)
if int(count or 0) < limit:
    redis.incr(key)   # Another client may increment between these two lines
    return True
return False
 
# BETTER: Atomic increment, then check the returned value
count = redis.incr(key)   # Atomic! Every caller gets a distinct value
if count == 1:
    redis.expire(key, window_seconds)
return count <= limit
# Remaining issue: INCR and EXPIRE are two separate commands, so a crash
# between them leaves the key without a TTL. The count check is now safe.
# Note: denied requests still increment the counter with this pattern.

The Remaining Problem with INCR

With INCR, if 1000 concurrent requests all increment the key and get values 1-1000,
exactly the 100 whose INCR returned 1-100 are allowed. This is CORRECT behavior: the
race condition above, where two clients both see 99, is eliminated.

However, INCR and EXPIRE are still two separate commands, so the TTL can be lost:

count = redis.incr(key)      # Client A: gets 1
                              # CRASH HERE - process dies before EXPIRE
if count == 1:                # Client A never gets this far
    redis.expire(key, 60)     # TTL is never set
                              # Client B: gets 2, count != 1, doesn't set TTL
                              # Key lives FOREVER
 
# Fix: create the key together with its TTL, then increment
redis.set(key, 0, nx=True, ex=60)  # SET key 0 NX EX 60: no-op if the key exists
count = redis.incr(key)            # INCR keeps the existing TTL
# A tiny gap remains if the key expires between the two commands.

# Avoid calling EXPIRE on every request when the key has no window ID: each call
# pushes the TTL forward, so a steady client's counter never resets.
 
# Better fix: use a Lua script

5. Redis Lua Scripts for Atomicity

Lua scripts in Redis execute atomically. The entire script runs without interruption
from any other Redis command. This is the correct solution for complex rate limiting logic.

Complete Lua Script Reference

-- atomic_rate_limit.lua
-- Implements sliding window counter with atomic check-and-increment
--
-- KEYS[1]: base key for this rate limit
-- ARGV[1]: limit (max requests per window)
-- ARGV[2]: window_size (in seconds)
-- ARGV[3]: current_time (Unix timestamp, integer seconds)
-- Returns: {allowed (0/1), current_count, remaining, reset_at}
 
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
 
-- Determine window boundaries
local current_window_id = math.floor(now / window)
local prev_window_id = current_window_id - 1
 
local curr_key = key .. ':' .. current_window_id
local prev_key = key .. ':' .. prev_window_id
 
-- Get counts for current and previous windows
local curr_count = tonumber(redis.call('GET', curr_key) or '0')
local prev_count = tonumber(redis.call('GET', prev_key) or '0')
 
-- How far are we into the current window? (0.0 to 1.0)
local window_elapsed = (now % window) / window
 
-- Estimate using sliding window: prev * (1 - elapsed) + curr
local estimated = prev_count * (1 - window_elapsed) + curr_count
 
if estimated < limit then
    -- Allow: increment current window
    local new_count = redis.call('INCR', curr_key)
    if new_count == 1 then
        redis.call('EXPIRE', curr_key, window * 2)
    end
    local reset_at = (current_window_id + 1) * window
    local remaining = limit - math.floor(estimated) - 1
    return {1, math.floor(estimated) + 1, math.max(0, remaining), reset_at}
else
    -- Deny
    local reset_at = (current_window_id + 1) * window
    return {0, math.floor(estimated), 0, reset_at}
end

Using the Script in Python

import redis
import time
 
 
class AtomicSlidingWindowLimiter:
 
    SCRIPT = """
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local current_window_id = math.floor(now / window)
local prev_window_id = current_window_id - 1
local curr_key = key .. ':' .. current_window_id
local prev_key = key .. ':' .. prev_window_id
local curr_count = tonumber(redis.call('GET', curr_key) or '0')
local prev_count = tonumber(redis.call('GET', prev_key) or '0')
local window_elapsed = (now % window) / window
local estimated = prev_count * (1 - window_elapsed) + curr_count
if estimated < limit then
    local new_count = redis.call('INCR', curr_key)
    if new_count == 1 then redis.call('EXPIRE', curr_key, window * 2) end
    local reset_at = (current_window_id + 1) * window
    return {1, math.floor(estimated) + 1, math.max(0, limit - math.floor(estimated) - 1), reset_at}
else
    return {0, math.floor(estimated), 0, (current_window_id + 1) * window}
end
"""
 
    def __init__(self, r: redis.Redis, limit: int, window_seconds: int):
        self.r = r
        self.limit = limit
        self.window_seconds = window_seconds
        self._script = r.register_script(self.SCRIPT)
 
    def is_allowed(self, identifier: str) -> dict:
        now = int(time.time())
        key = f"rl:sw:{identifier}"
        result = self._script(keys=[key], args=[self.limit, self.window_seconds, now])
 
        return {
            "allowed": bool(int(result[0])),
            "current_count": int(result[1]),
            "remaining": int(result[2]),
            "reset_at": int(result[3]),
            "limit": self.limit
        }
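
For unit tests or quick sanity checks it can help to mirror the script's arithmetic outside Redis. The function below is a plain-Python restatement of the same estimate and return values; its name and signature are assumptions for illustration, and it takes the two window counts as inputs instead of reading them from Redis.

```python
import math


def sliding_window_decision(prev_count: int, curr_count: int,
                            now: int, window: int, limit: int) -> dict:
    """Pure-Python mirror of the Lua script's decision logic."""
    elapsed = (now % window) / window            # 0.0 to 1.0 into current window
    estimated = prev_count * (1 - elapsed) + curr_count
    reset_at = (now // window + 1) * window
    if estimated < limit:
        used = math.floor(estimated) + 1
        return {"allowed": True, "current_count": used,
                "remaining": max(0, limit - used), "reset_at": reset_at}
    return {"allowed": False, "current_count": math.floor(estimated),
            "remaining": 0, "reset_at": reset_at}


# 15 s into a 60 s window: 75% of the previous window still counts
ok = sliding_window_decision(prev_count=80, curr_count=30,
                             now=1_735_689_615, window=60, limit=100)
denied = sliding_window_decision(prev_count=100, curr_count=30,
                                 now=1_735_689_615, window=60, limit=100)
print(ok)
print(denied)
```

In the first call the estimate is 80 * 0.75 + 30 = 90, so the request is allowed; in the second it is 100 * 0.75 + 30 = 105, so it is denied.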

Important Lua Script Constraints

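1. Keep Lua scripts deterministic where possible: same inputs -> same outputs
2. Time source: Redis versions before 5.0 replicate the script text by default and
   reject writes that follow a non-deterministic command such as TIME. Passing the
   current time as an ARGV argument from the client works on every version, but the
   result then depends on each client's clock. On Redis 5.0+ (effects replication by
   default), redis.call('TIME') gives all clients one shared clock.
3. All keys a Lua script will access should be declared in KEYS[]
   (Redis Cluster routes by them). The scripts above derive their window keys from a
   base key for brevity; on Cluster, wrap the identifier in a hash tag (see section 7).
4. Keep Lua scripts short: they block other Redis commands while running
5. Use EVALSHA instead of EVAL in production to avoid resending the script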
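
On point 5: EVALSHA identifies a script by the SHA-1 hex digest of its source, so the ID can be computed on the client. A small sketch follows; the script body is an arbitrary example and `script_sha1` is an illustrative helper name. In redis-py, `register_script` (used above) takes care of this caching for you.

```python
import hashlib


def script_sha1(source: str) -> str:
    """SHA-1 hex digest of the script source, as used by EVALSHA."""
    return hashlib.sha1(source.encode("utf-8")).hexdigest()


script = "return redis.call('INCR', KEYS[1])"
sha = script_sha1(script)
changed = script_sha1(script + " ")  # any byte change yields a new ID

print(sha, len(sha))
```

Because the digest covers the exact bytes, even a whitespace change produces a different script ID that must be loaded again.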

6. Redis Transactions vs Lua Scripts

MULTI/EXEC Transactions

import redis


# Redis MULTI/EXEC: optimistic locking with WATCH
def is_allowed_with_transaction(r, key, limit, window_seconds=60, max_retries=5):
    with r.pipeline() as pipe:
        for _ in range(max_retries):
            try:
                # Watch the key for changes
                pipe.watch(key)
                count = int(pipe.get(key) or 0)
 
                if count >= limit:
                    pipe.reset()
                    return False
 
                # Start transaction
                pipe.multi()
                pipe.incr(key)
                if count == 0:
                    # Set the TTL only when the window starts, so later
                    # requests do not keep extending it
                    pipe.expire(key, window_seconds)
                pipe.execute()  # Raises WatchError if key changed since WATCH
                return True
 
            except redis.WatchError:
                # Another client modified the key - retry
                continue
    # Retries exhausted under high contention: deny instead of looping forever
    return False

Why Lua Scripts Win Over MULTI/EXEC

Aspect              | MULTI/EXEC                 | Lua Script
Atomicity           | Yes (if no WATCH conflict) | Yes (always)
Retry needed        | Yes (on WATCH conflict)    | No
Network round trips | 3+ (WATCH, MULTI, EXEC)    | 1 (EVAL)
Performance         | Degrades under contention  | Constant
Complexity          | Higher (retry loop)        | Lower
Use case            | Complex logic with reads   | Rate limiting (ideal)

Use Lua scripts for rate limiting. Always.


7. Redis Cluster Considerations

The Hash Slot Problem

Redis Cluster distributes data across 16,384 hash slots. Keys are assigned to slots based
on CRC16 hash. A rate limiter that uses multiple keys (e.g., current window + previous window)
can end up on DIFFERENT Redis nodes.

Lua scripts CANNOT access keys in different hash slots. Redis Cluster rejects the call
with a CROSSSLOT error. This will fail:

-- This will fail in Redis Cluster if KEYS[1] and KEYS[2] hash to different slots
local curr = redis.call('GET', KEYS[1])  -- slot X
local prev = redis.call('GET', KEYS[2])  -- slot Y (CROSSSLOT ERROR!)

Solution: Hash Tags

Force related keys to the same slot by using a hash tag {...}:

# Without hash tags: keys on different slots (BROKEN in cluster)
curr_key = f"rl:sw:{user_id}:curr"   # slot = CRC16("rl:sw:user123:curr") % 16384
prev_key = f"rl:sw:{user_id}:prev"   # slot = CRC16("rl:sw:user123:prev") % 16384
 
# With hash tags: same user, same slot (CORRECT)
curr_key = f"rl:sw:{{{user_id}}}:curr"   # slot = CRC16("user123") % 16384
prev_key = f"rl:sw:{{{user_id}}}:prev"   # slot = CRC16("user123") % 16384
# Both keys go to the same slot because {user_id} determines the hash
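
To check key placement without a running cluster, the slot calculation can be reproduced locally. This is a sketch of the rule described above (CRC16 in its XMODEM variant, modulo 16384, hashing only the content of the first non-empty `{...}` when present); the function names are illustrative.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021 and initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Hash slot for a key, honoring a hash tag if one is present."""
    raw = key.encode()
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end > start + 1:          # tag must contain at least one byte
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


user_id = "user123"
tagged_curr = key_slot(f"rl:sw:{{{user_id}}}:curr")
tagged_prev = key_slot(f"rl:sw:{{{user_id}}}:prev")
plain_curr = key_slot(f"rl:sw:{user_id}:curr")
plain_prev = key_slot(f"rl:sw:{user_id}:prev")

print(tagged_curr, tagged_prev, plain_curr, plain_prev)
```

The two tagged keys always share the slot of the bare tag content, while the untagged pair is hashed over the full key and has no such guarantee.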

Lua Script with Hash Tags

# Client side: build both keys with the same hash tag and declare them in KEYS
keys = [f"rl:sw:{{{user_id}}}:curr", f"rl:sw:{{{user_id}}}:prev"]

SCRIPT = """
local curr_key = KEYS[1]
local prev_key = KEYS[2]
-- Both keys share the {user_id} hash tag, so they are guaranteed to be on the same slot
local curr = tonumber(redis.call('GET', curr_key) or '0')
local prev = tonumber(redis.call('GET', prev_key) or '0')
...
"""

Redis Sentinel vs Cluster vs Standalone

Mode       | Use Case                  | Rate Limiting Notes
Standalone | Dev, small scale          | Simplest. Works with all patterns.
Sentinel   | HA, no sharding           | Automatic failover. Same as standalone for rate limiting.
Cluster    | Very high QPS, large data | Use hash tags! All keys per user must be co-located.

8. Sticky Sessions Approach

What It Is

With sticky sessions (session affinity), the load balancer routes ALL requests from a
specific user/IP to the SAME application server. That server can then do local in-memory
rate limiting.

[User 123] --> [Load Balancer] --> [Server A] (always, based on user ID hash)
[User 456] --> [Load Balancer] --> [Server B] (always)
[User 789] --> [Load Balancer] --> [Server C] (always)

Implementation

# Nginx sticky session based on IP
upstream backend {
    ip_hash;
    server server1:8080;
    server server2:8080;
    server server3:8080;
}
 
# AWS ELB: enable "Stickiness" in target group settings
# Stickiness type: Application-based / LB-based
# Duration: 1 day

When Sticky Sessions Break Down

Sticky sessions seem like an elegant solution but fail in several scenarios:

  1. Server failure: When Server A dies, all its users are redistributed. Their rate
    limit counters reset to zero. A user who was at 99/100 requests can immediately make
    another 100 requests.

  2. Scaling out: Adding a new server changes the hash distribution. Users that were
    on Server A may move to Server D. Counters reset.

  3. Mobile users: Mobile users change IPs frequently (WiFi to cellular). IP-hash
    sticky sessions break with IP changes.

  4. Corporate proxies: Thousands of users behind the same proxy IP all go to the
    same server. That server gets overloaded.
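
The scaling-out case (item 2) can be quantified with a short simulation. The sketch below assumes the balancer assigns users with plain modulo hashing, which is a simplification; consistent hashing moves fewer users, but every moved user still loses their counter.

```python
import zlib


def server_for(user_id: str, num_servers: int) -> int:
    """Assumed assignment rule: stable hash of the user ID, modulo server count."""
    return zlib.crc32(user_id.encode()) % num_servers


users = [f"user_{i}" for i in range(10_000)]

# Compare assignments before and after adding a fourth server
moved = sum(server_for(u, 3) != server_for(u, 4) for u in users)
moved_fraction = moved / len(users)

print(f"{moved_fraction:.0%} of users land on a different server")
```

With modulo hashing, going from 3 to 4 servers reassigns roughly three quarters of the users, and each of them starts again from an empty local counter.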

Conclusion: Sticky sessions are not a reliable solution for rate limiting. Use
centralized Redis instead.


9. Hybrid: Local Approximate + Global Precise

The Problem to Solve

Centralized Redis adds 1-5ms per request. For endpoints serving 10,000 RPS per instance,
that adds up to 10-50 seconds of cumulative Redis wait time for every second of traffic,
spread across the concurrent requests.

The Hybrid Approach

Keep a local counter that approximates the rate, and only sync with Redis periodically.

[Server A: local_count=15, synced_5s_ago] -----> [Redis: global_count=47]
[Server B: local_count=12, synced_3s_ago] -/
[Server C: local_count=20, synced_7s_ago] -/

Implementation

import time
import redis
import threading
from dataclasses import dataclass, field
 
 
@dataclass
class LocalCounter:
    window_id: int = -1       # fixed window this counter belongs to
    pending: int = 0          # allowed locally, not yet pushed to Redis
    global_count: int = 0     # global count as of the last Redis call
    last_sync: float = field(default_factory=time.time)
 
 
class HybridRateLimiter:
    """
    Hybrid rate limiter: local approximation + global Redis enforcement.
 
    Strategy:
    1. Each instance may allow up to `local_limit` requests per identifier
       without contacting Redis (the unsynced "pending" batch)
    2. Pending counts are flushed to Redis every `sync_interval` seconds;
       the flush returns the global count, which is cached locally
    3. When the batch is full, or the cached global count plus the batch
       would exceed the limit, the request is checked precisely in Redis
 
    Redis is called roughly once per batch instead of once per request.
    Worst-case overshoot is about instances * local_limit per window.
    """
 
    def __init__(
        self,
        r: redis.Redis,
        limit: int,
        window_seconds: int,
        # Largest unsynced batch per instance, as a fraction of the limit
        # For limit=100: reserve_percent=0.4 -> up to 40 requests between syncs
        reserve_percent: float = 0.4,
        sync_interval: float = 1.0,
    ):
        self.r = r
        self.limit = limit
        self.window_seconds = window_seconds
        self.local_limit = int(limit * reserve_percent)
        self.sync_interval = sync_interval
        # NOTE: grows without bound; evict idle identifiers in production
        self.locals: dict[str, LocalCounter] = {}
        self.lock = threading.Lock()
        self._script = r.register_script("""
local add = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local window = tonumber(ARGV[3])
local now = tonumber(ARGV[4])
local enforce = ARGV[5] == '1'
local full_key = KEYS[1] .. ':' .. math.floor(now / window)
local count = redis.call('INCRBY', full_key, add)
if count == add then
    redis.call('EXPIRE', full_key, window * 2)
end
if enforce and count > limit then
    -- Refund: this request is denied, so subtract it back
    redis.call('DECRBY', full_key, add)
    return {0, count - add}
end
return {1, count}
""")
 
    def is_allowed(self, identifier: str, cost: int = 1) -> bool:
        now = time.time()
        window_id = int(now // self.window_seconds)
        with self.lock:
            local = self.locals.get(identifier)
            if local is None or local.window_id != window_id:
                # New window: start a fresh counter (old pending counts are dropped)
                local = self.locals[identifier] = LocalCounter(window_id=window_id)
 
            # Phase 1: Fast local check against batch size and cached global count
            fits_batch = local.pending + cost <= self.local_limit
            fits_global = local.global_count + local.pending + cost <= self.limit
            if not (fits_batch and fits_global):
                # Local quota exhausted, go to Redis
                return self._global_check(identifier, local, cost)
 
            # Phase 2: Increment local counter
            local.pending += cost
 
            # Phase 3: Periodically sync local counts to Redis
            if now - local.last_sync >= self.sync_interval:
                self._sync_to_redis(identifier, local)
 
            return True
 
    def _run(self, identifier: str, add: int, enforce: bool) -> tuple[bool, int]:
        key = f"rl:hybrid:{identifier}"
        args = [add, self.limit, self.window_seconds, int(time.time()), int(enforce)]
        result = self._script(keys=[key], args=args)
        return bool(int(result[0])), int(result[1])
 
    def _global_check(self, identifier: str, local: LocalCounter, cost: int) -> bool:
        # Push pending counts first so the precise check sees them.
        # A RedisError from the check itself propagates to the caller.
        self._sync_to_redis(identifier, local)
        allowed, local.global_count = self._run(identifier, cost, enforce=True)
        return allowed
 
    def _sync_to_redis(self, identifier: str, local: LocalCounter) -> None:
        """Flush accumulated local counts to Redis."""
        local.last_sync = time.time()
        if local.pending == 0:
            return
        try:
            # enforce=False: these requests were already allowed, always count them
            _, local.global_count = self._run(identifier, local.pending, enforce=False)
            local.pending = 0
        except redis.RedisError:
            # If Redis is down, keep the pending count and retry on the next sync
            pass

10. Rate Limiting in a Service Mesh

Envoy Proxy Rate Limiting

Envoy is the data plane used by Istio and many other service meshes. It supports both
local rate limiting (per Envoy instance) and global rate limiting (via external gRPC service).

# Envoy local rate limiting filter
- name: envoy.filters.http.local_ratelimit
  typed_config:
    "@type": type.googleapis.com/udpa.type.v1.TypedStruct
    type_url: type.googleapis.com/envoy.extensions.filters.http.local_ratelimit.v3.LocalRateLimit
    value:
      stat_prefix: http_local_rate_limiter
      token_bucket:
        max_tokens: 1000
        tokens_per_fill: 1000
        fill_interval: 1s
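      filter_enabled:
        runtime_key: local_rate_limit_enabled
        default_value:
          numerator: 100
          denominator: HUNDRED
      # Without filter_enforced the filter only evaluates limits and does not reject
      filter_enforced:
        runtime_key: local_rate_limit_enforced
        default_value:
          numerator: 100
          denominator: HUNDRED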
      response_headers_to_add:
        - append: false
          header:
            key: x-local-rate-limit
            value: "true"

Global Rate Limit Service (gRPC)

For true distributed rate limiting in a service mesh, Envoy calls out to a global rate
limit service over gRPC before forwarding each request.

# Envoy global rate limit config
rate_limits:
  - actions:
      - request_headers:
          header_name: x-user-id
          descriptor_key: user_id
      - request_headers:
          header_name: ":path"
          descriptor_key: path

The external rate limit service (e.g., Lyft's ratelimit service) then decides allow/deny
based on the descriptor and configured rules.

Istio Rate Limiting

# Istio EnvoyFilter for rate limiting
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: filter-ratelimit
spec:
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_INBOUND
        listener:
          filterChain:
            filter:
              name: envoy.filters.network.http_connection_manager
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.ratelimit
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
            domain: productpage-ratelimit
            failure_mode_deny: false # fail open
            rate_limit_service:
              grpc_service:
                envoy_grpc:
                  cluster_name: rate_limit_cluster
              transport_api_version: V3

11. Multi-Region Rate Limiting

The CAP Theorem Trade-Off

Multi-region rate limiting forces a fundamental choice:

  • Consistency: Every region enforces the exact same limit
  • Availability: Every region responds even if others are unreachable
  • Partition Tolerance: Required (network partitions between regions happen)

You must choose between C and A. Rate limiting typically favors Availability (AP):
it is better to allow a few extra requests than to block all requests because a cross-region
connection failed.

Approaches

Approach 1: Separate limits per region

User limit: 1000/minute globally
Region US-EAST: 500/minute
Region EU-WEST: 300/minute
Region AP-SOUTH: 200/minute
Total enforced: 1000/minute

Users are redirected to specific regions based on geography. Each region enforces its
own limit independently with no cross-region communication.

Limitation: A user who routes through a VPN or CDN can exceed their limit by switching
regions.
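
Splitting a global limit into regional shares is simple integer arithmetic. A minimal sketch, assuming the weights are integer percentages or traffic shares chosen by the operator (the function name and region keys are illustrative):

```python
def split_limit(global_limit: int, weights: dict[str, int]) -> dict[str, int]:
    """Divide a global limit across regions in proportion to integer weights."""
    total = sum(weights.values())
    shares = {region: global_limit * w // total for region, w in weights.items()}
    # Hand out units lost to integer division, largest weight first,
    # so the regional limits always add up to the global limit
    leftover = global_limit - sum(shares.values())
    for region in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        shares[region] += 1
    return shares


regional = split_limit(1000, {"us-east": 50, "eu-west": 30, "ap-south": 20})
uneven = split_limit(100, {"a": 1, "b": 1, "c": 1})
print(regional, uneven)
```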

Approach 2: Eventual consistency with local caches

Each region caches the global count, refreshed every N seconds.
Local check: local_count < local_limit
Background sync: push local increments to global store periodically

Approach 3: Centralized global store with read replication

Write region (primary): All increments go to primary Redis
Read region (replica): Check reads can go to regional replica
Lag: Replica is ~5-50ms behind primary

This introduces a brief window where both primary and replica allow a request that the
global limit would deny.
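
A rough way to size that window: requests that arrive during the replication lag are not yet visible to the replica, so the possible overshoot is about request rate times lag. This is a back-of-the-envelope sketch under that assumption, with illustrative numbers, not a measured bound.

```python
import math


def replica_lag_overshoot(requests_per_second: int, lag_ms: int) -> int:
    """Approximate number of requests a lagging replica cannot see yet."""
    return math.ceil(requests_per_second * lag_ms / 1000)


low = replica_lag_overshoot(requests_per_second=200, lag_ms=5)
high = replica_lag_overshoot(requests_per_second=200, lag_ms=50)
print(low, high)
```

For a client sending 200 requests per second, the 5-50ms lag range above corresponds to somewhere between 1 and 10 requests slipping past the limit.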

Approach 4: Accept approximate limiting
For most APIs, allowing a 5-10% overshoot of the limit due to multi-region lag is acceptable.
Enforce exact limits only on the most critical endpoints (payment APIs, etc.) using
synchronous cross-region calls.


12. Handling Redis Failures: Fail-Open vs Fail-Closed

The Dilemma

If your Redis rate limiter goes down, what should happen to incoming requests?

Fail-Open (Allow):

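import logging
import redis

logger = logging.getLogger(__name__)

def is_allowed(limiter, identifier: str) -> bool:
    # limiter: any Redis-backed limiter, e.g. AtomicSlidingWindowLimiter from section 5
    try:
        return limiter.is_allowed(identifier)["allowed"]
    except redis.RedisError:
        logger.error("Redis down! Failing open (allowing all requests)")
        return True  # Allow when Redis is unavailable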

Pros: Service stays available. Users not impacted.
Cons: Opens door to abuse. All rate limits bypassed.

Fail-Closed (Deny):

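import logging
import redis

logger = logging.getLogger(__name__)

def is_allowed(limiter, identifier: str) -> bool:
    # limiter: any Redis-backed limiter, e.g. AtomicSlidingWindowLimiter from section 5
    try:
        return limiter.is_allowed(identifier)["allowed"]
    except redis.RedisError:
        logger.error("Redis down! Failing closed (denying all requests)")
        return False  # Deny when Redis is unavailable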

Pros: Security maintained. No abuse possible during outage.
Cons: Service unavailable. All users impacted, not just bad actors.

Circuit Breaker with Local Fallback:

import time
from enum import Enum
 
 
class CircuitState(Enum):
    CLOSED = "closed"      # Normal: using Redis
    OPEN = "open"          # Failure: using local fallback
    HALF_OPEN = "half_open"  # Testing: trying Redis again
 
 
class ResilientRateLimiter:
    """
    Rate limiter with circuit breaker pattern.
    Falls back to local in-memory limiting when Redis is unavailable.
    Not thread-safe as written; guard the state with a lock if shared.
    """
 
    def __init__(self, redis_limiter, local_limiter, failure_threshold=5, recovery_timeout=60):
        self.redis_limiter = redis_limiter
        self.local_limiter = local_limiter
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = CircuitState.CLOSED
        self.last_failure_time = None
 
    def _fallback(self, identifier: str) -> dict:
        # Assumes the local limiter returns a bool (like LocalRateLimiter);
        # wrap it so callers always get a dict like the Redis limiter's
        return {"allowed": bool(self.local_limiter.is_allowed(identifier)), "fallback": True}
 
    def is_allowed(self, identifier: str) -> dict:
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
            else:
                # Circuit open: use local limiter as fallback
                return self._fallback(identifier)
 
        try:
            result = self.redis_limiter.is_allowed(identifier)
            # Success: close the circuit and clear the failure streak, so only
            # consecutive failures can trip the breaker
            self.state = CircuitState.CLOSED
            self.failure_count = 0
            return result
 
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
 
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
                # Alert! Rate limiter is down
                # alert_system.send("Redis rate limiter DOWN, using local fallback")
 
            # Fallback to local limiting
            return self._fallback(identifier)

Decision Guide: Fail-Open vs Fail-Closed

Scenario                 | Recommendation
Public API (general use) | Fail-Open. User experience > perfect enforcement.
Payment / Financial API  | Fail-Closed or strict local fallback.
Authentication endpoints | Fail-Closed. Security critical.
Read-only endpoints      | Fail-Open. Low abuse risk.
Internal service calls   | Fail-Open. All callers are trusted.
Free tier API            | Fail-Open with monitoring. Alert on anomalies.

Summary

Challenge                           | Solution
Multiple instances, no shared state | Centralized Redis rate limiter
Race conditions                     | Redis Lua scripts (atomic execution)
Clock skew                          | Use one time source: Redis server time, or NTP-synced clients passing the timestamp as ARGV
Redis Cluster key routing           | Use hash tags {user_id} in key names
High Redis latency                  | Hybrid local + global approach
Redis failure                       | Circuit breaker with local fallback
Multi-region accuracy               | Accept approximation or use primary-region enforcement
Sticky sessions failure             | Do not rely on sticky sessions for rate limiting

Next: Part 5 - Advanced Concepts and Industry Practices

Learn adaptive rate limiting, tiered systems, how Twitter, GitHub, and Stripe do it,
and the tips, pitfalls, and anti-patterns that matter most in production.