Rate Limiting Demystified - Part 2: Algorithms Deep Dive
Series Navigation:
Index |
Part 1 - Fundamentals |
Part 3 - Implementation |
Part 4 - Distributed |
Part 5 - Advanced |
Part 6 - Interview Questions
Table of Contents
- Fixed Window Counter
- Sliding Window Log
- Sliding Window Counter (Hybrid)
- Token Bucket
- Leaky Bucket
- GCRA - Generic Cell Rate Algorithm
- Algorithm Comparison Table
- Decision Framework: Choosing the Right Algorithm
1. Fixed Window Counter
How It Works
Divide time into fixed, discrete windows (e.g., every 60 seconds). Count requests within
each window. If the count exceeds the limit, reject.
Timeline:
|--- Window 1 ---|--- Window 2 ---|--- Window 3 ---|
| 0s 60s | 60s 120s | 120s 180s |
| count: 100 | count: 47 | count: 0 |
At second 0, the window resets to 0. Every request increments the counter. At second 60,
the window resets again.
The Boundary Problem (Critical Flaw)
The fixed window has one famous flaw: a user can send 2x the limit in a short time by
exploiting window boundaries.
Time: ... 11:59:45 11:59:55 12:00:05 12:00:15 ...
| | | |
Window: Window 1 (limit 100) | Window 2 (limit 100)
Requests: 100 requests at 11:59:50 (end of Window 1, all 100 allowed)
+ 100 requests at 12:00:10 (start of Window 2, all 100 allowed)
Result: 200 requests in 20 seconds!
Effective rate: 10x the intended 1 per second limit.
This is not a theoretical problem - it is exploited in practice.
Python Implementation
import time
from collections import defaultdict
import threading
class FixedWindowRateLimiter:
"""
Fixed Window Counter Rate Limiter.
Thread-safe implementation for single-process use.
For distributed use, replace with Redis INCR + EXPIRE.
"""
def __init__(self, limit: int, window_size_seconds: int):
self.limit = limit
self.window_size = window_size_seconds
self.counters = defaultdict(int)
self.lock = threading.Lock()
def _get_window_key(self, identifier: str) -> str:
current_window = int(time.time()) // self.window_size
return f"{identifier}:{current_window}"
def is_allowed(self, identifier: str) -> bool:
"""
Returns True if the request is allowed, False if rate limited.
"""
with self.lock:
key = self._get_window_key(identifier)
self.counters[key] += 1
return self.counters[key] <= self.limit
def get_remaining(self, identifier: str) -> int:
with self.lock:
key = self._get_window_key(identifier)
used = self.counters.get(key, 0)
return max(0, self.limit - used)
def get_reset_time(self) -> int:
"""Returns Unix timestamp when the current window resets."""
current_window = int(time.time()) // self.window_size
return (current_window + 1) * self.window_size
# Usage
limiter = FixedWindowRateLimiter(limit=100, window_size_seconds=60)
for i in range(105):
user_id = "user_123"
if limiter.is_allowed(user_id):
print(f"Request {i+1}: ALLOWED (remaining: {limiter.get_remaining(user_id)})")
else:
print(f"Request {i+1}: REJECTED (resets at: {limiter.get_reset_time()})")Redis Implementation
import redis
import time
def is_allowed_fixed_window(
redis_client: redis.Redis,
identifier: str,
limit: int,
window_seconds: int
) -> tuple[bool, dict]:
"""
Fixed window rate limiting using Redis INCR + EXPIRE.
Returns (is_allowed, metadata_dict).
"""
current_window = int(time.time()) // window_seconds
key = f"rl:fw:{identifier}:{current_window}"
# Pipeline for atomicity: INCR and EXPIRE in one round trip
pipe = redis_client.pipeline()
pipe.incr(key)
pipe.ttl(key)
results = pipe.execute()
count = results[0]
ttl = results[1]
# Set expiry only on first request in this window
if count == 1:
redis_client.expire(key, window_seconds * 2)
ttl = window_seconds
allowed = count <= limit
return allowed, {
"limit": limit,
"remaining": max(0, limit - count),
"reset_in": ttl,
"current_count": count
}Pros and Cons
| Pros | Cons |
|---|---|
| Simple to understand | Boundary problem allows 2x burst |
| Very low memory usage (1 counter per user) | Not accurate at window edges |
| O(1) time complexity | Sudden reset can cause spikes |
| Easy to implement with Redis INCR |
Best Used For
- Simple internal APIs where boundary accuracy is not critical
- Memory-constrained environments
- Quick prototypes
2. Sliding Window Log
How It Works
Instead of counting in discrete windows, keep a log (list) of timestamps of every request.
When a new request arrives:
- Remove all timestamps older than
now - window_size - Count remaining timestamps
- If count < limit, allow the request and add current timestamp
- Otherwise, reject
Window = 60 seconds, Limit = 5
Time: 10:00:00
Log: [empty]
Request arrives -> log: [10:00:00] -> count: 1 -> ALLOW
Time: 10:00:30
Log: [10:00:00]
Request arrives -> log: [10:00:00, 10:00:30] -> count: 2 -> ALLOW
Time: 10:01:00
Log: [10:00:00, 10:00:30] -- remove 10:00:00 (it's > 60s old)
Log: [10:00:30]
Request arrives -> log: [10:00:30, 10:01:00] -> count: 2 -> ALLOW
No boundary problem. The window truly slides with every request.
Python Implementation
import time
from collections import defaultdict, deque
import threading
class SlidingWindowLog:
"""
Sliding Window Log Rate Limiter.
Accurate but memory-intensive: O(limit) per user.
"""
def __init__(self, limit: int, window_size_seconds: float):
self.limit = limit
self.window_size = window_size_seconds
self.logs = defaultdict(deque) # user_id -> deque of timestamps
self.lock = threading.Lock()
def is_allowed(self, identifier: str) -> tuple[bool, dict]:
with self.lock:
now = time.time()
window_start = now - self.window_size
log = self.logs[identifier]
# Step 1: Evict timestamps outside the window
while log and log[0] <= window_start:
log.popleft()
# Step 2: Count current requests in window
current_count = len(log)
# Step 3: Check against limit
if current_count < self.limit:
log.append(now)
return True, {
"limit": self.limit,
"remaining": self.limit - current_count - 1,
"current_count": current_count + 1
}
else:
# Calculate when the oldest request will expire
retry_after = log[0] + self.window_size - now if log else 0
return False, {
"limit": self.limit,
"remaining": 0,
"retry_after": retry_after,
"current_count": current_count
}Redis Implementation (Sorted Sets)
import redis
import time
import uuid
def is_allowed_sliding_window_log(
redis_client: redis.Redis,
identifier: str,
limit: int,
window_seconds: int
) -> tuple[bool, dict]:
"""
Sliding window log using Redis Sorted Sets (ZSET).
Score = timestamp. Members = unique request IDs.
"""
now = time.time()
window_start = now - window_seconds
key = f"rl:swl:{identifier}"
# Use pipeline for batch execution
pipe = redis_client.pipeline()
# Remove timestamps outside the window
pipe.zremrangebyscore(key, 0, window_start)
# Count requests in window
pipe.zcard(key)
# Add current request with timestamp as score
pipe.zadd(key, {str(uuid.uuid4()): now})
# Set key expiry to auto-cleanup
pipe.expire(key, window_seconds * 2)
results = pipe.execute()
count_before = results[1] # count BEFORE adding this request
if count_before < limit:
return True, {
"limit": limit,
"remaining": limit - count_before - 1
}
else:
# Remove the request we just added (it was rejected)
# In production, use a Lua script to make this atomic
redis_client.zpopmax(key)
return False, {
"limit": limit,
"remaining": 0
}Important: The Redis pipeline above is NOT atomic. A concurrent request can sneak in
between steps. Use a Lua script for true atomicity. See Part 3 for the Lua implementation.
Pros and Cons
| Pros | Cons |
|---|---|
| Perfectly accurate - no boundary problem | High memory: O(requests) per user |
| True sliding window behavior | Memory grows with traffic |
| Fair to all users | Cleanup overhead |
Memory Calculation
If your limit is 1,000 requests/minute and you have 100,000 users:
- Each user: 1,000 timestamps x 8 bytes = ~8KB
- 100,000 users: ~800MB just for rate limit logs
This is why sliding window log is rarely used at scale without the hybrid counter approach.
3. Sliding Window Counter (Hybrid)
How It Works
This is the practical solution that balances accuracy with memory efficiency. It combines
fixed windows with a weighted estimate.
Key insight: Instead of storing every timestamp, store just two counters:
- Previous window count
- Current window count
Use a linear interpolation to estimate how many requests occurred in the true sliding window.
Formula:
estimated_count = previous_window_count * (1 - overlap) + current_window_count
where:
overlap = fraction of current window elapsed = elapsed_time / window_size
(1 - overlap) = fraction of previous window still "visible"
Visual Explanation
Window size = 60s, Current time = 12:01:15
Previous window: 12:00:00 - 12:01:00 (count: 84)
Current window: 12:01:00 - 12:02:00 (count: 23, elapsed: 15s)
Overlap of previous window = (60 - 15) / 60 = 0.75 (75% of previous window is still in range)
Estimated count = 84 * 0.75 + 23 = 63 + 23 = 86
If limit is 100: 86 < 100 -> ALLOW
This is an approximation with at most 1/limit error rate. For a limit of 1000, the maximum
error is 0.1%. Cloudflare uses this algorithm.
Python Implementation
import time
from dataclasses import dataclass, field
from collections import defaultdict
import threading
@dataclass
class WindowState:
prev_count: int = 0
curr_count: int = 0
curr_window_start: float = field(default_factory=time.time)
class SlidingWindowCounter:
"""
Sliding Window Counter (Hybrid) Rate Limiter.
Memory: O(1) per user (just two counters).
Accuracy: Within 0.1% error for typical limits.
Used by: Cloudflare, many production APIs.
"""
def __init__(self, limit: int, window_size_seconds: float):
self.limit = limit
self.window_size = window_size_seconds
self.states: dict[str, WindowState] = defaultdict(WindowState)
self.lock = threading.Lock()
def is_allowed(self, identifier: str) -> tuple[bool, dict]:
with self.lock:
now = time.time()
state = self.states[identifier]
# Check if we've moved past the current window
elapsed = now - state.curr_window_start
if elapsed >= self.window_size:
# Advance window: current becomes previous
state.prev_count = state.curr_count
state.curr_count = 0
state.curr_window_start = now
elapsed = 0
# What fraction of the previous window is still in the sliding range?
# (window_size - elapsed) / window_size
prev_weight = (self.window_size - elapsed) / self.window_size
# Weighted estimate
estimated_count = state.prev_count * prev_weight + state.curr_count
if estimated_count < self.limit:
state.curr_count += 1
remaining = self.limit - int(estimated_count) - 1
return True, {
"limit": self.limit,
"remaining": max(0, remaining),
"estimated_count": estimated_count + 1
}
else:
return False, {
"limit": self.limit,
"remaining": 0,
"retry_after": self.window_size - elapsed
}Pros and Cons
| Pros | Cons |
|---|---|
| Memory efficient: O(1) per user | Slightly approximate (small error margin) |
| No boundary spike problem | More complex than fixed window |
| Good accuracy for most use cases | Not perfectly accurate |
| Easy to implement in Redis | Not suitable for very low limits (< 10) |
Best Used For
This is the most commonly recommended algorithm for production REST APIs. It is what
Cloudflare uses internally for its rate limiting infrastructure.
4. Token Bucket
How It Works
Imagine a bucket that holds tokens. Rules:
- The bucket has a maximum capacity of N tokens
- Tokens are added at a constant rate R (e.g., 10 tokens per second)
- Each request consumes 1 token (or more for expensive operations)
- If the bucket has enough tokens, the request is allowed and tokens are consumed
- If not enough tokens, the request is rejected (or waits)
- If the bucket is full, new tokens are discarded (not overflow accumulated)
Capacity: 10 tokens
Refill rate: 2 tokens/second
Time 0: [**********] (10 tokens full)
Request arrives: consume 1 token
Time 0: [*********.] (9 tokens)
Time 1s: [**********] (refilled 2 tokens, but capped at 10)
10 requests arrive rapidly:
Time 1s: [..........] (0 tokens - all 10 used, all allowed)
11th request at Time 1s: REJECTED (no tokens)
Time 2s: [**.] (2 tokens refilled)
2 requests arrive: both allowed
Key Properties
-
Burst Support: Tokens accumulate when traffic is low. A user can save up tokens
and use them in a burst. This is the defining feature of token bucket. -
Smooth Refill: Tokens are added continuously, not in batches. At 2 tokens/second,
a token is available every 500ms. -
Decoupled capacity and rate:
capacitycontrols maximum burst size.ratecontrols
sustained throughput. You can set capacity=100, rate=10/s to allow large bursts but
sustain only 10 RPS.
Python Implementation
import time
import threading
from dataclasses import dataclass
@dataclass
class BucketState:
tokens: float
last_refill: float
class TokenBucket:
"""
Token Bucket Rate Limiter.
Supports burst traffic while enforcing average rate.
Used by: AWS API Gateway, Stripe, Nginx (burst param).
"""
def __init__(self, capacity: float, refill_rate: float):
"""
capacity: maximum tokens in the bucket (also max burst size)
refill_rate: tokens added per second
"""
self.capacity = capacity
self.refill_rate = refill_rate
self.states: dict[str, BucketState] = {}
self.lock = threading.Lock()
def _get_state(self, identifier: str) -> BucketState:
if identifier not in self.states:
self.states[identifier] = BucketState(
tokens=self.capacity,
last_refill=time.time()
)
return self.states[identifier]
def _refill(self, state: BucketState, now: float) -> None:
elapsed = now - state.last_refill
tokens_to_add = elapsed * self.refill_rate
state.tokens = min(self.capacity, state.tokens + tokens_to_add)
state.last_refill = now
def consume(self, identifier: str, tokens: float = 1.0) -> tuple[bool, dict]:
"""
Attempt to consume 'tokens' from the bucket.
Returns (allowed, metadata).
"""
with self.lock:
now = time.time()
state = self._get_state(identifier)
self._refill(state, now)
if state.tokens >= tokens:
state.tokens -= tokens
wait_time = 0
return True, {
"allowed": True,
"tokens_remaining": state.tokens,
"tokens_consumed": tokens
}
else:
# How long until enough tokens are available?
tokens_needed = tokens - state.tokens
wait_time = tokens_needed / self.refill_rate
return False, {
"allowed": False,
"tokens_remaining": state.tokens,
"retry_after_seconds": wait_time
}
def try_consume_or_wait(
self, identifier: str,
tokens: float = 1.0,
max_wait_seconds: float = 0
) -> bool:
"""
Optionally block/wait until a token is available.
max_wait_seconds=0 means never wait (pure non-blocking).
"""
allowed, meta = self.consume(identifier, tokens)
if allowed:
return True
retry_after = meta.get("retry_after_seconds", 0)
if max_wait_seconds > 0 and retry_after <= max_wait_seconds:
time.sleep(retry_after)
return self.try_consume_or_wait(identifier, tokens, 0)
return False
# Usage Example
bucket = TokenBucket(capacity=10, refill_rate=2) # 10 burst, 2/second sustained
# Fast burst - first 10 allowed
for i in range(12):
allowed, meta = bucket.consume("user_123")
print(f"Request {i+1}: {'ALLOW' if allowed else 'DENY'} "
f"(tokens: {meta.get('tokens_remaining', 0):.1f})")
# Wait for tokens to refill
time.sleep(1)
print("\nAfter 1 second (2 tokens refilled):")
for i in range(3):
allowed, meta = bucket.consume("user_123")
print(f"Request {i+1}: {'ALLOW' if allowed else 'DENY'} "
f"(tokens: {meta.get('tokens_remaining', 0):.1f})")Cost-Based Token Consumption
A powerful feature: different operations consume different numbers of tokens.
class CostBasedTokenBucket(TokenBucket):
"""
Different operations cost different numbers of tokens.
Example: GraphQL queries with varying complexity.
"""
OPERATION_COSTS = {
"simple_read": 1,
"list_query": 3,
"complex_search": 10,
"export": 25,
"bulk_delete": 50,
}
def is_operation_allowed(self, identifier: str, operation: str) -> bool:
cost = self.OPERATION_COSTS.get(operation, 1)
allowed, _ = self.consume(identifier, tokens=cost)
return allowedPros and Cons
| Pros | Cons |
|---|---|
| Allows controlled bursting | Slightly more complex than fixed window |
| Continuous token refill is fair | Requires float arithmetic |
| Natural fit for sustained rate limits | State per user needed |
| Supports weighted operations |
Best Used For
- APIs where users should be allowed to burst occasionally
- Rate limiting expensive operations with different costs
- SDKs and client libraries
- Stripe, AWS API Gateway, and many cloud provider APIs use this
5. Leaky Bucket
How It Works
The leaky bucket analogy: water (requests) enters a bucket from the top. The bucket has
a hole at the bottom that lets water drip out at a constant rate. If the bucket overflows
(is full), additional water (requests) is discarded.
Key properties:
- Requests are QUEUED in the bucket (not immediately processed)
- Processing happens at a FIXED, CONSTANT rate
- If the queue is full, new requests are dropped
- Output traffic is always smooth and uniform
Incoming: ||||| ||||||||||| || ||||||||||||||||||||
(burst) (burst) (big burst, some dropped)
Processing: | | | | | | | | | | | | | | | | | | | | |
(steady, uniform output - all requests processed at same rate)
The key difference from token bucket:
- Token bucket: processes requests immediately if tokens available (bursty output OK)
- Leaky bucket: always outputs at the same rate regardless of input (smooth output)
Python Implementation (Queue-Based)
import queue
import threading
import time
from typing import Callable, Any
class LeakyBucket:
"""
Leaky Bucket Rate Limiter.
Output rate is always constant regardless of input bursts.
Used for: Traffic shaping, smoothing output to downstream services.
"""
def __init__(self, capacity: int, leak_rate: float):
"""
capacity: max requests the queue can hold
leak_rate: requests processed per second (constant)
"""
self.capacity = capacity
self.leak_rate = leak_rate
self.queue = queue.Queue(maxsize=capacity)
self.running = False
self._processor_thread = None
def submit(self, request: Any) -> bool:
"""
Submit a request to the bucket.
Returns True if accepted (queued), False if dropped (bucket full).
"""
try:
self.queue.put_nowait(request)
return True
except queue.Full:
return False # Drop the request
def start(self, processor: Callable[[Any], None]) -> None:
"""Start the background processor that leaks at a fixed rate."""
self.running = True
self._processor_thread = threading.Thread(
target=self._leak, args=(processor,), daemon=True
)
self._processor_thread.start()
def _leak(self, processor: Callable[[Any], None]) -> None:
interval = 1.0 / self.leak_rate
while self.running:
try:
request = self.queue.get(timeout=interval)
processor(request)
except queue.Empty:
pass
time.sleep(interval)
def stop(self) -> None:
self.running = False
# Usage Example
def process_request(req):
print(f"Processing: {req} at {time.time():.2f}")
bucket = LeakyBucket(capacity=20, leak_rate=5) # queue 20, process 5/second
bucket.start(process_request)
# Send a burst of 30 requests instantly
for i in range(30):
accepted = bucket.submit(f"request_{i+1}")
if not accepted:
print(f"request_{i+1}: DROPPED (bucket full)")
time.sleep(5) # Watch them drain at 5/second
bucket.stop()As a Counter (Redis-Compatible)
The leaky bucket can also be implemented as a counter algorithm (without a real queue):
class LeakyBucketCounter:
"""
Leaky bucket as a counter - useful for Redis implementation.
Instead of a real queue, tracks the "water level" as a number.
"""
def __init__(self, capacity: int, leak_rate: float):
self.capacity = capacity
self.leak_rate = leak_rate # units per second
self.levels: dict[str, dict] = {}
self.lock = threading.Lock()
def is_allowed(self, identifier: str) -> bool:
with self.lock:
now = time.time()
if identifier not in self.levels:
self.levels[identifier] = {
"level": 0.0,
"last_check": now
}
state = self.levels[identifier]
elapsed = now - state["last_check"]
# Leak out based on elapsed time
leaked = elapsed * self.leak_rate
state["level"] = max(0.0, state["level"] - leaked)
state["last_check"] = now
# Check if there's room
if state["level"] + 1 <= self.capacity:
state["level"] += 1
return True
return FalsePros and Cons
| Pros | Cons |
|---|---|
| Perfectly smooth output traffic | Does NOT allow bursting |
| Protects downstream at a precise rate | Queue adds latency |
| Simple to reason about output | Requests may wait a long time in queue |
| Good for traffic shaping | Drop policy discards valid requests |
Token Bucket vs Leaky Bucket
| Scenario | Use Token Bucket | Use Leaky Bucket |
|---|---|---|
| API rate limiting (user-facing) | Yes - allow bursts | Usually not |
| Calling a payment gateway | Yes - occasional bursts OK | OK if smooth calls needed |
| Protecting a fragile DB | Maybe | Yes - DB gets predictable load |
| Traffic shaping at network level | No | Yes |
| Allowing short spikes | Yes | No |
6. GCRA - Generic Cell Rate Algorithm
Overview
GCRA (Generic Cell Rate Algorithm), also known as the Virtual Scheduling Algorithm, originated
in ATM (Asynchronous Transfer Mode) networking. It is a highly precise, elegant algorithm
that provides the same behavior as leaky bucket but with O(1) memory per client.
GCRA is used by Stripe's rate limiter (open-sourced in their Throttled library).
Core Concept: Theoretical Arrival Time (TAT)
The key concept in GCRA is the Theoretical Arrival Time (TAT) - when the next request
should theoretically arrive if traffic is perfectly spaced.
emission_interval = 1 / rate (time between two consecutive allowed requests)
TAT = max(now, previous_TAT) + emission_interval
If TAT <= now + burst_tolerance:
Allow the request
Update TAT
Else:
Reject (too early)
Example
Rate: 10 requests/second
Emission interval: 100ms
Burst tolerance: 500ms (allows up to 5 extra requests in a burst)
Request 1 at t=0ms:
TAT = max(0, 0) + 100 = 100ms
100 <= 0 + 500? -> 100 <= 500 -> YES -> ALLOW, TAT = 100ms
Request 2 at t=0ms (immediately after):
TAT = max(0, 100) + 100 = 200ms
200 <= 0 + 500? -> 200 <= 500 -> YES -> ALLOW, TAT = 200ms
Request 6 at t=0ms (6th rapid request):
TAT = max(0, 500) + 100 = 600ms
600 <= 0 + 500? -> 600 <= 500? -> NO -> REJECT
Request after 600ms:
TAT = max(600, 500) + 100 = 700ms
700 <= 600 + 500? -> 700 <= 1100? -> YES -> ALLOW
Python Implementation
import time
from dataclasses import dataclass
import threading
class GCRARateLimiter:
"""
GCRA (Generic Cell Rate Algorithm) Rate Limiter.
Also known as: Virtual Scheduling Algorithm, Leaky Bucket as a Meter.
Used by: Stripe's throttled library.
Memory: O(1) per user (single timestamp).
"""
def __init__(self, rate: float, burst: int):
"""
rate: requests per second allowed (sustained)
burst: maximum burst size above the sustained rate
"""
self.emission_interval = 1.0 / rate # seconds between requests at sustained rate
self.burst_offset = burst * self.emission_interval # TAT tolerance
self.tat_store: dict[str, float] = {} # user -> TAT
self.lock = threading.Lock()
def is_allowed(self, identifier: str) -> tuple[bool, dict]:
with self.lock:
now = time.time()
# Get or initialize TAT
previous_tat = self.tat_store.get(identifier, now)
# New TAT = max(now, previous_TAT) + emission_interval
new_tat = max(now, previous_tat) + self.emission_interval
# Allowed if new TAT is within burst tolerance
allowed_at = new_tat - self.burst_offset
if allowed_at <= now:
self.tat_store[identifier] = new_tat
return True, {
"allowed": True,
"next_allowed_in": 0
}
else:
# How long until the request would be allowed?
wait_time = allowed_at - now
return False, {
"allowed": False,
"retry_after_seconds": wait_time
}
# Usage
gcra = GCRARateLimiter(rate=10, burst=5) # 10 RPS sustained, 5 burst
# Test burst
for i in range(8):
allowed, meta = gcra.is_allowed("user_123")
print(f"Request {i+1}: {'ALLOW' if allowed else f'DENY (retry in {meta[\"retry_after_seconds\"]:.3f}s)'}")GCRA vs Token Bucket Comparison
Both achieve the same effective behavior, but implement it differently:
Token Bucket:
- Tracks: (current_tokens, last_refill_time)
- Memory: 2 values per user
- Conceptual: "How many tokens do I have?"
GCRA:
- Tracks: (TAT - theoretical arrival time)
- Memory: 1 value per user
- Conceptual: "When should the next request theoretically arrive?"
Pros and Cons
| Pros | Cons |
|---|---|
| Minimal memory: 1 timestamp per user | Less intuitive to understand |
| Highly precise | Harder to explain to stakeholders |
| Naturally allows burst | Less widely known |
| Used in production at Stripe |
7. Algorithm Comparison Table
| Property | Fixed Window | Sliding Window Log | Sliding Window Counter | Token Bucket | Leaky Bucket | GCRA |
|---|---|---|---|---|---|---|
| Memory per user | O(1) | O(limit) | O(1) | O(1) | O(capacity) | O(1) |
| Accuracy | Low (boundary bug) | Highest | High (~1% error) | High | High | High |
| Burst support | No | No | No | Yes | No | Yes (configurable) |
| Smooth output | No | No | No | No | Yes | No |
| Implementation complexity | Very low | Medium | Medium | Low | Medium | Medium |
| Redis-friendly | Yes | Yes (ZSET) | Yes | Yes (Lua) | Yes | Yes (Lua) |
| Real-world usage | Very common | Rare at scale | Very common | Very common | Niche | Niche (Stripe) |
| Time complexity | O(1) | O(log n) ZSET ops | O(1) | O(1) | O(1) | O(1) |
8. Decision Framework: Choosing the Right Algorithm
Start with these questions:
Q1: Do you need to allow burst traffic?
- YES -> Token Bucket or GCRA
- NO -> Any algorithm works; prefer Sliding Window Counter for accuracy
Q2: How much memory can you afford per user?
- Very constrained -> Fixed Window (1 int) or Token Bucket (2 floats) or GCRA (1 float)
- Moderate -> Sliding Window Counter (2 ints + 1 float)
- Generous -> Sliding Window Log (but only for low-limit, high-accuracy needs)
Q3: How accurate must the limit be?
- Exact accuracy required (payments, security) -> Sliding Window Log with Lua
- Approximate is fine (general API) -> Sliding Window Counter or Token Bucket
- Rough is fine (DOS protection) -> Fixed Window
Q4: Do you need smooth output to protect a downstream service?
- YES -> Leaky Bucket
- NO -> Any other algorithm
Q5: How complex can your implementation be?
- Simple/quick -> Fixed Window
- Standard -> Token Bucket or Sliding Window Counter
- Advanced -> GCRA
Recommendation by Use Case
| Use Case | Recommended Algorithm | Reason |
|---|---|---|
| General REST API rate limiting | Sliding Window Counter | Accuracy + memory efficiency |
| Public API with burst allowance | Token Bucket | Burst support + simplicity |
| Security: login brute force | Sliding Window Log | Accuracy is critical |
| Payment API rate limiting | Token Bucket or GCRA | Precise control |
| Traffic shaping to a fragile DB | Leaky Bucket | Smooth, predictable output |
| Simple internal API | Fixed Window | Good enough, very simple |
| High-volume SDK | Token Bucket | SDK clients need burst support |
| GraphQL with query cost | Token Bucket (cost-based) | Variable token consumption |
Real-World Choices
| Company / Product | Algorithm Used |
|---|---|
| Stripe | GCRA (via Throttled library) |
| GitHub REST API | Fixed Window (per hour) |
| GitHub GraphQL API | Token Bucket (point-based) |
| Cloudflare | Sliding Window Counter |
| AWS API Gateway | Token Bucket |
| Nginx | Token Bucket (limit_req with burst) |
| Kong Gateway | Sliding Window (configurable) |
| Twitter/X v2 API | Fixed Window per 15-min window |
Summary
Fixed Window: Simple, low memory. Boundary bug allows 2x bursts.
Sliding Window Log: Accurate, high memory. Best for critical endpoints.
Sliding Window Counter: Accurate enough, low memory. Best general-purpose choice.
Token Bucket: Allows bursting. Best for user-facing APIs.
Leaky Bucket: Smooth output. Best for protecting downstream services.
GCRA: Like token bucket but 1 value per user. Used by Stripe.
Next: Part 3 - Implementation Guide
See production-ready code for all these algorithms using Redis, Java/Spring Boot,
Python, Node.js, Nginx, and API Gateways.