Rate Limiting Demystified - Part 2: Algorithms Deep Dive

Fixed Window Counter
Sliding Window Log
Sliding Window Counter (Hybrid)
Token Bucket
Leaky Bucket
GCRA - Generic Cell Rate Algorithm
Algorithm Comparison Table
Decision Framework: Choosing the Right Algorithm

1. Fixed Window Counter

How It Works

Divide time into fixed, discrete windows (e.g., every 60 seconds). Count requests within
each window. If the count exceeds the limit, reject.

Timeline:
|--- Window 1 ---|--- Window 2 ---|--- Window 3 ---|
| 0s        60s  | 60s      120s  | 120s     180s  |
|  count: 100    |  count: 47     |  count: 0      |

At second 0, the window resets to 0. Every request increments the counter. At second 60,
the window resets again.

The Boundary Problem (Critical Flaw)

The fixed window has one famous flaw: a user can send 2x the limit in a short time by
exploiting window boundaries.

Time:    ... 11:59:45  11:59:55  12:00:05  12:00:15 ...
                 |          |         |          |
Window:  Window 1 (limit 100) | Window 2 (limit 100)

Requests:  100 requests at 11:59:50  (end of Window 1, all 100 allowed)
         + 100 requests at 12:00:10  (start of Window 2, all 100 allowed)

Result: 200 requests in 20 seconds!
Effective rate: 10x the intended 1 per second limit.

This is not a theoretical problem - it is exploited in practice.

Python Implementation

import time
from collections import defaultdict
import threading
 
 
class FixedWindowRateLimiter:
    """
    Fixed Window Counter Rate Limiter.
    Thread-safe implementation for single-process use.
    For distributed use, replace with Redis INCR + EXPIRE.
    """
 
    def __init__(self, limit: int, window_size_seconds: int):
        self.limit = limit
        self.window_size = window_size_seconds
        self.counters = defaultdict(int)
        self.lock = threading.Lock()
 
    def _get_window_key(self, identifier: str) -> str:
        current_window = int(time.time()) // self.window_size
        return f"{identifier}:{current_window}"
 
    def is_allowed(self, identifier: str) -> bool:
        """
        Returns True if the request is allowed, False if rate limited.
        """
        with self.lock:
            key = self._get_window_key(identifier)
            self.counters[key] += 1
            return self.counters[key] <= self.limit
 
    def get_remaining(self, identifier: str) -> int:
        with self.lock:
            key = self._get_window_key(identifier)
            used = self.counters.get(key, 0)
            return max(0, self.limit - used)
 
    def get_reset_time(self) -> int:
        """Returns Unix timestamp when the current window resets."""
        current_window = int(time.time()) // self.window_size
        return (current_window + 1) * self.window_size
 
 
# Usage
limiter = FixedWindowRateLimiter(limit=100, window_size_seconds=60)
 
for i in range(105):
    user_id = "user_123"
    if limiter.is_allowed(user_id):
        print(f"Request {i+1}: ALLOWED (remaining: {limiter.get_remaining(user_id)})")
    else:
        print(f"Request {i+1}: REJECTED (resets at: {limiter.get_reset_time()})")

Redis Implementation

import redis
import time
 
 
def is_allowed_fixed_window(
    redis_client: redis.Redis,
    identifier: str,
    limit: int,
    window_seconds: int
) -> tuple[bool, dict]:
    """
    Fixed window rate limiting using Redis INCR + EXPIRE.
    Returns (is_allowed, metadata_dict).
    """
    current_window = int(time.time()) // window_seconds
    key = f"rl:fw:{identifier}:{current_window}"
 
    # Pipeline for atomicity: INCR and EXPIRE in one round trip
    pipe = redis_client.pipeline()
    pipe.incr(key)
    pipe.ttl(key)
    results = pipe.execute()
 
    count = results[0]
    ttl = results[1]
 
    # Set expiry only on first request in this window
    if count == 1:
        redis_client.expire(key, window_seconds * 2)
        ttl = window_seconds
 
    allowed = count <= limit
    return allowed, {
        "limit": limit,
        "remaining": max(0, limit - count),
        "reset_in": ttl,
        "current_count": count
    }

Pros and Cons

Pros	Cons
Simple to understand	Boundary problem allows 2x burst
Very low memory usage (1 counter per user)	Not accurate at window edges
O(1) time complexity	Sudden reset can cause spikes
Easy to implement with Redis INCR

Best Used For

Simple internal APIs where boundary accuracy is not critical
Memory-constrained environments
Quick prototypes

2. Sliding Window Log

How It Works

Instead of counting in discrete windows, keep a log (list) of timestamps of every request.
When a new request arrives:

Remove all timestamps older than now - window_size
Count remaining timestamps
If count < limit, allow the request and add current timestamp
Otherwise, reject

Window = 60 seconds, Limit = 5

Time: 10:00:00
Log: [empty]
Request arrives -> log: [10:00:00] -> count: 1 -> ALLOW

Time: 10:00:30
Log: [10:00:00]
Request arrives -> log: [10:00:00, 10:00:30] -> count: 2 -> ALLOW

Time: 10:01:00
Log: [10:00:00, 10:00:30] -- remove 10:00:00 (it's > 60s old)
Log: [10:00:30]
Request arrives -> log: [10:00:30, 10:01:00] -> count: 2 -> ALLOW

No boundary problem. The window truly slides with every request.

Python Implementation

import time
from collections import defaultdict, deque
import threading
 
 
class SlidingWindowLog:
    """
    Sliding Window Log Rate Limiter.
    Accurate but memory-intensive: O(limit) per user.
    """
 
    def __init__(self, limit: int, window_size_seconds: float):
        self.limit = limit
        self.window_size = window_size_seconds
        self.logs = defaultdict(deque)  # user_id -> deque of timestamps
        self.lock = threading.Lock()
 
    def is_allowed(self, identifier: str) -> tuple[bool, dict]:
        with self.lock:
            now = time.time()
            window_start = now - self.window_size
            log = self.logs[identifier]
 
            # Step 1: Evict timestamps outside the window
            while log and log[0] <= window_start:
                log.popleft()
 
            # Step 2: Count current requests in window
            current_count = len(log)
 
            # Step 3: Check against limit
            if current_count < self.limit:
                log.append(now)
                return True, {
                    "limit": self.limit,
                    "remaining": self.limit - current_count - 1,
                    "current_count": current_count + 1
                }
            else:
                # Calculate when the oldest request will expire
                retry_after = log[0] + self.window_size - now if log else 0
                return False, {
                    "limit": self.limit,
                    "remaining": 0,
                    "retry_after": retry_after,
                    "current_count": current_count
                }

Redis Implementation (Sorted Sets)

import redis
import time
import uuid
 
 
def is_allowed_sliding_window_log(
    redis_client: redis.Redis,
    identifier: str,
    limit: int,
    window_seconds: int
) -> tuple[bool, dict]:
    """
    Sliding window log using Redis Sorted Sets (ZSET).
    Score = timestamp. Members = unique request IDs.
    """
    now = time.time()
    window_start = now - window_seconds
    key = f"rl:swl:{identifier}"
 
    # Use pipeline for batch execution
    pipe = redis_client.pipeline()
    # Remove timestamps outside the window
    pipe.zremrangebyscore(key, 0, window_start)
    # Count requests in window
    pipe.zcard(key)
    # Add current request with timestamp as score
    pipe.zadd(key, {str(uuid.uuid4()): now})
    # Set key expiry to auto-cleanup
    pipe.expire(key, window_seconds * 2)
    results = pipe.execute()
 
    count_before = results[1]  # count BEFORE adding this request
 
    if count_before < limit:
        return True, {
            "limit": limit,
            "remaining": limit - count_before - 1
        }
    else:
        # Remove the request we just added (it was rejected)
        # In production, use a Lua script to make this atomic
        redis_client.zpopmax(key)
        return False, {
            "limit": limit,
            "remaining": 0
        }

Important: The Redis pipeline above is NOT atomic. A concurrent request can sneak in
between steps. Use a Lua script for true atomicity. See Part 3 for the Lua implementation.

Pros and Cons

Pros	Cons
Perfectly accurate - no boundary problem	High memory: O(requests) per user
True sliding window behavior	Memory grows with traffic
Fair to all users	Cleanup overhead

Memory Calculation

If your limit is 1,000 requests/minute and you have 100,000 users:

Each user: 1,000 timestamps x 8 bytes = ~8KB
100,000 users: ~800MB just for rate limit logs

This is why sliding window log is rarely used at scale without the hybrid counter approach.

3. Sliding Window Counter (Hybrid)

How It Works

This is the practical solution that balances accuracy with memory efficiency. It combines
fixed windows with a weighted estimate.

Key insight: Instead of storing every timestamp, store just two counters:

Previous window count
Current window count

Use a linear interpolation to estimate how many requests occurred in the true sliding window.

Formula:

estimated_count = previous_window_count * (1 - overlap) + current_window_count

where:
  overlap = fraction of current window elapsed = elapsed_time / window_size
  (1 - overlap) = fraction of previous window still "visible"

Visual Explanation

Window size = 60s, Current time = 12:01:15

Previous window: 12:00:00 - 12:01:00 (count: 84)
Current window:  12:01:00 - 12:02:00 (count: 23, elapsed: 15s)

Overlap of previous window = (60 - 15) / 60 = 0.75 (75% of previous window is still in range)

Estimated count = 84 * 0.75 + 23 = 63 + 23 = 86

If limit is 100: 86 < 100 -> ALLOW

This is an approximation with at most 1/limit error rate. For a limit of 1000, the maximum
error is 0.1%. Cloudflare uses this algorithm.

Python Implementation

import time
from dataclasses import dataclass, field
from collections import defaultdict
import threading
 
 
@dataclass
class WindowState:
    prev_count: int = 0
    curr_count: int = 0
    curr_window_start: float = field(default_factory=time.time)
 
 
class SlidingWindowCounter:
    """
    Sliding Window Counter (Hybrid) Rate Limiter.
    Memory: O(1) per user (just two counters).
    Accuracy: Within 0.1% error for typical limits.
    Used by: Cloudflare, many production APIs.
    """
 
    def __init__(self, limit: int, window_size_seconds: float):
        self.limit = limit
        self.window_size = window_size_seconds
        self.states: dict[str, WindowState] = defaultdict(WindowState)
        self.lock = threading.Lock()
 
    def is_allowed(self, identifier: str) -> tuple[bool, dict]:
        with self.lock:
            now = time.time()
            state = self.states[identifier]
 
            # Check if we've moved past the current window
            elapsed = now - state.curr_window_start
 
            if elapsed >= self.window_size:
                # Advance window: current becomes previous
                state.prev_count = state.curr_count
                state.curr_count = 0
                state.curr_window_start = now
                elapsed = 0
 
            # What fraction of the previous window is still in the sliding range?
            # (window_size - elapsed) / window_size
            prev_weight = (self.window_size - elapsed) / self.window_size
 
            # Weighted estimate
            estimated_count = state.prev_count * prev_weight + state.curr_count
 
            if estimated_count < self.limit:
                state.curr_count += 1
                remaining = self.limit - int(estimated_count) - 1
                return True, {
                    "limit": self.limit,
                    "remaining": max(0, remaining),
                    "estimated_count": estimated_count + 1
                }
            else:
                return False, {
                    "limit": self.limit,
                    "remaining": 0,
                    "retry_after": self.window_size - elapsed
                }

Pros and Cons

Pros	Cons
Memory efficient: O(1) per user	Slightly approximate (small error margin)
No boundary spike problem	More complex than fixed window
Good accuracy for most use cases	Not perfectly accurate
Easy to implement in Redis	Not suitable for very low limits (< 10)

Best Used For

This is the most commonly recommended algorithm for production REST APIs. It is what
Cloudflare uses internally for its rate limiting infrastructure.

4. Token Bucket

How It Works

Imagine a bucket that holds tokens. Rules:

The bucket has a maximum capacity of N tokens
Tokens are added at a constant rate R (e.g., 10 tokens per second)
Each request consumes 1 token (or more for expensive operations)
If the bucket has enough tokens, the request is allowed and tokens are consumed
If not enough tokens, the request is rejected (or waits)
If the bucket is full, new tokens are discarded (not overflow accumulated)

Capacity: 10 tokens
Refill rate: 2 tokens/second

Time 0:   [**********] (10 tokens full)
Request arrives: consume 1 token
Time 0:   [*********.]  (9 tokens)

Time 1s:  [**********] (refilled 2 tokens, but capped at 10)
10 requests arrive rapidly:
Time 1s:  [..........] (0 tokens - all 10 used, all allowed)

11th request at Time 1s: REJECTED (no tokens)

Time 2s:  [**.]       (2 tokens refilled)
2 requests arrive: both allowed

Key Properties

Burst Support: Tokens accumulate when traffic is low. A user can save up tokens
and use them in a burst. This is the defining feature of token bucket.
Smooth Refill: Tokens are added continuously, not in batches. At 2 tokens/second,
a token is available every 500ms.
Decoupled capacity and rate: capacity controls maximum burst size. rate controls
sustained throughput. You can set capacity=100, rate=10/s to allow large bursts but
sustain only 10 RPS.

Python Implementation

import time
import threading
from dataclasses import dataclass
 
 
@dataclass
class BucketState:
    tokens: float
    last_refill: float
 
 
class TokenBucket:
    """
    Token Bucket Rate Limiter.
    Supports burst traffic while enforcing average rate.
    Used by: AWS API Gateway, Stripe, Nginx (burst param).
    """
 
    def __init__(self, capacity: float, refill_rate: float):
        """
        capacity: maximum tokens in the bucket (also max burst size)
        refill_rate: tokens added per second
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.states: dict[str, BucketState] = {}
        self.lock = threading.Lock()
 
    def _get_state(self, identifier: str) -> BucketState:
        if identifier not in self.states:
            self.states[identifier] = BucketState(
                tokens=self.capacity,
                last_refill=time.time()
            )
        return self.states[identifier]
 
    def _refill(self, state: BucketState, now: float) -> None:
        elapsed = now - state.last_refill
        tokens_to_add = elapsed * self.refill_rate
        state.tokens = min(self.capacity, state.tokens + tokens_to_add)
        state.last_refill = now
 
    def consume(self, identifier: str, tokens: float = 1.0) -> tuple[bool, dict]:
        """
        Attempt to consume 'tokens' from the bucket.
        Returns (allowed, metadata).
        """
        with self.lock:
            now = time.time()
            state = self._get_state(identifier)
            self._refill(state, now)
 
            if state.tokens >= tokens:
                state.tokens -= tokens
                wait_time = 0
                return True, {
                    "allowed": True,
                    "tokens_remaining": state.tokens,
                    "tokens_consumed": tokens
                }
            else:
                # How long until enough tokens are available?
                tokens_needed = tokens - state.tokens
                wait_time = tokens_needed / self.refill_rate
                return False, {
                    "allowed": False,
                    "tokens_remaining": state.tokens,
                    "retry_after_seconds": wait_time
                }
 
    def try_consume_or_wait(
        self, identifier: str,
        tokens: float = 1.0,
        max_wait_seconds: float = 0
    ) -> bool:
        """
        Optionally block/wait until a token is available.
        max_wait_seconds=0 means never wait (pure non-blocking).
        """
        allowed, meta = self.consume(identifier, tokens)
        if allowed:
            return True
 
        retry_after = meta.get("retry_after_seconds", 0)
        if max_wait_seconds > 0 and retry_after <= max_wait_seconds:
            time.sleep(retry_after)
            return self.try_consume_or_wait(identifier, tokens, 0)
        return False
 
 
# Usage Example
bucket = TokenBucket(capacity=10, refill_rate=2)  # 10 burst, 2/second sustained
 
# Fast burst - first 10 allowed
for i in range(12):
    allowed, meta = bucket.consume("user_123")
    print(f"Request {i+1}: {'ALLOW' if allowed else 'DENY'} "
          f"(tokens: {meta.get('tokens_remaining', 0):.1f})")
 
# Wait for tokens to refill
time.sleep(1)
print("\nAfter 1 second (2 tokens refilled):")
for i in range(3):
    allowed, meta = bucket.consume("user_123")
    print(f"Request {i+1}: {'ALLOW' if allowed else 'DENY'} "
          f"(tokens: {meta.get('tokens_remaining', 0):.1f})")

Cost-Based Token Consumption

A powerful feature: different operations consume different numbers of tokens.

class CostBasedTokenBucket(TokenBucket):
    """
    Different operations cost different numbers of tokens.
    Example: GraphQL queries with varying complexity.
    """
    OPERATION_COSTS = {
        "simple_read":    1,
        "list_query":     3,
        "complex_search": 10,
        "export":         25,
        "bulk_delete":    50,
    }
 
    def is_operation_allowed(self, identifier: str, operation: str) -> bool:
        cost = self.OPERATION_COSTS.get(operation, 1)
        allowed, _ = self.consume(identifier, tokens=cost)
        return allowed

Pros and Cons

Pros	Cons
Allows controlled bursting	Slightly more complex than fixed window
Continuous token refill is fair	Requires float arithmetic
Natural fit for sustained rate limits	State per user needed
Supports weighted operations

Best Used For

APIs where users should be allowed to burst occasionally
Rate limiting expensive operations with different costs
SDKs and client libraries
Stripe, AWS API Gateway, and many cloud provider APIs use this

5. Leaky Bucket

How It Works

The leaky bucket analogy: water (requests) enters a bucket from the top. The bucket has
a hole at the bottom that lets water drip out at a constant rate. If the bucket overflows
(is full), additional water (requests) is discarded.

Key properties:

Requests are QUEUED in the bucket (not immediately processed)
Processing happens at a FIXED, CONSTANT rate
If the queue is full, new requests are dropped
Output traffic is always smooth and uniform

Incoming:  |||||  |||||||||||  ||  ||||||||||||||||||||
           (burst)  (burst)        (big burst, some dropped)

Processing: | | | | | | | | | | | | | | | | | | | | |
            (steady, uniform output - all requests processed at same rate)

The key difference from token bucket:

Token bucket: processes requests immediately if tokens available (bursty output OK)
Leaky bucket: always outputs at the same rate regardless of input (smooth output)

Python Implementation (Queue-Based)

import queue
import threading
import time
from typing import Callable, Any
 
 
class LeakyBucket:
    """
    Leaky Bucket Rate Limiter.
    Output rate is always constant regardless of input bursts.
    Used for: Traffic shaping, smoothing output to downstream services.
    """
 
    def __init__(self, capacity: int, leak_rate: float):
        """
        capacity: max requests the queue can hold
        leak_rate: requests processed per second (constant)
        """
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.queue = queue.Queue(maxsize=capacity)
        self.running = False
        self._processor_thread = None
 
    def submit(self, request: Any) -> bool:
        """
        Submit a request to the bucket.
        Returns True if accepted (queued), False if dropped (bucket full).
        """
        try:
            self.queue.put_nowait(request)
            return True
        except queue.Full:
            return False  # Drop the request
 
    def start(self, processor: Callable[[Any], None]) -> None:
        """Start the background processor that leaks at a fixed rate."""
        self.running = True
        self._processor_thread = threading.Thread(
            target=self._leak, args=(processor,), daemon=True
        )
        self._processor_thread.start()
 
    def _leak(self, processor: Callable[[Any], None]) -> None:
        interval = 1.0 / self.leak_rate
        while self.running:
            try:
                request = self.queue.get(timeout=interval)
                processor(request)
            except queue.Empty:
                pass
            time.sleep(interval)
 
    def stop(self) -> None:
        self.running = False
 
 
# Usage Example
def process_request(req):
    print(f"Processing: {req} at {time.time():.2f}")
 
bucket = LeakyBucket(capacity=20, leak_rate=5)  # queue 20, process 5/second
bucket.start(process_request)
 
# Send a burst of 30 requests instantly
for i in range(30):
    accepted = bucket.submit(f"request_{i+1}")
    if not accepted:
        print(f"request_{i+1}: DROPPED (bucket full)")
 
time.sleep(5)  # Watch them drain at 5/second
bucket.stop()

As a Counter (Redis-Compatible)

The leaky bucket can also be implemented as a counter algorithm (without a real queue):

class LeakyBucketCounter:
    """
    Leaky bucket as a counter - useful for Redis implementation.
    Instead of a real queue, tracks the "water level" as a number.
    """
 
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate  # units per second
        self.levels: dict[str, dict] = {}
        self.lock = threading.Lock()
 
    def is_allowed(self, identifier: str) -> bool:
        with self.lock:
            now = time.time()
 
            if identifier not in self.levels:
                self.levels[identifier] = {
                    "level": 0.0,
                    "last_check": now
                }
 
            state = self.levels[identifier]
            elapsed = now - state["last_check"]
 
            # Leak out based on elapsed time
            leaked = elapsed * self.leak_rate
            state["level"] = max(0.0, state["level"] - leaked)
            state["last_check"] = now
 
            # Check if there's room
            if state["level"] + 1 <= self.capacity:
                state["level"] += 1
                return True
            return False

Pros and Cons

Pros	Cons
Perfectly smooth output traffic	Does NOT allow bursting
Protects downstream at a precise rate	Queue adds latency
Simple to reason about output	Requests may wait a long time in queue
Good for traffic shaping	Drop policy discards valid requests

Token Bucket vs Leaky Bucket

Scenario	Use Token Bucket	Use Leaky Bucket
API rate limiting (user-facing)	Yes - allow bursts	Usually not
Calling a payment gateway	Yes - occasional bursts OK	OK if smooth calls needed
Protecting a fragile DB	Maybe	Yes - DB gets predictable load
Traffic shaping at network level	No	Yes
Allowing short spikes	Yes	No

6. GCRA - Generic Cell Rate Algorithm

Overview

GCRA (Generic Cell Rate Algorithm), also known as the Virtual Scheduling Algorithm, originated
in ATM (Asynchronous Transfer Mode) networking. It is a highly precise, elegant algorithm
that provides the same behavior as leaky bucket but with O(1) memory per client.

GCRA is used by Stripe's rate limiter (open-sourced in their Throttled library).

Core Concept: Theoretical Arrival Time (TAT)

The key concept in GCRA is the Theoretical Arrival Time (TAT) - when the next request
should theoretically arrive if traffic is perfectly spaced.

emission_interval = 1 / rate  (time between two consecutive allowed requests)

TAT = max(now, previous_TAT) + emission_interval

If TAT <= now + burst_tolerance:
    Allow the request
    Update TAT
Else:
    Reject (too early)

Example

Rate: 10 requests/second
Emission interval: 100ms
Burst tolerance: 500ms (allows up to 5 extra requests in a burst)

Request 1 at t=0ms:
  TAT = max(0, 0) + 100 = 100ms
  100 <= 0 + 500? -> 100 <= 500 -> YES -> ALLOW, TAT = 100ms

Request 2 at t=0ms (immediately after):
  TAT = max(0, 100) + 100 = 200ms
  200 <= 0 + 500? -> 200 <= 500 -> YES -> ALLOW, TAT = 200ms

Request 6 at t=0ms (6th rapid request):
  TAT = max(0, 500) + 100 = 600ms
  600 <= 0 + 500? -> 600 <= 500? -> NO -> REJECT

Request after 600ms:
  TAT = max(600, 500) + 100 = 700ms
  700 <= 600 + 500? -> 700 <= 1100? -> YES -> ALLOW

Python Implementation

import time
from dataclasses import dataclass
import threading
 
 
class GCRARateLimiter:
    """
    GCRA (Generic Cell Rate Algorithm) Rate Limiter.
    Also known as: Virtual Scheduling Algorithm, Leaky Bucket as a Meter.
    Used by: Stripe's throttled library.
    Memory: O(1) per user (single timestamp).
    """
 
    def __init__(self, rate: float, burst: int):
        """
        rate: requests per second allowed (sustained)
        burst: maximum burst size above the sustained rate
        """
        self.emission_interval = 1.0 / rate  # seconds between requests at sustained rate
        self.burst_offset = burst * self.emission_interval  # TAT tolerance
        self.tat_store: dict[str, float] = {}  # user -> TAT
        self.lock = threading.Lock()
 
    def is_allowed(self, identifier: str) -> tuple[bool, dict]:
        with self.lock:
            now = time.time()
 
            # Get or initialize TAT
            previous_tat = self.tat_store.get(identifier, now)
 
            # New TAT = max(now, previous_TAT) + emission_interval
            new_tat = max(now, previous_tat) + self.emission_interval
 
            # Allowed if new TAT is within burst tolerance
            allowed_at = new_tat - self.burst_offset
 
            if allowed_at <= now:
                self.tat_store[identifier] = new_tat
                return True, {
                    "allowed": True,
                    "next_allowed_in": 0
                }
            else:
                # How long until the request would be allowed?
                wait_time = allowed_at - now
                return False, {
                    "allowed": False,
                    "retry_after_seconds": wait_time
                }
 
 
# Usage
gcra = GCRARateLimiter(rate=10, burst=5)  # 10 RPS sustained, 5 burst
 
# Test burst
for i in range(8):
    allowed, meta = gcra.is_allowed("user_123")
    print(f"Request {i+1}: {'ALLOW' if allowed else f'DENY (retry in {meta[\"retry_after_seconds\"]:.3f}s)'}")

GCRA vs Token Bucket Comparison

Both achieve the same effective behavior, but implement it differently:

Token Bucket:
  - Tracks: (current_tokens, last_refill_time)
  - Memory: 2 values per user
  - Conceptual: "How many tokens do I have?"

GCRA:
  - Tracks: (TAT - theoretical arrival time)
  - Memory: 1 value per user
  - Conceptual: "When should the next request theoretically arrive?"

Pros and Cons

Pros	Cons
Minimal memory: 1 timestamp per user	Less intuitive to understand
Highly precise	Harder to explain to stakeholders
Naturally allows burst	Less widely known
Used in production at Stripe

7. Algorithm Comparison Table

Property	Fixed Window	Sliding Window Log	Sliding Window Counter	Token Bucket	Leaky Bucket	GCRA
Memory per user	O(1)	O(limit)	O(1)	O(1)	O(capacity)	O(1)
Accuracy	Low (boundary bug)	Highest	High (~1% error)	High	High	High
Burst support	No	No	No	Yes	No	Yes (configurable)
Smooth output	No	No	No	No	Yes	No
Implementation complexity	Very low	Medium	Medium	Low	Medium	Medium
Redis-friendly	Yes	Yes (ZSET)	Yes	Yes (Lua)	Yes	Yes (Lua)
Real-world usage	Very common	Rare at scale	Very common	Very common	Niche	Niche (Stripe)
Time complexity	O(1)	O(log n) ZSET ops	O(1)	O(1)	O(1)	O(1)

8. Decision Framework: Choosing the Right Algorithm

Start with these questions:

Q1: Do you need to allow burst traffic?

YES -> Token Bucket or GCRA
NO -> Any algorithm works; prefer Sliding Window Counter for accuracy

Q2: How much memory can you afford per user?

Very constrained -> Fixed Window (1 int) or Token Bucket (2 floats) or GCRA (1 float)
Moderate -> Sliding Window Counter (2 ints + 1 float)
Generous -> Sliding Window Log (but only for low-limit, high-accuracy needs)

Q3: How accurate must the limit be?

Exact accuracy required (payments, security) -> Sliding Window Log with Lua
Approximate is fine (general API) -> Sliding Window Counter or Token Bucket
Rough is fine (DOS protection) -> Fixed Window

Q4: Do you need smooth output to protect a downstream service?

YES -> Leaky Bucket
NO -> Any other algorithm

Q5: How complex can your implementation be?

Simple/quick -> Fixed Window
Standard -> Token Bucket or Sliding Window Counter
Advanced -> GCRA

Recommendation by Use Case

Use Case	Recommended Algorithm	Reason
General REST API rate limiting	Sliding Window Counter	Accuracy + memory efficiency
Public API with burst allowance	Token Bucket	Burst support + simplicity
Security: login brute force	Sliding Window Log	Accuracy is critical
Payment API rate limiting	Token Bucket or GCRA	Precise control
Traffic shaping to a fragile DB	Leaky Bucket	Smooth, predictable output
Simple internal API	Fixed Window	Good enough, very simple
High-volume SDK	Token Bucket	SDK clients need burst support
GraphQL with query cost	Token Bucket (cost-based)	Variable token consumption

Real-World Choices

Company / Product	Algorithm Used
Stripe	GCRA (via Throttled library)
GitHub REST API	Fixed Window (per hour)
GitHub GraphQL API	Token Bucket (point-based)
Cloudflare	Sliding Window Counter
AWS API Gateway	Token Bucket
Nginx	Token Bucket (limit_req with burst)
Kong Gateway	Sliding Window (configurable)
Twitter/X v2 API	Fixed Window per 15-min window

Summary

Fixed Window:          Simple, low memory. Boundary bug allows 2x bursts.
Sliding Window Log:    Accurate, high memory. Best for critical endpoints.
Sliding Window Counter: Accurate enough, low memory. Best general-purpose choice.
Token Bucket:          Allows bursting. Best for user-facing APIs.
Leaky Bucket:          Smooth output. Best for protecting downstream services.
GCRA:                  Like token bucket but 1 value per user. Used by Stripe.

Next: Part 3 - Implementation Guide

See production-ready code for all these algorithms using Redis, Java/Spring Boot,
Python, Node.js, Nginx, and API Gateways.

Series: Rate Limiting Demystified

Rate Limiting Demystified - Part 2: Algorithms Deep Dive

Table of Contents

1. Fixed Window Counter

How It Works

The Boundary Problem (Critical Flaw)

Python Implementation

Redis Implementation

Pros and Cons

Best Used For

2. Sliding Window Log

How It Works

Python Implementation

Redis Implementation (Sorted Sets)

Pros and Cons

Memory Calculation

3. Sliding Window Counter (Hybrid)

How It Works

Visual Explanation

Python Implementation

Pros and Cons

Best Used For

4. Token Bucket

How It Works

Key Properties

Python Implementation

Cost-Based Token Consumption

Pros and Cons

Best Used For

5. Leaky Bucket

How It Works

Python Implementation (Queue-Based)

As a Counter (Redis-Compatible)

Pros and Cons

Token Bucket vs Leaky Bucket

6. GCRA - Generic Cell Rate Algorithm

Overview

Core Concept: Theoretical Arrival Time (TAT)

Example

Python Implementation

GCRA vs Token Bucket Comparison

Pros and Cons

7. Algorithm Comparison Table

8. Decision Framework: Choosing the Right Algorithm

Start with these questions:

Recommendation by Use Case

Real-World Choices

Summary