← Back to Articles
6/6/2026Admin Post

rate limiting part2 algorithms

Rate Limiting Demystified - Part 2: Algorithms Deep Dive

Series Navigation:
Index |
Part 1 - Fundamentals |
Part 3 - Implementation |
Part 4 - Distributed |
Part 5 - Advanced |
Part 6 - Interview Questions


Table of Contents

  1. Fixed Window Counter
  2. Sliding Window Log
  3. Sliding Window Counter (Hybrid)
  4. Token Bucket
  5. Leaky Bucket
  6. GCRA - Generic Cell Rate Algorithm
  7. Algorithm Comparison Table
  8. Decision Framework: Choosing the Right Algorithm

1. Fixed Window Counter

How It Works

Divide time into fixed, discrete windows (e.g., every 60 seconds). Count requests within
each window. If the count exceeds the limit, reject.

Timeline:
|--- Window 1 ---|--- Window 2 ---|--- Window 3 ---|
| 0s        60s  | 60s      120s  | 120s     180s  |
|  count: 100    |  count: 47     |  count: 0      |

At second 0, the window resets to 0. Every request increments the counter. At second 60,
the window resets again.

The Boundary Problem (Critical Flaw)

The fixed window has one famous flaw: a user can send 2x the limit in a short time by
exploiting window boundaries.

Time:    ... 11:59:45  11:59:55  12:00:05  12:00:15 ...
                 |          |         |          |
Window:  Window 1 (limit 100) | Window 2 (limit 100)

Requests:  100 requests at 11:59:50  (end of Window 1, all 100 allowed)
         + 100 requests at 12:00:10  (start of Window 2, all 100 allowed)

Result: 200 requests in 20 seconds!
Effective rate: 10x the intended 1 per second limit.

This is not a theoretical problem - it is exploited in practice.

Python Implementation

import time
from collections import defaultdict
import threading
 
 
class FixedWindowRateLimiter:
    """
    Fixed Window Counter Rate Limiter.
    Thread-safe implementation for single-process use.
    For distributed use, replace with Redis INCR + EXPIRE.
    """
 
    def __init__(self, limit: int, window_size_seconds: int):
        self.limit = limit
        self.window_size = window_size_seconds
        self.counters = defaultdict(int)
        self.lock = threading.Lock()
 
    def _get_window_key(self, identifier: str) -> str:
        current_window = int(time.time()) // self.window_size
        return f"{identifier}:{current_window}"
 
    def is_allowed(self, identifier: str) -> bool:
        """
        Returns True if the request is allowed, False if rate limited.
        """
        with self.lock:
            key = self._get_window_key(identifier)
            self.counters[key] += 1
            return self.counters[key] <= self.limit
 
    def get_remaining(self, identifier: str) -> int:
        with self.lock:
            key = self._get_window_key(identifier)
            used = self.counters.get(key, 0)
            return max(0, self.limit - used)
 
    def get_reset_time(self) -> int:
        """Returns Unix timestamp when the current window resets."""
        current_window = int(time.time()) // self.window_size
        return (current_window + 1) * self.window_size
 
 
# Usage
limiter = FixedWindowRateLimiter(limit=100, window_size_seconds=60)
 
for i in range(105):
    user_id = "user_123"
    if limiter.is_allowed(user_id):
        print(f"Request {i+1}: ALLOWED (remaining: {limiter.get_remaining(user_id)})")
    else:
        print(f"Request {i+1}: REJECTED (resets at: {limiter.get_reset_time()})")

Redis Implementation

import redis
import time
 
 
def is_allowed_fixed_window(
    redis_client: redis.Redis,
    identifier: str,
    limit: int,
    window_seconds: int
) -> tuple[bool, dict]:
    """
    Fixed window rate limiting using Redis INCR + EXPIRE.
    Returns (is_allowed, metadata_dict).
    """
    current_window = int(time.time()) // window_seconds
    key = f"rl:fw:{identifier}:{current_window}"
 
    # Pipeline for atomicity: INCR and EXPIRE in one round trip
    pipe = redis_client.pipeline()
    pipe.incr(key)
    pipe.ttl(key)
    results = pipe.execute()
 
    count = results[0]
    ttl = results[1]
 
    # Set expiry only on first request in this window
    if count == 1:
        redis_client.expire(key, window_seconds * 2)
        ttl = window_seconds
 
    allowed = count <= limit
    return allowed, {
        "limit": limit,
        "remaining": max(0, limit - count),
        "reset_in": ttl,
        "current_count": count
    }

Pros and Cons

ProsCons
Simple to understandBoundary problem allows 2x burst
Very low memory usage (1 counter per user)Not accurate at window edges
O(1) time complexitySudden reset can cause spikes
Easy to implement with Redis INCR

Best Used For

  • Simple internal APIs where boundary accuracy is not critical
  • Memory-constrained environments
  • Quick prototypes

2. Sliding Window Log

How It Works

Instead of counting in discrete windows, keep a log (list) of timestamps of every request.
When a new request arrives:

  1. Remove all timestamps older than now - window_size
  2. Count remaining timestamps
  3. If count < limit, allow the request and add current timestamp
  4. Otherwise, reject
Window = 60 seconds, Limit = 5

Time: 10:00:00
Log: [empty]
Request arrives -> log: [10:00:00] -> count: 1 -> ALLOW

Time: 10:00:30
Log: [10:00:00]
Request arrives -> log: [10:00:00, 10:00:30] -> count: 2 -> ALLOW

Time: 10:01:00
Log: [10:00:00, 10:00:30] -- remove 10:00:00 (it's > 60s old)
Log: [10:00:30]
Request arrives -> log: [10:00:30, 10:01:00] -> count: 2 -> ALLOW

No boundary problem. The window truly slides with every request.

Python Implementation

import time
from collections import defaultdict, deque
import threading
 
 
class SlidingWindowLog:
    """
    Sliding Window Log Rate Limiter.
    Accurate but memory-intensive: O(limit) per user.
    """
 
    def __init__(self, limit: int, window_size_seconds: float):
        self.limit = limit
        self.window_size = window_size_seconds
        self.logs = defaultdict(deque)  # user_id -> deque of timestamps
        self.lock = threading.Lock()
 
    def is_allowed(self, identifier: str) -> tuple[bool, dict]:
        with self.lock:
            now = time.time()
            window_start = now - self.window_size
            log = self.logs[identifier]
 
            # Step 1: Evict timestamps outside the window
            while log and log[0] <= window_start:
                log.popleft()
 
            # Step 2: Count current requests in window
            current_count = len(log)
 
            # Step 3: Check against limit
            if current_count < self.limit:
                log.append(now)
                return True, {
                    "limit": self.limit,
                    "remaining": self.limit - current_count - 1,
                    "current_count": current_count + 1
                }
            else:
                # Calculate when the oldest request will expire
                retry_after = log[0] + self.window_size - now if log else 0
                return False, {
                    "limit": self.limit,
                    "remaining": 0,
                    "retry_after": retry_after,
                    "current_count": current_count
                }

Redis Implementation (Sorted Sets)

import redis
import time
import uuid
 
 
def is_allowed_sliding_window_log(
    redis_client: redis.Redis,
    identifier: str,
    limit: int,
    window_seconds: int
) -> tuple[bool, dict]:
    """
    Sliding window log using Redis Sorted Sets (ZSET).
    Score = timestamp. Members = unique request IDs.
    """
    now = time.time()
    window_start = now - window_seconds
    key = f"rl:swl:{identifier}"
 
    # Use pipeline for batch execution
    pipe = redis_client.pipeline()
    # Remove timestamps outside the window
    pipe.zremrangebyscore(key, 0, window_start)
    # Count requests in window
    pipe.zcard(key)
    # Add current request with timestamp as score
    pipe.zadd(key, {str(uuid.uuid4()): now})
    # Set key expiry to auto-cleanup
    pipe.expire(key, window_seconds * 2)
    results = pipe.execute()
 
    count_before = results[1]  # count BEFORE adding this request
 
    if count_before < limit:
        return True, {
            "limit": limit,
            "remaining": limit - count_before - 1
        }
    else:
        # Remove the request we just added (it was rejected)
        # In production, use a Lua script to make this atomic
        redis_client.zpopmax(key)
        return False, {
            "limit": limit,
            "remaining": 0
        }

Important: The Redis pipeline above is NOT atomic. A concurrent request can sneak in
between steps. Use a Lua script for true atomicity. See Part 3 for the Lua implementation.

Pros and Cons

ProsCons
Perfectly accurate - no boundary problemHigh memory: O(requests) per user
True sliding window behaviorMemory grows with traffic
Fair to all usersCleanup overhead

Memory Calculation

If your limit is 1,000 requests/minute and you have 100,000 users:

  • Each user: 1,000 timestamps x 8 bytes = ~8KB
  • 100,000 users: ~800MB just for rate limit logs

This is why sliding window log is rarely used at scale without the hybrid counter approach.


3. Sliding Window Counter (Hybrid)

How It Works

This is the practical solution that balances accuracy with memory efficiency. It combines
fixed windows with a weighted estimate.

Key insight: Instead of storing every timestamp, store just two counters:

  • Previous window count
  • Current window count

Use a linear interpolation to estimate how many requests occurred in the true sliding window.

Formula:

estimated_count = previous_window_count * (1 - overlap) + current_window_count

where:
  overlap = fraction of current window elapsed = elapsed_time / window_size
  (1 - overlap) = fraction of previous window still "visible"

Visual Explanation

Window size = 60s, Current time = 12:01:15

Previous window: 12:00:00 - 12:01:00 (count: 84)
Current window:  12:01:00 - 12:02:00 (count: 23, elapsed: 15s)

Overlap of previous window = (60 - 15) / 60 = 0.75 (75% of previous window is still in range)

Estimated count = 84 * 0.75 + 23 = 63 + 23 = 86

If limit is 100: 86 < 100 -> ALLOW

This is an approximation with at most 1/limit error rate. For a limit of 1000, the maximum
error is 0.1%. Cloudflare uses this algorithm.

Python Implementation

import time
from dataclasses import dataclass, field
from collections import defaultdict
import threading
 
 
@dataclass
class WindowState:
    prev_count: int = 0
    curr_count: int = 0
    curr_window_start: float = field(default_factory=time.time)
 
 
class SlidingWindowCounter:
    """
    Sliding Window Counter (Hybrid) Rate Limiter.
    Memory: O(1) per user (just two counters).
    Accuracy: Within 0.1% error for typical limits.
    Used by: Cloudflare, many production APIs.
    """
 
    def __init__(self, limit: int, window_size_seconds: float):
        self.limit = limit
        self.window_size = window_size_seconds
        self.states: dict[str, WindowState] = defaultdict(WindowState)
        self.lock = threading.Lock()
 
    def is_allowed(self, identifier: str) -> tuple[bool, dict]:
        with self.lock:
            now = time.time()
            state = self.states[identifier]
 
            # Check if we've moved past the current window
            elapsed = now - state.curr_window_start
 
            if elapsed >= self.window_size:
                # Advance window: current becomes previous
                state.prev_count = state.curr_count
                state.curr_count = 0
                state.curr_window_start = now
                elapsed = 0
 
            # What fraction of the previous window is still in the sliding range?
            # (window_size - elapsed) / window_size
            prev_weight = (self.window_size - elapsed) / self.window_size
 
            # Weighted estimate
            estimated_count = state.prev_count * prev_weight + state.curr_count
 
            if estimated_count < self.limit:
                state.curr_count += 1
                remaining = self.limit - int(estimated_count) - 1
                return True, {
                    "limit": self.limit,
                    "remaining": max(0, remaining),
                    "estimated_count": estimated_count + 1
                }
            else:
                return False, {
                    "limit": self.limit,
                    "remaining": 0,
                    "retry_after": self.window_size - elapsed
                }

Pros and Cons

ProsCons
Memory efficient: O(1) per userSlightly approximate (small error margin)
No boundary spike problemMore complex than fixed window
Good accuracy for most use casesNot perfectly accurate
Easy to implement in RedisNot suitable for very low limits (< 10)

Best Used For

This is the most commonly recommended algorithm for production REST APIs. It is what
Cloudflare uses internally for its rate limiting infrastructure.


4. Token Bucket

How It Works

Imagine a bucket that holds tokens. Rules:

  • The bucket has a maximum capacity of N tokens
  • Tokens are added at a constant rate R (e.g., 10 tokens per second)
  • Each request consumes 1 token (or more for expensive operations)
  • If the bucket has enough tokens, the request is allowed and tokens are consumed
  • If not enough tokens, the request is rejected (or waits)
  • If the bucket is full, new tokens are discarded (not overflow accumulated)
Capacity: 10 tokens
Refill rate: 2 tokens/second

Time 0:   [**********] (10 tokens full)
Request arrives: consume 1 token
Time 0:   [*********.]  (9 tokens)

Time 1s:  [**********] (refilled 2 tokens, but capped at 10)
10 requests arrive rapidly:
Time 1s:  [..........] (0 tokens - all 10 used, all allowed)

11th request at Time 1s: REJECTED (no tokens)

Time 2s:  [**.]       (2 tokens refilled)
2 requests arrive: both allowed

Key Properties

  1. Burst Support: Tokens accumulate when traffic is low. A user can save up tokens
    and use them in a burst. This is the defining feature of token bucket.

  2. Smooth Refill: Tokens are added continuously, not in batches. At 2 tokens/second,
    a token is available every 500ms.

  3. Decoupled capacity and rate: capacity controls maximum burst size. rate controls
    sustained throughput. You can set capacity=100, rate=10/s to allow large bursts but
    sustain only 10 RPS.

Python Implementation

import time
import threading
from dataclasses import dataclass
 
 
@dataclass
class BucketState:
    tokens: float
    last_refill: float
 
 
class TokenBucket:
    """
    Token Bucket Rate Limiter.
    Supports burst traffic while enforcing average rate.
    Used by: AWS API Gateway, Stripe, Nginx (burst param).
    """
 
    def __init__(self, capacity: float, refill_rate: float):
        """
        capacity: maximum tokens in the bucket (also max burst size)
        refill_rate: tokens added per second
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.states: dict[str, BucketState] = {}
        self.lock = threading.Lock()
 
    def _get_state(self, identifier: str) -> BucketState:
        if identifier not in self.states:
            self.states[identifier] = BucketState(
                tokens=self.capacity,
                last_refill=time.time()
            )
        return self.states[identifier]
 
    def _refill(self, state: BucketState, now: float) -> None:
        elapsed = now - state.last_refill
        tokens_to_add = elapsed * self.refill_rate
        state.tokens = min(self.capacity, state.tokens + tokens_to_add)
        state.last_refill = now
 
    def consume(self, identifier: str, tokens: float = 1.0) -> tuple[bool, dict]:
        """
        Attempt to consume 'tokens' from the bucket.
        Returns (allowed, metadata).
        """
        with self.lock:
            now = time.time()
            state = self._get_state(identifier)
            self._refill(state, now)
 
            if state.tokens >= tokens:
                state.tokens -= tokens
                wait_time = 0
                return True, {
                    "allowed": True,
                    "tokens_remaining": state.tokens,
                    "tokens_consumed": tokens
                }
            else:
                # How long until enough tokens are available?
                tokens_needed = tokens - state.tokens
                wait_time = tokens_needed / self.refill_rate
                return False, {
                    "allowed": False,
                    "tokens_remaining": state.tokens,
                    "retry_after_seconds": wait_time
                }
 
    def try_consume_or_wait(
        self, identifier: str,
        tokens: float = 1.0,
        max_wait_seconds: float = 0
    ) -> bool:
        """
        Optionally block/wait until a token is available.
        max_wait_seconds=0 means never wait (pure non-blocking).
        """
        allowed, meta = self.consume(identifier, tokens)
        if allowed:
            return True
 
        retry_after = meta.get("retry_after_seconds", 0)
        if max_wait_seconds > 0 and retry_after <= max_wait_seconds:
            time.sleep(retry_after)
            return self.try_consume_or_wait(identifier, tokens, 0)
        return False
 
 
# Usage Example
bucket = TokenBucket(capacity=10, refill_rate=2)  # 10 burst, 2/second sustained
 
# Fast burst - first 10 allowed
for i in range(12):
    allowed, meta = bucket.consume("user_123")
    print(f"Request {i+1}: {'ALLOW' if allowed else 'DENY'} "
          f"(tokens: {meta.get('tokens_remaining', 0):.1f})")
 
# Wait for tokens to refill
time.sleep(1)
print("\nAfter 1 second (2 tokens refilled):")
for i in range(3):
    allowed, meta = bucket.consume("user_123")
    print(f"Request {i+1}: {'ALLOW' if allowed else 'DENY'} "
          f"(tokens: {meta.get('tokens_remaining', 0):.1f})")

Cost-Based Token Consumption

A powerful feature: different operations consume different numbers of tokens.

class CostBasedTokenBucket(TokenBucket):
    """
    Different operations cost different numbers of tokens.
    Example: GraphQL queries with varying complexity.
    """
    OPERATION_COSTS = {
        "simple_read":    1,
        "list_query":     3,
        "complex_search": 10,
        "export":         25,
        "bulk_delete":    50,
    }
 
    def is_operation_allowed(self, identifier: str, operation: str) -> bool:
        cost = self.OPERATION_COSTS.get(operation, 1)
        allowed, _ = self.consume(identifier, tokens=cost)
        return allowed

Pros and Cons

ProsCons
Allows controlled burstingSlightly more complex than fixed window
Continuous token refill is fairRequires float arithmetic
Natural fit for sustained rate limitsState per user needed
Supports weighted operations

Best Used For

  • APIs where users should be allowed to burst occasionally
  • Rate limiting expensive operations with different costs
  • SDKs and client libraries
  • Stripe, AWS API Gateway, and many cloud provider APIs use this

5. Leaky Bucket

How It Works

The leaky bucket analogy: water (requests) enters a bucket from the top. The bucket has
a hole at the bottom that lets water drip out at a constant rate. If the bucket overflows
(is full), additional water (requests) is discarded.

Key properties:

  • Requests are QUEUED in the bucket (not immediately processed)
  • Processing happens at a FIXED, CONSTANT rate
  • If the queue is full, new requests are dropped
  • Output traffic is always smooth and uniform
Incoming:  |||||  |||||||||||  ||  ||||||||||||||||||||
           (burst)  (burst)        (big burst, some dropped)

Processing: | | | | | | | | | | | | | | | | | | | | |
            (steady, uniform output - all requests processed at same rate)

The key difference from token bucket:

  • Token bucket: processes requests immediately if tokens available (bursty output OK)
  • Leaky bucket: always outputs at the same rate regardless of input (smooth output)

Python Implementation (Queue-Based)

import queue
import threading
import time
from typing import Callable, Any
 
 
class LeakyBucket:
    """
    Leaky Bucket Rate Limiter.
    Output rate is always constant regardless of input bursts.
    Used for: Traffic shaping, smoothing output to downstream services.
    """
 
    def __init__(self, capacity: int, leak_rate: float):
        """
        capacity: max requests the queue can hold
        leak_rate: requests processed per second (constant)
        """
        self.capacity = capacity
        self.leak_rate = leak_rate
        self.queue = queue.Queue(maxsize=capacity)
        self.running = False
        self._processor_thread = None
 
    def submit(self, request: Any) -> bool:
        """
        Submit a request to the bucket.
        Returns True if accepted (queued), False if dropped (bucket full).
        """
        try:
            self.queue.put_nowait(request)
            return True
        except queue.Full:
            return False  # Drop the request
 
    def start(self, processor: Callable[[Any], None]) -> None:
        """Start the background processor that leaks at a fixed rate."""
        self.running = True
        self._processor_thread = threading.Thread(
            target=self._leak, args=(processor,), daemon=True
        )
        self._processor_thread.start()
 
    def _leak(self, processor: Callable[[Any], None]) -> None:
        interval = 1.0 / self.leak_rate
        while self.running:
            try:
                request = self.queue.get(timeout=interval)
                processor(request)
            except queue.Empty:
                pass
            time.sleep(interval)
 
    def stop(self) -> None:
        self.running = False
 
 
# Usage Example
def process_request(req):
    print(f"Processing: {req} at {time.time():.2f}")
 
bucket = LeakyBucket(capacity=20, leak_rate=5)  # queue 20, process 5/second
bucket.start(process_request)
 
# Send a burst of 30 requests instantly
for i in range(30):
    accepted = bucket.submit(f"request_{i+1}")
    if not accepted:
        print(f"request_{i+1}: DROPPED (bucket full)")
 
time.sleep(5)  # Watch them drain at 5/second
bucket.stop()

As a Counter (Redis-Compatible)

The leaky bucket can also be implemented as a counter algorithm (without a real queue):

class LeakyBucketCounter:
    """
    Leaky bucket as a counter - useful for Redis implementation.
    Instead of a real queue, tracks the "water level" as a number.
    """
 
    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity
        self.leak_rate = leak_rate  # units per second
        self.levels: dict[str, dict] = {}
        self.lock = threading.Lock()
 
    def is_allowed(self, identifier: str) -> bool:
        with self.lock:
            now = time.time()
 
            if identifier not in self.levels:
                self.levels[identifier] = {
                    "level": 0.0,
                    "last_check": now
                }
 
            state = self.levels[identifier]
            elapsed = now - state["last_check"]
 
            # Leak out based on elapsed time
            leaked = elapsed * self.leak_rate
            state["level"] = max(0.0, state["level"] - leaked)
            state["last_check"] = now
 
            # Check if there's room
            if state["level"] + 1 <= self.capacity:
                state["level"] += 1
                return True
            return False

Pros and Cons

ProsCons
Perfectly smooth output trafficDoes NOT allow bursting
Protects downstream at a precise rateQueue adds latency
Simple to reason about outputRequests may wait a long time in queue
Good for traffic shapingDrop policy discards valid requests

Token Bucket vs Leaky Bucket

ScenarioUse Token BucketUse Leaky Bucket
API rate limiting (user-facing)Yes - allow burstsUsually not
Calling a payment gatewayYes - occasional bursts OKOK if smooth calls needed
Protecting a fragile DBMaybeYes - DB gets predictable load
Traffic shaping at network levelNoYes
Allowing short spikesYesNo

6. GCRA - Generic Cell Rate Algorithm

Overview

GCRA (Generic Cell Rate Algorithm), also known as the Virtual Scheduling Algorithm, originated
in ATM (Asynchronous Transfer Mode) networking. It is a highly precise, elegant algorithm
that provides the same behavior as leaky bucket but with O(1) memory per client.

GCRA is used by Stripe's rate limiter (open-sourced in their Throttled library).

Core Concept: Theoretical Arrival Time (TAT)

The key concept in GCRA is the Theoretical Arrival Time (TAT) - when the next request
should theoretically arrive if traffic is perfectly spaced.

emission_interval = 1 / rate  (time between two consecutive allowed requests)

TAT = max(now, previous_TAT) + emission_interval

If TAT <= now + burst_tolerance:
    Allow the request
    Update TAT
Else:
    Reject (too early)

Example

Rate: 10 requests/second
Emission interval: 100ms
Burst tolerance: 500ms (allows up to 5 extra requests in a burst)

Request 1 at t=0ms:
  TAT = max(0, 0) + 100 = 100ms
  100 <= 0 + 500? -> 100 <= 500 -> YES -> ALLOW, TAT = 100ms

Request 2 at t=0ms (immediately after):
  TAT = max(0, 100) + 100 = 200ms
  200 <= 0 + 500? -> 200 <= 500 -> YES -> ALLOW, TAT = 200ms

Request 6 at t=0ms (6th rapid request):
  TAT = max(0, 500) + 100 = 600ms
  600 <= 0 + 500? -> 600 <= 500? -> NO -> REJECT

Request after 600ms:
  TAT = max(600, 500) + 100 = 700ms
  700 <= 600 + 500? -> 700 <= 1100? -> YES -> ALLOW

Python Implementation

import time
from dataclasses import dataclass
import threading
 
 
class GCRARateLimiter:
    """
    GCRA (Generic Cell Rate Algorithm) Rate Limiter.
    Also known as: Virtual Scheduling Algorithm, Leaky Bucket as a Meter.
    Used by: Stripe's throttled library.
    Memory: O(1) per user (single timestamp).
    """
 
    def __init__(self, rate: float, burst: int):
        """
        rate: requests per second allowed (sustained)
        burst: maximum burst size above the sustained rate
        """
        self.emission_interval = 1.0 / rate  # seconds between requests at sustained rate
        self.burst_offset = burst * self.emission_interval  # TAT tolerance
        self.tat_store: dict[str, float] = {}  # user -> TAT
        self.lock = threading.Lock()
 
    def is_allowed(self, identifier: str) -> tuple[bool, dict]:
        with self.lock:
            now = time.time()
 
            # Get or initialize TAT
            previous_tat = self.tat_store.get(identifier, now)
 
            # New TAT = max(now, previous_TAT) + emission_interval
            new_tat = max(now, previous_tat) + self.emission_interval
 
            # Allowed if new TAT is within burst tolerance
            allowed_at = new_tat - self.burst_offset
 
            if allowed_at <= now:
                self.tat_store[identifier] = new_tat
                return True, {
                    "allowed": True,
                    "next_allowed_in": 0
                }
            else:
                # How long until the request would be allowed?
                wait_time = allowed_at - now
                return False, {
                    "allowed": False,
                    "retry_after_seconds": wait_time
                }
 
 
# Usage
gcra = GCRARateLimiter(rate=10, burst=5)  # 10 RPS sustained, 5 burst
 
# Test burst
for i in range(8):
    allowed, meta = gcra.is_allowed("user_123")
    print(f"Request {i+1}: {'ALLOW' if allowed else f'DENY (retry in {meta[\"retry_after_seconds\"]:.3f}s)'}")

GCRA vs Token Bucket Comparison

Both achieve the same effective behavior, but implement it differently:

Token Bucket:
  - Tracks: (current_tokens, last_refill_time)
  - Memory: 2 values per user
  - Conceptual: "How many tokens do I have?"

GCRA:
  - Tracks: (TAT - theoretical arrival time)
  - Memory: 1 value per user
  - Conceptual: "When should the next request theoretically arrive?"

Pros and Cons

ProsCons
Minimal memory: 1 timestamp per userLess intuitive to understand
Highly preciseHarder to explain to stakeholders
Naturally allows burstLess widely known
Used in production at Stripe

7. Algorithm Comparison Table

PropertyFixed WindowSliding Window LogSliding Window CounterToken BucketLeaky BucketGCRA
Memory per userO(1)O(limit)O(1)O(1)O(capacity)O(1)
AccuracyLow (boundary bug)HighestHigh (~1% error)HighHighHigh
Burst supportNoNoNoYesNoYes (configurable)
Smooth outputNoNoNoNoYesNo
Implementation complexityVery lowMediumMediumLowMediumMedium
Redis-friendlyYesYes (ZSET)YesYes (Lua)YesYes (Lua)
Real-world usageVery commonRare at scaleVery commonVery commonNicheNiche (Stripe)
Time complexityO(1)O(log n) ZSET opsO(1)O(1)O(1)O(1)

8. Decision Framework: Choosing the Right Algorithm

Start with these questions:

Q1: Do you need to allow burst traffic?

  • YES -> Token Bucket or GCRA
  • NO -> Any algorithm works; prefer Sliding Window Counter for accuracy

Q2: How much memory can you afford per user?

  • Very constrained -> Fixed Window (1 int) or Token Bucket (2 floats) or GCRA (1 float)
  • Moderate -> Sliding Window Counter (2 ints + 1 float)
  • Generous -> Sliding Window Log (but only for low-limit, high-accuracy needs)

Q3: How accurate must the limit be?

  • Exact accuracy required (payments, security) -> Sliding Window Log with Lua
  • Approximate is fine (general API) -> Sliding Window Counter or Token Bucket
  • Rough is fine (DOS protection) -> Fixed Window

Q4: Do you need smooth output to protect a downstream service?

  • YES -> Leaky Bucket
  • NO -> Any other algorithm

Q5: How complex can your implementation be?

  • Simple/quick -> Fixed Window
  • Standard -> Token Bucket or Sliding Window Counter
  • Advanced -> GCRA

Recommendation by Use Case

Use CaseRecommended AlgorithmReason
General REST API rate limitingSliding Window CounterAccuracy + memory efficiency
Public API with burst allowanceToken BucketBurst support + simplicity
Security: login brute forceSliding Window LogAccuracy is critical
Payment API rate limitingToken Bucket or GCRAPrecise control
Traffic shaping to a fragile DBLeaky BucketSmooth, predictable output
Simple internal APIFixed WindowGood enough, very simple
High-volume SDKToken BucketSDK clients need burst support
GraphQL with query costToken Bucket (cost-based)Variable token consumption

Real-World Choices

Company / ProductAlgorithm Used
StripeGCRA (via Throttled library)
GitHub REST APIFixed Window (per hour)
GitHub GraphQL APIToken Bucket (point-based)
CloudflareSliding Window Counter
AWS API GatewayToken Bucket
NginxToken Bucket (limit_req with burst)
Kong GatewaySliding Window (configurable)
Twitter/X v2 APIFixed Window per 15-min window

Summary

Fixed Window:          Simple, low memory. Boundary bug allows 2x bursts.
Sliding Window Log:    Accurate, high memory. Best for critical endpoints.
Sliding Window Counter: Accurate enough, low memory. Best general-purpose choice.
Token Bucket:          Allows bursting. Best for user-facing APIs.
Leaky Bucket:          Smooth output. Best for protecting downstream services.
GCRA:                  Like token bucket but 1 value per user. Used by Stripe.

Next: Part 3 - Implementation Guide

See production-ready code for all these algorithms using Redis, Java/Spring Boot,
Python, Node.js, Nginx, and API Gateways.