6/6/2026 | Admin Post


Rate Limiting - Supplement 4: Architecture Patterns and Decision-Making

Series Navigation:
Main Index |
Supplement 1 - Anti-Patterns Extended |
Supplement 2 - Production Challenges |
Supplement 3 - Trade-Offs Decision Guide

Ten named architecture patterns used by real production systems.
Each pattern includes: Problem → Solution → Architecture Diagram → Code →
ADR (Architecture Decision Record) → Trade-offs → When/When-Not to Use.


Table of Contents

  1. Pattern 1: Gateway Sentinel
  2. Pattern 2: Layered Defense
  3. Pattern 3: Quota Cascade
  4. Pattern 4: Shadow Enforcement
  5. Pattern 5: Adaptive Throttle
  6. Pattern 6: Cost-Weighted Bucket
  7. Pattern 7: Tenant-Isolated Pool
  8. Pattern 8: Idempotency Shield
  9. Pattern 9: Sidecar Enforcer
  10. Pattern 10: Hybrid Approximate
  11. ADR Template Reference
  12. Pattern Comparison Matrix

Pattern 1: Gateway Sentinel

Problem Context

Your organization has dozens of microservices. Each team is reinventing rate limiting in their service using different libraries, different Redis key formats, and inconsistent policies. Some services have rate limiting, some don't. There is no unified enforcement or visibility.

Solution

Centralize ALL rate limiting logic at the API Gateway layer. Every request must pass through the gateway before reaching any service. Rate limit decisions are made once, consistently, with full observability in one place.

Architecture Diagram

                    ┌──────────────────────────────────────────┐
                    │             INTERNET / CLIENTS            │
                    └───────────────────┬──────────────────────┘
                                        │
                    ┌───────────────────▼──────────────────────┐
                    │               API GATEWAY                  │
                    │                                            │
                    │   ┌────────────┐    ┌─────────────────┐  │
                    │   │ Rate Limit │───▶│  Redis Cluster  │  │
                    │   │  Engine    │    │  (shared state) │  │
                    │   └─────┬──────┘    └─────────────────┘  │
                    │         │                                  │
                    │   ┌─────▼──────┐                         │
                    │   │ 429 or     │                         │
                    │   │ Route to   │                         │
                    │   │ Service    │                         │
                    └───┴─────┬──────┴─────────────────────────┘
                               │
              ┌────────────────┼──────────────────┐
              │                │                  │
    ┌─────────▼──────┐ ┌───────▼─────┐ ┌─────────▼──────┐
    │   Service A     │ │  Service B  │ │   Service C     │
    │  (no rate limit │ │ (no rate    │ │  (no rate       │
    │   needed here)  │ │  limit)     │ │   limit)        │
    └─────────────────┘ └─────────────┘ └─────────────────┘

Implementation (Kong Declarative)

# kong.yml - All rate limiting defined centrally
_format_version: "3.0"
 
consumers:
  - username: free-tier
    custom_id: tier_free
  - username: pro-tier
    custom_id: tier_pro
  - username: enterprise-tier
    custom_id: tier_enterprise
 
plugins:
  # Global default - applies to ALL routes unless overridden
  - name: rate-limiting-advanced
    config:
      limit: [60]
      window_size: [60]
      sync_rate: 5 # sync with Redis every 5 seconds
      strategy: redis
      redis:
        host: redis-master
        port: 6379
        database: 0
      hide_client_headers: false
      error_message: "Rate limit exceeded. See https://docs.example.com/rate-limits"
 
  # Service-specific override: authentication (stricter)
  - name: rate-limiting-advanced
    service: auth-service
    config:
      limit: [5, 20] # 5/min AND 20/hour
      window_size: [60, 3600]
      strategy: redis
 
  # Consumer-specific: Pro tier gets 10x
  - name: rate-limiting-advanced
    consumer: pro-tier
    config:
      limit: [600]
      window_size: [60]
      strategy: redis
 
  # Consumer-specific: Enterprise tier (very high)
  - name: rate-limiting-advanced
    consumer: enterprise-tier
    config:
      limit: [10000]
      window_size: [60]
      strategy: redis

ADR-001: Gateway Sentinel

Date: 2024-01-15
Status: Accepted

Context:

  • 12 microservices, 8 teams, zero consistent rate limiting
  • Security audit revealed 3 services with no limits on sensitive endpoints
  • Engineers spending time reimplementing the same patterns

Decision:
Consolidate all rate limiting at the API Gateway (Kong) backed by shared Redis.
Individual services MUST NOT implement their own rate limiting (lint rule enforced).

Consequences — Positive:

  • Single place to audit and change rate limit policies
  • Consistent response format (HTTP 429 + Retry-After) across all services
  • Ops team can change limits without service deployments
  • End-to-end visibility in one dashboard

Consequences — Negative:

  • Gateway becomes a critical single point of failure for rate limiting
  • Cannot implement business-logic-aware limits (tier from DB)
  • Teams lose autonomy to tune their own endpoints

Mitigation:

  • Gateway in active-active HA configuration
  • Services still validate authentication (gateway only does auth extraction)
  • Business-logic limits added as a second layer in 2 critical services

Trade-Offs

Pro                               | Con
Zero rate-limit code in services  | No access to service business context
Centralized policy management     | Gateway is a SPOF for RL logic
Consistent UX across all APIs     | Must redeploy/reconfigure gateway for limit changes
One place to monitor and alert    | Hard to test rate limits in local dev

When to Use

  • Multi-service organization with inconsistent RL implementations
  • Developer-facing API product (developer portal + usage dashboards)
  • Teams lack expertise to implement correct distributed rate limiting

When NOT to Use

  • Your service has complex business logic requirements for limits
  • You want zero latency overhead (gateway adds ~2ms)
  • Internal-only microservices where enforcement is overkill

Pattern 2: Layered Defense

Problem Context

A single rate limiting layer can be bypassed or overwhelmed. An IP rotation attack defeats user-level limits. A DDoS at Layer 3 overwhelms application-level checks. You need defense in depth.

Solution

Implement independent rate limiting at multiple layers. Each layer catches a different class of abuse:

  • CDN/Edge: Volumetric DDoS, geographic abuse, known bad IPs
  • Load Balancer: IP-level flood protection, basic bot mitigation
  • API Gateway: API key or consumer-level limits
  • Application: User-tier-aware, endpoint-specific, business logic

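If an attacker bypasses one layer, subsequent layers still protect the system. The diagram and code in this section show a three-layer variant (CDN, load balancer, application middleware) without a separate API gateway tier.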
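
A short sketch can show why layers keyed on different identifiers catch different abuse. The `Layer` class, the `run_layers` helper, and the limits are hypothetical and chosen only for illustration; each layer here is a single-window counter rather than a production limiter.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Layer:
    """One independent limiter: counts requests per key within a single window."""
    name: str
    key_field: str          # which request attribute this layer counts by
    limit: int
    counts: Counter = field(default_factory=Counter)

    def allow(self, request: dict) -> bool:
        key = request[self.key_field]
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


def run_layers(layers, request):
    """Return the name of the first layer that rejects, or None if all pass."""
    for layer in layers:
        if not layer.allow(request):
            return layer.name
    return None


# Illustrative limits only (not taken from a real deployment)
layers = [
    Layer("load_balancer", key_field="ip", limit=200),
    Layer("application", key_field="user_id", limit=60),
]

# One account rotating through 100 IPs, a single request per IP
outcomes = [
    run_layers(layers, {"ip": f"10.0.0.{i}", "user_id": "attacker"})
    for i in range(100)
]
```

Every request passes the per-IP layer because each IP is used only once, but the per-user layer rejects everything after the 60th request. The reverse also holds: an unauthenticated flood from one IP never reaches the per-user check because the per-IP layer stops it first.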

Architecture Diagram

INTERNET
   │
   ▼
┌──────────────────────────────────────────┐
│  LAYER 1: CDN (Cloudflare)               │
│  - DDoS mitigation                       │ ← Blocks: Volumetric floods,
│  - IP reputation blocking                │           known bad IPs,
│  - Geographic restrictions               │           geographic violations
│  - Rate: 1000 req/5min per IP            │
└──────────────────┬───────────────────────┘
                   │ (attacks blocked above don't reach here)
                   ▼
┌──────────────────────────────────────────┐
│  LAYER 2: Load Balancer (Nginx)          │
│  - Per-IP: 200 req/min                   │ ← Blocks: IP floods that passed CDN,
│  - Per-IP burst: 50                      │           layer 7 DDoS attempts
│  - Connection limit: 100 concurrent/IP   │
└──────────────────┬───────────────────────┘
                   │ (only reasonable-rate IPs reach here)
                   ▼
┌──────────────────────────────────────────┐
│  LAYER 3: Application Middleware         │
│  - Per-user: tier-based limit            │ ← Blocks: Authenticated abuse,
│  - Per-endpoint: cost-weighted           │           quota exhaustion,
│  - Per-feature-flag: new features        │           insider threats
│  - Business logic (user tier)            │
└──────────────────┬───────────────────────┘
                   │ (only valid, non-rate-limited requests reach here)
                   ▼
            ┌──────────────┐
            │  SERVICE     │
            │  (business   │
            │   logic)     │
            └──────────────┘

Implementation

# Application layer (after CDN and Nginx have already filtered)
from dataclasses import dataclass
from typing import Optional
 
@dataclass
class RateLimitConfig:
    identifier: str
    limit: int
    window_seconds: int
    layer: str
 
class LayeredDefenseMiddleware:
    """
    Application-layer component of a three-layer defense.
    CDN and Nginx provide layers 1 and 2.
    This provides the business-context-aware layer 3.
    """
 
    def __init__(self, redis_client, user_service):
        self.redis = redis_client
        self.user_service = user_service
 
    def get_applicable_limits(self, request) -> list[RateLimitConfig]:
        limits = []
        user_id = getattr(request.state, "user_id", None)
        endpoint = request.url.path
        method = request.method
 
        if user_id:
            user = self.user_service.get_cached(user_id)
            tier_limit = {
                "free": 60,
                "starter": 300,
                "pro": 1000,
                "enterprise": 10000
            }.get(user.tier, 60)
 
            limits.append(RateLimitConfig(
                identifier=f"user:{user_id}",
                limit=tier_limit,
                window_seconds=60,
                layer="user_tier"
            ))
 
            # Endpoint-specific limit (e.g., expensive ML endpoint)
            if endpoint.startswith("/api/v1/analyze"):
                limits.append(RateLimitConfig(
                    identifier=f"user:{user_id}:analyze",
                    limit=max(1, tier_limit // 10),  # 10% of normal limit
                    window_seconds=60,
                    layer="endpoint_specific"
                ))
 
        # Always check global system limit (backstop)
        limits.append(RateLimitConfig(
            identifier="global:system",
            limit=500_000,
            window_seconds=60,
            layer="global_backstop"
        ))
 
        return limits
 
    async def __call__(self, request, call_next):
        if not should_rate_limit(request):
            return await call_next(request)
 
        configs = self.get_applicable_limits(request)
 
        for config in configs:
            result = self._check_redis(config)
            if not result["allowed"]:
                return build_429_response(result, config.layer)
 
        return await call_next(request)
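
The middleware calls `self._check_redis(config)` without showing it. One possible shape is a fixed-window counter built on INCR and EXPIRE, sketched below as a standalone function. The names `check_fixed_window` and `FakeRedis` are hypothetical; `FakeRedis` exists only so the sketch runs without a server and mimics just the three client calls used (`incr`, `expire`, `ttl`).

```python
class FakeRedis:
    """In-memory stand-in so the sketch runs without a server (hypothetical).
    It mimics only the three client calls used below and never expires keys."""

    def __init__(self):
        self.values, self.ttls = {}, {}

    def incr(self, key):
        self.values[key] = self.values.get(key, 0) + 1
        return self.values[key]

    def expire(self, key, seconds):
        self.ttls[key] = seconds
        return True

    def ttl(self, key):
        return self.ttls.get(key, -1)


def check_fixed_window(redis_client, identifier, limit, window_seconds):
    """Count this request in the current window and report whether it fits."""
    key = f"rl:{identifier}:{window_seconds}"
    current = redis_client.incr(key)
    if current == 1:
        # First hit in this window starts the expiry clock
        redis_client.expire(key, window_seconds)
    return {
        "allowed": current <= limit,
        "limit": limit,
        "current": current,
        "retry_after": max(redis_client.ttl(key), 0),
    }


client = FakeRedis()
results = [
    check_fixed_window(client, "user:42", limit=3, window_seconds=60)
    for _ in range(4)
]
```

Two caveats apply to this sketch. INCR followed by EXPIRE is not atomic, so a crash between the two calls can leave a counter without a TTL; wrapping both in a Lua script closes that gap. Rejected requests also still increment the counter, which keeps a persistent abuser blocked but means `current` can exceed `limit`.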

ADR-002: Layered Defense

Date: 2024-03-10
Status: Accepted

Context:

  • Single application-layer rate limiter was being overwhelmed by botnets
  • IP rotation attacks with thousands of IPs bypassed per-IP limits
  • Infrastructure cost spiked during attacks even though business logic blocked abusive users

Decision:
Add CDN-level and Nginx-level rate limiting in front of the application.
Each layer operates independently with its own Redis or local state.

Consequences — Positive:

  • 95% of attack traffic blocked before hitting application servers
  • CDN/Nginx layers add 0ms application latency overhead
  • Reduces application compute cost during attacks
  • Each layer can be tuned independently

Consequences — Negative:

  • Three different configuration locations to manage
  • Risk of inconsistency between layers
  • Debugging requires checking all three layers

When to Use

  • Any public-facing system that has experienced or anticipates DDoS/abuse
  • APIs with expensive compute where even processing rate-limited requests costs money
  • When application-layer rate limiting alone isn't fast enough to stop floods

When NOT to Use

  • Purely internal services with trusted callers
  • Simple intranet tools
  • When a CDN is cost-prohibitive

Pattern 3: Quota Cascade

Problem Context

You have a multi-tenant SaaS with organizations containing teams containing users. An organization purchases 100,000 API calls/month. The admin needs to allocate portions to teams, and teams allocate to users. When a quota is exceeded at any level, requests are rejected.

Solution

Implement hierarchical quota management where limits cascade from organization → team → user. Each level has its own quota counter. A request must pass ALL levels. When a quota is refilled at the top, it flows down (but teams/users retain their sub-allocations).

Architecture Diagram

Organization: ACME Corp
├── Monthly Quota: 100,000 calls
│   │
│   ├── Team: Engineering         [Allocated: 60,000]
│   │   ├── User: alice@acme.com  [Allocated: 20,000]
│   │   ├── User: bob@acme.com    [Allocated: 20,000]
│   │   └── User: carol@acme.com  [Allocated: 20,000]
│   │
│   └── Team: Marketing           [Allocated: 40,000]
│       ├── User: dave@acme.com   [Allocated: 15,000]
│       └── User: eve@acme.com    [Allocated: 25,000]
│
│   Request from alice:
│   CHECK 1: alice used 19,999 of 20,000 -> PASS (1 remaining)
│   CHECK 2: Engineering used 59,999 of 60,000 -> PASS (1 remaining)
│   CHECK 3: ACME used 99,999 of 100,000 -> PASS (1 remaining)
│   REQUEST ALLOWED. All 3 counters incremented.
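
The all-or-nothing rule in the walkthrough above can be modelled in a few lines of plain Python. This is an in-process sketch using the numbers from the diagram; the function name and dict-based counters are assumptions for illustration, and a single-process dict gives none of the cross-server atomicity a shared store would need.

```python
def check_and_consume(usage, limits, levels):
    """All-or-nothing check across quota levels.

    usage / limits map a level key to calls used / calls allocated.
    Returns (allowed, blocked_level); counters change only when every level passes.
    """
    for level in levels:
        if usage.get(level, 0) >= limits[level]:
            return False, level
    for level in levels:
        usage[level] = usage.get(level, 0) + 1
    return True, None


limits = {"user:alice": 20_000, "team:engineering": 60_000, "org:acme": 100_000}
usage = {"user:alice": 19_999, "team:engineering": 59_999, "org:acme": 99_999}
levels = ["user:alice", "team:engineering", "org:acme"]

first = check_and_consume(usage, limits, levels)    # last call within quota
second = check_and_consume(usage, limits, levels)   # every level is now full
```

The first call passes all three checks and increments all three counters. The second is rejected at the user level and leaves every counter untouched, which is the property that prevents a blocked request from silently eating team or organization quota.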

Implementation

from typing import Optional
import json
import redis
import time
 
class QuotaCascade:
    """
    Multi-level quota enforcement: User -> Team -> Organization
    Uses a Redis Lua script to check and increment all levels atomically.
    """
 
    def __init__(self, redis_client: redis.Redis, db):
        self.redis = redis_client
        self.db = db  # database for allocation lookup
 
    def get_quota_hierarchy(self, user_id: str, org_id: str, team_id: str) -> list[dict]:
        """Build the quota check hierarchy for this request."""
        # Cache quota allocations to avoid DB hit per-request
        cache_key = f"quota_alloc:{org_id}:{team_id}:{user_id}"
        cached = self.redis.get(cache_key)
 
        if cached:
            # JSON, never eval(): cached data must not be executed as code
            alloc = json.loads(cached)
        else:
            # Assumes the allocation record is a JSON-serializable dict
            alloc = self.db.get_quota_allocations(org_id, team_id, user_id)
            self.redis.setex(cache_key, 300, json.dumps(alloc))  # cache 5 minutes
 
        period = self.current_billing_period()
        return [
            {
                "key": f"quota:{period}:org:{org_id}",
                "limit": alloc["org_monthly"],
                "level": "organization",
                "entity": org_id,
            },
            {
                "key": f"quota:{period}:team:{org_id}:{team_id}",
                "limit": alloc["team_monthly"],
                "level": "team",
                "entity": team_id,
            },
            {
                "key": f"quota:{period}:user:{user_id}",
                "limit": alloc["user_monthly"],
                "level": "user",
                "entity": user_id,
            },
        ]
 
    # Lua script: check and increment all quota levels atomically
    LUA_CASCADE_CHECK = """
    local results = {}
    for i = 1, #KEYS do
        local current = tonumber(redis.call('GET', KEYS[i]) or 0)
        local limit = tonumber(ARGV[i])
        if current >= limit then
            results[i] = 0  -- exceeded at this level
        else
            results[i] = 1  -- allowed at this level
        end
    end
 
    -- Only increment if ALL levels pass
    local all_pass = true
    for i = 1, #results do
        if results[i] == 0 then
            all_pass = false
            break
        end
    end
 
    if all_pass then
        for i = 1, #KEYS do
            redis.call('INCR', KEYS[i])
            -- Set TTL to end of billing period if key is new
            if redis.call('TTL', KEYS[i]) == -1 then
                redis.call('EXPIREAT', KEYS[i], tonumber(ARGV[#ARGV]))
            end
        end
    end
 
    return {all_pass and 1 or 0, results}
    """
 
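    def _cascade_script(self, keys: list, args: list):
        """Run the Lua cascade check, registering the script on first use."""
        script = getattr(self, "_registered_cascade", None)
        if script is None:
            script = self.redis.register_script(self.LUA_CASCADE_CHECK)
            self._registered_cascade = script
        return script(keys=keys, args=args)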
 
    def check_and_consume(self, user_id: str, org_id: str, team_id: str) -> dict:
        hierarchy = self.get_quota_hierarchy(user_id, org_id, team_id)
        keys = [h["key"] for h in hierarchy]
        limits = [h["limit"] for h in hierarchy]
        billing_period_end = self.billing_period_end_timestamp()
 
        result = self._cascade_script(
            keys=keys,
            args=limits + [billing_period_end]
        )
 
        allowed = bool(result[0])
        level_results = result[1]
 
        if not allowed:
            # Find which level blocked
            for i, level_result in enumerate(level_results):
                if level_result == 0:
                    h = hierarchy[i]
                    return {
                        "allowed": False,
                        "blocked_at": h["level"],
                        "entity": h["entity"],
                        "message": f"{h['level'].capitalize()} quota exceeded"
                    }
 
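        # Remaining calls = the tightest headroom across all three levels
        remaining = min(
            limit - int(self.redis.get(key) or 0)
            for key, limit in zip(keys, limits)
        )
        return {"allowed": True, "remaining": remaining}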
 
    def current_billing_period(self) -> str:
        now = time.gmtime()
        return f"{now.tm_year}-{now.tm_mon:02d}"
 
    def billing_period_end_timestamp(self) -> int:
        """Unix timestamp for the last second of the current billing month."""
        import calendar
        now = time.gmtime()
        last_day = calendar.monthrange(now.tm_year, now.tm_mon)[1]
        # time.gmtime() is UTC, so convert with timegm (mktime assumes local time)
        return calendar.timegm((now.tm_year, now.tm_mon, last_day, 23, 59, 59))

ADR-003: Quota Cascade

Date: 2024-05-20
Status: Accepted

Context:

  • Enterprise customers buying annual API quotas for their entire organization
  • No mechanism to prevent one power user from consuming the org's entire quota
  • Customer success team receiving complaints about "quota not shared fairly"

Decision:
Implement three-level quota hierarchy: Organization → Team → User.
All three levels are checked atomically in Redis using a Lua script.
Quotas are monthly-resetting, allocated by org admins in a self-service portal.

Consequences — Positive:

  • Org admins have full control over allocation to their teams
  • One user cannot exhaust the organization quota
  • Transparent: API response includes which level was exhausted

Consequences — Negative:

  • Allocation misconfiguration causes confusion (admin must keep levels in sync)
  • Complex Lua script for atomic multi-level check
  • Billing period reset logic adds complexity

When to Use

  • Multi-tenant B2B SaaS with enterprise customers
  • Organizations that need to allocate quotas across teams/departments
  • Situations where per-user limits alone are insufficient

When NOT to Use

  • B2C consumer apps (users don't form organizational hierarchies)
  • Simple per-user API products
  • When billing logic is external (e.g., external quota management service)

Pattern 4: Shadow Enforcement

Problem Context

Your service has no rate limiting and you need to add it without causing incidents. If you set the wrong limit and enforce immediately, you'll break real users on day one. You need a way to validate your limit values against real production traffic before enforcing.

Solution

Implement rate limiting in three phases:

  1. Shadow mode: Count requests, record violations, NO blocking. Monitor who would have been blocked.
  2. Warn mode: Serve the request normally, but add an X-RateLimit-Warning header telling the client it would have been rate limited. Still no blocking.
  3. Enforce mode: Full enforcement with 429 responses.

Move from shadow → warn → enforce over 2-4 weeks. Only move forward when violation rates are acceptable.

Architecture Diagram

Request comes in
      │
      ▼
  ┌──────────────────────┐
  │  Rate Limit Check    │
  │  (always runs)       │
  └──────────┬───────────┘
             │
             ▼
  ┌──────────────────────┐
  │ Would this be        │
  │ rate limited?        │
  └──────────┬───────────┘
             │
      ┌──────┴──────┐
      │ YES         │ NO
      ▼             ▼
  ┌────────┐   ┌──────────┐
  │ Check  │   │ Allow    │
  │ Mode   │   │ request  │
  └───┬────┘   └──────────┘
      │
 ┌────┴──────┬──────────────┐
 │ SHADOW    │ WARN         │ ENFORCE
 │           │              │
 │ Allow +   │ Allow +      │ Reject
 │ Log only  │ Warn header  │ HTTP 429
 └───────────┴──────────────┘

Implementation

from enum import Enum
import logging
 
class EnforcementMode(Enum):
    SHADOW = "shadow"   # Count but never block
    WARN = "warn"       # Allow but add warning header
    ENFORCE = "enforce" # Full enforcement
 
class ShadowEnforcement:
    """
    Progressive rate limit rollout.
    Mode is controlled by a feature flag (Redis or config service).
    """
 
    def __init__(self, limiter, mode_store, metrics):
        self.limiter = limiter
        self.mode_store = mode_store  # Redis-backed feature flag store
        self.metrics = metrics
        self.logger = logging.getLogger(__name__)
 
    def get_mode(self, endpoint: str) -> EnforcementMode:
        """Mode can be set globally or per-endpoint."""
        mode_str = self.mode_store.get(f"rl_mode:{endpoint}") \
                or self.mode_store.get("rl_mode:global") \
                or "shadow"
        return EnforcementMode(mode_str)
 
    async def __call__(self, request, call_next):
        if not should_rate_limit(request):
            return await call_next(request)
 
        endpoint = request.url.path
        user_id = getattr(request.state, "user_id", "anonymous")
        mode = self.get_mode(endpoint)
 
        # Always compute the rate limit result
        result = self.limiter.check(user_id, endpoint)
        would_be_blocked = not result["allowed"]
 
        # Track shadow metrics regardless of mode
        if would_be_blocked:
            self.metrics.increment(
                "rate_limit.would_block",
                tags={
                    "endpoint": endpoint,
                    "mode": mode.value,
                    "user_tier": getattr(request.state, "tier", "unknown")
                }
            )
            self.logger.info(
                "rate_limit_shadow",
                extra={
                    "user_id": user_id,
                    "endpoint": endpoint,
                    "mode": mode.value,
                    "limit": result["limit"],
                    "current": result["current"]
                }
            )
 
        # Behavior depends on mode
        if mode == EnforcementMode.ENFORCE and would_be_blocked:
            # Actually consume the token and return 429
            self.limiter.consume(user_id, endpoint)
            return build_429_response(result)
 
        elif mode == EnforcementMode.WARN and would_be_blocked:
            # Allow the request but warn
            response = await call_next(request)
            response.headers["X-RateLimit-Warning"] = (
                f"You are exceeding your rate limit. "
                f"Enforcement begins {self.get_enforce_date()}. "
                f"Limit: {result['limit']}/min, Current: {result['current']}/min"
            )
            return response
 
        # SHADOW mode or within limits: allow normally
        return await call_next(request)

Ops runbook for progressive rollout:

# Step 1: Deploy with shadow mode (default)
redis-cli SET rl_mode:global shadow
 
# Step 2: After 1 week, review shadow metrics
redis-cli GET rl_mode:global
# Review Datadog dashboard: "Rate Limit Shadow Violations by User"
# If violation rate > 5% of users: adjust limits up
# If violation rate < 0.1%: limits may be too high (consider lowering)
 
# Step 3: Move to warn mode (users get header, not blocked)
redis-cli SET rl_mode:global warn
 
# Step 4: After 1 more week and comms to impacted users, enforce
redis-cli SET rl_mode:global enforce
 
# Per-endpoint override if one endpoint needs different timeline:
redis-cli SET rl_mode:/api/v1/search enforce  # enforce search earlier
redis-cli SET rl_mode:/api/v1/reports warn    # keep reports in warn
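
The review step in the runbook can be turned into a small decision helper. The sketch below uses the runbook's two thresholds (more than 5% of users violating means the limits need raising; under 0.1% suggests they may be too high). The function name, its return strings, and the choice to treat everything in between as acceptable are assumptions for illustration.

```python
def rollout_recommendation(mode, violating_users, total_users):
    """Suggest the next rollout step from one review period of metrics."""
    if total_users == 0:
        return "wait: no traffic observed yet"

    rate = violating_users / total_users
    if rate > 0.05:
        # Too many real users would be blocked: fix limits before advancing
        return "raise limits, stay in " + mode

    next_mode = {"shadow": "warn", "warn": "enforce"}.get(mode)
    if next_mode is None:
        return "stay in " + mode  # already enforcing

    # Very few violations may mean the limit protects nothing
    note = " (limits may be too high)" if rate < 0.001 else ""
    return f"advance to {next_mode}{note}"
```

For example, 80 violating users out of 1,000 in shadow mode yields "raise limits, stay in shadow", while 10 out of 1,000 yields "advance to warn". A helper like this is best used to prompt a human decision rather than to flip the mode automatically.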

When to Use

  • Adding rate limiting to an existing system with real users for the first time
  • Changing rate limit values significantly (e.g., halving free tier limits)
  • Launching a new endpoint where the right limit is unknown
  • Before big limit changes that could trigger SLA violations for enterprise customers

When NOT to Use

  • New greenfield system (enforce from day one)
  • Security-critical endpoints (login, payment - enforce immediately)
  • When you are under active attack (shadow mode helps attackers)

Pattern 5: Adaptive Throttle

Problem Context

Static rate limits don't account for system health. When your database is at 95% CPU, even requests within rate limits can cause cascading failures. You need rate limits that automatically tighten when the system is stressed.

Solution

Implement dynamic rate limits that respond to real-time system health signals. When health degrades, limits decrease automatically. When health recovers, limits restore. This prevents overload without requiring manual intervention.

Architecture Diagram

Health Signal Sources:
  ┌───────────────┐  ┌──────────────┐  ┌───────────────────┐
  │ CPU / Memory  │  │ Error Rate   │  │  DB Connection    │
  │ Metrics       │  │ (p99 latency)│  │  Pool Utilization │
  └───────┬───────┘  └──────┬───────┘  └─────────┬─────────┘
          │                 │                     │
          └─────────────────┼─────────────────────┘
                            │
                   ┌────────▼────────┐
                   │  Health Score   │
                   │  Calculator     │
                   │  (0.0 - 1.0)    │
                   └────────┬────────┘
                            │
                   ┌────────▼────────┐
                   │  Limit Adjuster │
                   │                 │
                   │  base_limit x   │
                   │  health_factor  │
                   └────────┬────────┘
                            │
                   ┌────────▼────────┐
                   │   Rate Limiter  │
                   │  (dynamic limit)│
                   └─────────────────┘

Implementation

import time
import math
from dataclasses import dataclass
from typing import Protocol
 
@dataclass
class SystemHealth:
    cpu_percent: float        # 0-100
    error_rate: float         # 0.0-1.0 (fraction of requests erroring)
    p99_latency_ms: float     # milliseconds
    db_pool_pct: float        # 0-100 (percent of pool in use)
    memory_percent: float     # 0-100
 
 
class HealthCollector(Protocol):
    def collect(self) -> SystemHealth: ...
 
 
class AdaptiveThrottle:
    """
    Adjusts effective rate limits based on real-time system health.
 
    Each signal (CPU, error rate, p99 latency, DB pool) maps to a factor
    between 0.1 and 1.0, and the lowest factor multiplies the base limit:
      1.0:       every signal is below its warn threshold
      0.5 - 1.0: worst signal is between its warn and critical thresholds
      0.1 - 0.5: worst signal is at or past its critical threshold
    """
 
    # Tunable thresholds
    THRESHOLDS = {
        "cpu": {"warn": 70, "critical": 85},
        "error_rate": {"warn": 0.01, "critical": 0.05},
        "p99_latency_ms": {"warn": 500, "critical": 2000},
        "db_pool_pct": {"warn": 70, "critical": 90},
    }
 
    def __init__(self, base_limiter, health_collector: HealthCollector, redis_client):
        self.base_limiter = base_limiter
        self.health_collector = health_collector
        self.redis = redis_client
        self._health_cache_ttl = 5  # recalculate health every 5 seconds
        self._last_health_time = 0
        self._cached_factor = 1.0
 
    def compute_health_factor(self, health: SystemHealth) -> float:
        """
        Returns a multiplier 0.1 - 1.0 for the base rate limit.
        Components are independent - worst signal wins.
        """
        factors = []
 
        # CPU factor
        cpu = health.cpu_percent
        if cpu < self.THRESHOLDS["cpu"]["warn"]:
            factors.append(1.0)
        elif cpu < self.THRESHOLDS["cpu"]["critical"]:
            # Linear interpolation: 70%->1.0, 85%->0.5
            factors.append(1.0 - 0.5 * (cpu - 70) / 15)
        else:
            # 85%+ CPU: heavily throttle
            factors.append(max(0.1, 0.5 - 0.4 * (cpu - 85) / 15))
 
        # Error rate factor
        err = health.error_rate
        if err < self.THRESHOLDS["error_rate"]["warn"]:
            factors.append(1.0)
        elif err < self.THRESHOLDS["error_rate"]["critical"]:
            factors.append(0.75)
        else:
            factors.append(0.25)
 
        # P99 latency factor
        lat = health.p99_latency_ms
        if lat < self.THRESHOLDS["p99_latency_ms"]["warn"]:
            factors.append(1.0)
        elif lat < self.THRESHOLDS["p99_latency_ms"]["critical"]:
            factors.append(0.6)
        else:
            factors.append(0.2)
 
        # DB pool factor
        pool = health.db_pool_pct
        if pool < self.THRESHOLDS["db_pool_pct"]["warn"]:
            factors.append(1.0)
        elif pool < self.THRESHOLDS["db_pool_pct"]["critical"]:
            factors.append(0.5)
        else:
            factors.append(0.1)
 
        return min(factors)  # Worst signal determines throttle level
 
    def get_effective_limit(self, base_limit: int) -> int:
        now = time.time()
        if now - self._last_health_time > self._health_cache_ttl:
            health = self.health_collector.collect()
            self._cached_factor = self.compute_health_factor(health)
            self._last_health_time = now
 
            # Publish health factor for dashboards
            self.redis.set("system:health_factor", str(self._cached_factor), ex=30)
 
        effective = max(1, int(base_limit * self._cached_factor))
        return effective
 
    def is_allowed(self, identifier: str, base_limit: int) -> dict:
        effective_limit = self.get_effective_limit(base_limit)
        result = self.base_limiter.is_allowed(identifier, limit=effective_limit)
        result["base_limit"] = base_limit
        result["effective_limit"] = effective_limit
        result["health_factor"] = self._cached_factor
        return result
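
A worked example helps check the arithmetic. The standalone functions below restate the CPU interpolation and the worst-signal-wins rule from the class above so they can be run on their own; `cpu_factor` and `effective_limit` are hypothetical helper names, and the sample readings are made up.

```python
def cpu_factor(cpu_percent, warn=70.0, critical=85.0):
    """CPU component of the health factor, using the same piecewise formula."""
    if cpu_percent < warn:
        return 1.0
    if cpu_percent < critical:
        # Linear from 1.0 at the warn threshold down to 0.5 at critical
        return 1.0 - 0.5 * (cpu_percent - warn) / (critical - warn)
    # Linear from 0.5 at critical down to a floor of 0.1 at 100% CPU
    return max(0.1, 0.5 - 0.4 * (cpu_percent - critical) / (100.0 - critical))


def effective_limit(base_limit, factors):
    """Worst signal wins; never drop below 1 request."""
    return max(1, int(base_limit * min(factors)))


# CPU at 77.5% (halfway between warn and critical) gives 0.75;
# a DB pool at 75% utilization maps to 0.5 in the class above.
factors = [cpu_factor(77.5), 0.5]
limit_under_stress = effective_limit(1000, factors)
limit_healthy = effective_limit(1000, [cpu_factor(40.0), 1.0])
```

With these readings the DB pool is the worst signal, so a base limit of 1,000 drops to 500 even though CPU alone would have allowed 750. Once all signals are back under their warn thresholds the full 1,000 is restored.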

When to Use

  • High-traffic services where cascading failure is a real risk
  • Services with unpredictable traffic spikes that can overload downstream dependencies
  • When you want automatic protection without manual intervention during incidents

When NOT to Use

  • Low-traffic internal services where overload is unlikely
  • Services with SLA guarantees that prohibit degradation below a certain request rate
  • When health signals are unreliable or expensive to collect

Pattern 6: Cost-Weighted Bucket

Problem Context

Consider a GraphQL endpoint where query { user { id } } costs 1 ms and query { allUsers { posts { comments { likes } } } } costs 5,000 ms. Counting each as one request lets a single client consume 5,000x more resources while using up only one unit of its rate limit.

Solution

Assign a "cost" to each operation before execution. Deduct the cost from the token bucket instead of a flat count of 1. The bucket size represents compute units (e.g., 1,000 units/minute), not request count. Simple queries cost 1-5 units; complex queries cost 50-500 units.
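
A minimal sketch of the bucket side of this idea follows: a token bucket whose tokens are compute units and whose `try_consume` deducts a per-operation cost instead of 1. The class name, method name, and explicit `now` parameter are assumptions for illustration; the 1,000 units/minute figure comes from the paragraph above.

```python
class CostWeightedBucket:
    """Token bucket whose tokens are compute units rather than requests."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)
        self.last_refill = now

    def try_consume(self, cost, now):
        """Deduct `cost` units if available; return True when the request may run."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if cost > self.tokens:
            return False
        self.tokens -= cost
        return True


# 1,000 units per minute, as in the example above
bucket = CostWeightedBucket(capacity=1000, refill_per_second=1000 / 60)
cheap = [bucket.try_consume(cost=2, now=0.0) for _ in range(100)]   # 200 units
heavy_1 = bucket.try_consume(cost=500, now=0.0)   # 300 units left afterwards
heavy_2 = bucket.try_consume(cost=500, now=0.0)   # not enough units
heavy_3 = bucket.try_consume(cost=500, now=30.0)  # about 500 units refilled
```

One hundred cheap queries and one heavy query fit in the budget, the second heavy query is refused, and a third succeeds after half a minute of refill. A query whose cost exceeds the bucket capacity can never succeed in this sketch, so such queries are better rejected up front with a clear error than left to retry forever.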

Implementation

from graphql import parse, build_ast_schema
from typing import Any
 
class GraphQLCostAnalyzer:
    """
    Computes an estimated cost for a GraphQL query before execution.
    Based on field count, nesting depth, and list field multipliers.
    """
 
    # Cost weights by field category
    COSTS = {
        "default_field": 1,
        "list_field": 10,       # multiplied by each nested level
        "connection_field": 5,  # cursor-based paginated lists
        "mutation": 20,         # mutations are always more expensive
        "subscription": 50,     # persistent connections are expensive
        "search_field": 15,     # elasticsearch / complex search
    }
 
    EXPENSIVE_FIELDS = {"search", "allUsers", "feed", "timeline", "recommendations"}
    LIST_INDICATORS = {"list", "all", "feed", "results", "edges", "nodes", "items"}
 
    def compute_cost(self, query_str: str, variables: dict | None = None) -> int:
        try:
            document = parse(query_str)
        except Exception:
            # Unparseable queries fail at the parse step and never execute,
            # so charge only the minimum cost
            return self.COSTS["default_field"]
 
        total_cost = 0
        is_mutation = False
        is_subscription = False
 
        for definition in document.definitions:
            op_type = getattr(definition, 'operation', None)
            if op_type and op_type.value == 'mutation':
                is_mutation = True
            if op_type and op_type.value == 'subscription':
                is_subscription = True
 
            total_cost += self._analyze_selection_set(
                definition.selection_set,
                depth=0
            )
 
        if is_mutation:
            total_cost += self.COSTS["mutation"]
        if is_subscription:
            total_cost += self.COSTS["subscription"]
 
        return max(1, total_cost)
 
    def _analyze_selection_set(self, selection_set, depth: int) -> int:
        if not selection_set:
            return 0
 
        cost = 0
        depth_multiplier = 1.5 ** depth  # deeper nesting = exponentially more expensive
 
        for selection in selection_set.selections:
            field_name = getattr(selection, 'name', None)
            field_name = field_name.value if field_name else ""
 
            # Check the explicit expensive-field set first: names such as "allUsers"
            # and "feed" would otherwise be caught by the substring match below
            if field_name in self.EXPENSIVE_FIELDS:
                cost += int(self.COSTS["search_field"] * depth_multiplier)
            elif any(indicator in field_name.lower() for indicator in self.LIST_INDICATORS):
                cost += int(self.COSTS["list_field"] * depth_multiplier)
            else:
                cost += int(self.COSTS["default_field"] * depth_multiplier)
 
            # Recurse into nested selections
            if hasattr(selection, 'selection_set') and selection.selection_set:
                cost += self._analyze_selection_set(selection.selection_set, depth + 1)
 
        return cost
 
 
class CostWeightedLimiter:
    """
    Rate limiter where each request deducts its 'cost' from a token bucket.
    Token bucket capacity = max compute units per window.
    """
 
    LUA_COST_BUCKET = """
    local key = KEYS[1]
    local capacity = tonumber(ARGV[1])
    local refill_rate = tonumber(ARGV[2])  -- units per second
    local now = tonumber(ARGV[3])
    local cost = tonumber(ARGV[4])
 
    local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
    local tokens = tonumber(bucket[1]) or capacity
    local last_refill = tonumber(bucket[2]) or now
 
    -- Refill tokens based on elapsed time
    local elapsed = now - last_refill
    tokens = math.min(capacity, tokens + (elapsed * refill_rate))
 
    local allowed = 0
    if tokens >= cost then
        tokens = tokens - cost
        allowed = 1
    end

    -- Persist state on both paths so the refill timestamp stays current.
    -- HSET accepts multiple field/value pairs since Redis 4.0; HMSET is deprecated.
    redis.call('HSET', key, 'tokens', tokens, 'last_refill', now)
    redis.call('EXPIRE', key, 7200)
    return {allowed, math.floor(tokens), capacity}
    """
 
    def __init__(self, redis_client, analyzer: GraphQLCostAnalyzer):
        self.redis = redis_client
        self.analyzer = analyzer
        self._script = redis_client.register_script(self.LUA_COST_BUCKET)
 
    def check_graphql(
        self,
        user_id: str,
        query: str,
        tier: str = "free"
    ) -> dict:
        # Cost limits per tier (units per minute)
        TIER_CAPACITY = {
            "free": 500,
            "starter": 2000,
            "pro": 10000,
            "enterprise": 100000
        }
        capacity = TIER_CAPACITY.get(tier, 500)
        refill_rate = capacity / 60  # units per second
 
        cost = self.analyzer.compute_cost(query)
        key = f"cost_bucket:{user_id}"
 
        result = self._script(
            keys=[key],
            args=[capacity, refill_rate, time.time(), cost]
        )
 
        return {
            "allowed": bool(result[0]),
            "tokens_remaining": result[1],
            "capacity": result[2],
            "cost_charged": cost,
            "cost_rejected_reason": None if result[0] else f"Query cost {cost} exceeds remaining {result[1]} units"
        }
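
The refill-and-deduct arithmetic in the Lua script can be tried without Redis. The following is a minimal in-memory sketch, not the production path: the CostBucket class and the query costs (300 and 5 units) are illustrative assumptions, while the capacity and refill rate are the free-tier values from the code above.

```python
from dataclasses import dataclass


@dataclass
class CostBucket:
    """In-memory stand-in (hypothetical) for the Redis hash used by the Lua script."""
    capacity: float
    refill_rate: float  # units per second
    tokens: float
    last_refill: float

    def try_consume(self, cost: float, now: float) -> bool:
        # Refill based on elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        # Deduct the query's cost instead of a flat 1.
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Free tier: 500 units, refilled at 500/60 units per second.
bucket = CostBucket(capacity=500, refill_rate=500 / 60, tokens=500, last_refill=0.0)

results = [
    bucket.try_consume(cost=300, now=0.0),   # heavy query: allowed, 200 units left
    bucket.try_consume(cost=300, now=0.0),   # second heavy query: rejected
    bucket.try_consume(cost=5, now=0.0),     # cheap query still fits, 195 units left
    bucket.try_consume(cost=300, now=12.0),  # 12 s of refill adds 100 units: 295, rejected
    bucket.try_consume(cost=300, now=13.0),  # one more second of refill: allowed
]
print(results)
```

A client that sends one heavy query is throttled on the next heavy query but can still run cheap ones, which is the behaviour a flat request counter cannot express.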

When to Use

  • GraphQL APIs where query complexity varies wildly
  • REST APIs with endpoints that have very different resource costs
  • ML/AI APIs where different models or parameters have different costs
  • Any API where "1 request" is not a meaningful unit of resource consumption

Pattern 7: Tenant-Isolated Pool

Problem Context

In a multi-tenant SaaS, a single tenant sending a very high volume of calls can saturate the shared Redis instance that backs rate limiting, causing other tenants' rate limit checks to slow down or fail. One noisy neighbor affects everyone.

Solution

Each tenant gets a logically isolated Redis keyspace backed by a dedicated Redis connection pool. In extreme cases, high-value tenants get their own Redis instance. Rate limit operations for one tenant cannot impact another tenant's latency or availability.

Architecture Diagram

Tenant A (Free Tier)    Tenant B (Enterprise)    Tenant C (Pro)
       │                        │                       │
       ▼                        ▼                       ▼
  ┌─────────┐              ┌─────────┐             ┌─────────┐
  │ Pool A  │              │ Pool B  │             │ Pool C  │
  │ 5 conns │              │ 50 conns│             │ 25 conns│
  └────┬────┘              └────┬────┘             └────┬────┘
       │                        │                       │
       ▼                        ▼                       ▼
  ┌────────────┐         ┌───────────┐          ┌────────────┐
  │Shared Redis│         │ Dedicated │          │Shared Redis│
  │(free/pro)  │         │  Redis    │          │(free/pro)  │
  └────────────┘         │(enterprise│          └────────────┘
                         │  only)    │
                         └───────────┘

Implementation

import time
import redis
from typing import Dict
from dataclasses import dataclass
 
@dataclass
class TenantConfig:
    tenant_id: str
    tier: str
    rate_limit: int  # requests per minute
    redis_url: str   # can be shared or dedicated
    pool_size: int   # connection pool size
 
class TenantIsolatedLimiter:
    """
    Tenant-per-pool rate limiting.
    Prevents one tenant from impacting another's rate limit check latency.
    """
 
    # Shared Redis instances by tier
    TIER_REDIS = {
        "free": "redis://shared-free:6379",
        "starter": "redis://shared-standard:6379",
        "pro": "redis://shared-premium:6379",
        # enterprise gets dedicated Redis (loaded from DB per tenant)
    }
 
    TIER_POOL_SIZE = {
        "free": 5,
        "starter": 10,
        "pro": 25,
        "enterprise": 50,
    }
 
    TIER_KEY_PREFIX = {
        "free": "t_free",
        "starter": "t_std",
        "pro": "t_prem",
        "enterprise": "t_ent",
    }
 
    def __init__(self, tenant_config_store):
        self.config_store = tenant_config_store  # your DB/cache
        self._pools: Dict[str, redis.ConnectionPool] = {}
        self._clients: Dict[str, redis.Redis] = {}
 
    def _get_client(self, tenant_id: str) -> redis.Redis:
        if tenant_id not in self._clients:
            config = self.config_store.get(tenant_id)
            if not config:
                raise ValueError(f"Unknown tenant: {tenant_id}")
 
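            # Determine Redis URL: a per-tenant URL (e.g. a dedicated enterprise
            # Redis) takes precedence over the shared instance for the tenant's tier
            redis_url = config.redis_url or self.TIER_REDIS.get(config.tier)
            if not redis_url:
                raise ValueError(f"No Redis configured for tenant: {tenant_id}")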
 
            pool_size = self.TIER_POOL_SIZE.get(config.tier, 5)
 
            # Create isolated pool for this tenant
            pool = redis.ConnectionPool.from_url(
                redis_url,
                max_connections=pool_size,
                socket_keepalive=True,
                socket_timeout=0.5,   # fail fast if Redis is slow
                retry_on_timeout=False
            )
            self._pools[tenant_id] = pool
            self._clients[tenant_id] = redis.Redis(connection_pool=pool)
 
        return self._clients[tenant_id]
 
    def get_key(self, tenant_id: str, user_id: str, endpoint: str) -> str:
        """
        Namespace keys by tenant to prevent cross-tenant key collisions.
        Even on shared Redis instances, keys are isolated.
        """
        prefix = self.TIER_KEY_PREFIX.get(
            self.config_store.get(tenant_id).tier, "t"
        )
        window = int(time.time() // 60)
        return f"{prefix}:{tenant_id}:{user_id}:{endpoint}:{window}"
 
    def is_allowed(self, tenant_id: str, user_id: str, endpoint: str) -> dict:
        try:
            client = self._get_client(tenant_id)
            config = self.config_store.get(tenant_id)
            key = self.get_key(tenant_id, user_id, endpoint)
 
            pipe = client.pipeline(transaction=False)
            pipe.incr(key)
            pipe.expire(key, 120)
            count, _ = pipe.execute()
 
            return {
                "allowed": count <= config.rate_limit,
                "count": count,
                "limit": config.rate_limit,
                "tenant": tenant_id
            }
 
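        except (redis.ConnectionError, redis.TimeoutError) as e:
            # Tenant-isolated failure: only this tenant is affected.
            # TimeoutError is listed explicitly: socket_timeout raises it, and in
            # redis-py it is not a subclass of ConnectionError.
            return {"allowed": True, "mode": "fail_open", "error": str(e)}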
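
The isolation property itself can be illustrated without Redis. The sketch below is a simplified model under stated assumptions: the TenantBulkhead class is hypothetical and uses one bounded semaphore per tenant to stand in for a bounded connection pool. It models client-side concurrency only. Tenants on the same shared Redis instance still share that server's CPU.

```python
import threading


class TenantBulkhead:
    """Caps in-flight operations per tenant, analogous to a bounded connection pool."""

    def __init__(self, permits_by_tenant: dict[str, int]):
        self._sems = {
            tenant: threading.BoundedSemaphore(permits)
            for tenant, permits in permits_by_tenant.items()
        }

    def try_acquire(self, tenant_id: str) -> bool:
        # Non-blocking: a saturated tenant is refused immediately instead of queueing.
        return self._sems[tenant_id].acquire(blocking=False)

    def release(self, tenant_id: str) -> None:
        self._sems[tenant_id].release()


bulkhead = TenantBulkhead({"tenant_a": 2, "tenant_b": 2})

# tenant_a exhausts its own permits...
a_results = [bulkhead.try_acquire("tenant_a") for _ in range(3)]
# ...while tenant_b is unaffected.
b_result = bulkhead.try_acquire("tenant_b")

print(a_results, b_result)
```

The third acquisition for tenant_a fails while tenant_b still succeeds, which is the same containment the per-tenant pools aim for.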

When to Use

  • Multi-tenant SaaS with at least one enterprise customer
  • Systems where you have had or fear "noisy neighbor" Redis saturation
  • When per-tenant SLA guarantees are required
  • When tenants are at meaningfully different tiers (free vs enterprise)

Pattern 8: Idempotency Shield

Problem Context

A mobile client makes an API call that times out after 3 seconds. The client retries 3 times. The server actually processed the original request but the response was lost in transit. Result: the operation runs 4 times AND the user is charged 4 rate limit tokens for what is logically 1 operation.

Solution

Combine rate limiting with idempotency keys. The first call with an idempotency key is rate-limited normally. Subsequent calls with the same idempotency key within the TTL window return the cached result WITHOUT consuming additional rate limit tokens.

Implementation

import hashlib
import json
import time
 
class IdempotencyShield:
    """
    Rate limiter that is idempotency-key aware.
    Retries of the same logical operation do not consume additional rate limit tokens.
    """
 
    IDEMPOTENCY_TTL = 86400  # Idempotency window: 24 hours
 
    def __init__(self, redis_client, base_limiter):
        self.redis = redis_client
        self.base_limiter = base_limiter
 
    def extract_idempotency_key(self, request) -> str | None:
        """
        Accept idempotency key from standard header locations.
        """
        for header in ["Idempotency-Key", "X-Idempotency-Key", "X-Request-Id"]:
            key = request.headers.get(header)
            # Apply the character and length checks from validate_idempotency_key
            if key and self.validate_idempotency_key(key):
                return key
        return None
 
    def handle(self, request, user_id: str, handler_fn) -> dict:
        idempotency_key = self.extract_idempotency_key(request)
 
        if idempotency_key:
            # Check if we've seen this key before
            stored_key = f"idempotency:{user_id}:{idempotency_key}"
            cached = self.redis.get(stored_key)
 
            if cached:
                # Return cached response WITHOUT consuming rate limit tokens
                cached_response = json.loads(cached)
                return {
                    **cached_response,
                    "allowed": True,
                    "idempotency": "replay",
                    "rate_limit_consumed": False
                }
 
        # First time seeing this request (or no idempotency key)
        # Consume rate limit token normally
        rl_result = self.base_limiter.check(user_id)
        if not rl_result["allowed"]:
            return {"allowed": False, "rate_limited": True, **rl_result}
 
        # Execute the actual handler
        try:
            response = handler_fn(request)
 
            # Cache the response if idempotency key provided
            if idempotency_key:
                stored_key = f"idempotency:{user_id}:{idempotency_key}"
                self.redis.setex(
                    stored_key,
                    self.IDEMPOTENCY_TTL,
                    json.dumps({
                        "response": response,
                        "timestamp": time.time(),
                        "rate_limit_consumed": True
                    })
                )
 
            return {"allowed": True, "response": response, "idempotency": "new"}
 
        except Exception:
            # Don't cache failures, so a retry re-executes the handler.
            # Note: the rate limit token consumed above is not refunded.
            raise
 
    def validate_idempotency_key(self, key: str) -> bool:
        """Prevent key injection attacks."""
        if not key:
            return False
        allowed_chars = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_.")
        return all(c in allowed_chars for c in key) and 8 <= len(key) <= 128
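
The retry scenario from the problem statement can be replayed in memory to check the token accounting. This is a minimal sketch under assumptions: CountingLimiter and InMemoryShield are hypothetical stand-ins for base_limiter and the Redis-backed class above, and it ignores TTLs and concurrent first requests.

```python
class CountingLimiter:
    """Hypothetical stand-in for base_limiter: allows `limit` calls, then rejects."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def check(self, user_id: str) -> dict:
        if self.used >= self.limit:
            return {"allowed": False}
        self.used += 1
        return {"allowed": True}


class InMemoryShield:
    """Dict-backed version of the idempotency check (no TTL, single-threaded)."""

    def __init__(self, limiter: CountingLimiter):
        self.limiter = limiter
        self.cache: dict[tuple[str, str], str] = {}

    def handle(self, user_id: str, idempotency_key: str, handler_fn) -> dict:
        cache_key = (user_id, idempotency_key)
        if idempotency_key and cache_key in self.cache:
            # Replay: return the stored response, consume no token
            return {"response": self.cache[cache_key], "idempotency": "replay"}
        if not self.limiter.check(user_id)["allowed"]:
            return {"rate_limited": True}
        response = handler_fn()
        if idempotency_key:
            self.cache[cache_key] = response
        return {"response": response, "idempotency": "new"}


executions = []


def place_order() -> str:
    executions.append("order")
    return "order-created"


limiter = CountingLimiter(limit=2)
shield = InMemoryShield(limiter)

# Original call plus three retries carrying the same key.
outcomes = [shield.handle("user-1", "key-12345678", place_order) for _ in range(4)]

print([o["idempotency"] for o in outcomes])
print(len(executions), limiter.used)
```

The handler runs once and one token is consumed, even though the client sent four requests.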

When to Use

  • Payment, order placement, or any non-idempotent operations
  • Mobile clients that retry on timeout (network is unreliable)
  • Any endpoint where double-execution has real consequences
  • APIs where you already use idempotency keys for correctness

Pattern 9: Sidecar Enforcer

Problem Context

You have 20 microservices. Adding rate limiting logic to each one creates maintenance sprawl, and an update to a shared rate limiting library can break every service at once. You want rate limiting to be an infrastructure concern, not a code concern.

Solution

Deploy rate limiting as a sidecar proxy (Envoy or Nginx) that intercepts all traffic before it reaches the service container. The service has zero rate limiting code. The sidecar handles enforcement, observability, and routing to a global rate limit service.

Architecture Diagram

┌─────────────────────────────────────────────────────┐
│                     POD (k8s)                        │
│  ┌──────────────┐         ┌────────────────────┐    │
│  │    ENVOY     │         │  Service Container  │    │
│  │   SIDECAR    │──────▶  │  (no rate limit     │    │
│  │              │         │   code)             │    │
│  │  Intercepts  │         └────────────────────┘    │
│  │  all traffic │                                    │
│  └──────┬───────┘                                    │
└─────────┼───────────────────────────────────────────┘
          │ gRPC
          ▼
┌─────────────────────┐
│  Global Rate Limit  │
│  Service            │
│  (Lyft/Envoy impl)  │
│                     │
│  ┌───────────────┐  │
│  │ Redis Cluster │  │
│  └───────────────┘  │
└─────────────────────┘

Implementation (Envoy + Rate Limit Service)

# envoy.yaml - Sidecar configuration injected by Istio or manually
static_resources:
  clusters:
    - name: rate_limit_service
      type: STRICT_DNS
      connect_timeout: 0.5s
      http2_protocol_options: {}
      load_assignment:
        cluster_name: rate_limit_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: rate-limit-service.default.svc.cluster.local
                      port_value: 8081
 
  listeners:
    - name: inbound_listener
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 8080
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress
                http_filters:
                  - name: envoy.filters.http.ratelimit
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
                      domain: my-service
                      failure_mode_deny: false # fail-open when rate limit service unavailable
                      rate_limit_service:
                        grpc_service:
                          envoy_grpc:
                            cluster_name: rate_limit_service
                        transport_api_version: V3
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                route_config:
                  virtual_hosts:
                    - name: local_service
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
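                          route: { cluster: local_service }  # cluster for the app container (definition omitted here)
                          rate_limits:
                            # Descriptor 1: (user_id) -> per-user limit
                            - actions:
                                - request_headers:
                                    header_name: "x-user-id"
                                    descriptor_key: "user_id"
                            # Descriptor 2: (user_id, header_match=payment) -> per-user payment limit.
                            # Both actions are needed so the descriptor matches the nested config entry.
                            - actions:
                                - request_headers:
                                    header_name: "x-user-id"
                                    descriptor_key: "user_id"
                                - header_value_match:
                                    descriptor_value: "payment"
                                    headers:
                                      - name: ":path"
                                        prefix_match: "/api/v1/payment"

# rate-limit-service-config.yaml (Lyft/Envoy ratelimit service)
domain: my-service
descriptors:
  # A single user_id entry carries both limits; repeating the key at the
  # same level would define the same descriptor twice.
  - key: user_id
    # Per-user: 1000 requests/minute
    rate_limit:
      unit: MINUTE
      requests_per_unit: 1000
    # Per-user on payment endpoints: 10 requests/minute
    descriptors:
      - key: header_match
        value: payment
        rate_limit:
          unit: MINUTE
          requests_per_unit: 10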
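
To see how a descriptor sent by the sidecar maps to a limit in a nested config like the one above, the lookup can be modelled in a few lines. This is a simplified sketch, not the ratelimit service's actual code: CONFIG and resolve_limit are hypothetical names, and the model only covers exact-value and key-only matching.

```python
from typing import Optional

# Plain-Python mirror of a nested descriptor config (hypothetical structure).
# A key of (name, None) matches any value for that descriptor name.
CONFIG = {
    ("user_id", None): {
        "limit": ("MINUTE", 1000),
        "children": {
            ("header_match", "payment"): {"limit": ("MINUTE", 10), "children": {}},
        },
    },
}


def resolve_limit(
    descriptor: list[tuple[str, str]], config: dict = CONFIG
) -> Optional[tuple[str, int]]:
    """Walk the descriptor entries down the config tree; return the limit at the last entry."""
    level = config
    node = None
    for key, value in descriptor:
        # Prefer an exact key/value entry, then fall back to a key-only entry.
        node = level.get((key, value)) or level.get((key, None))
        if node is None:
            return None  # no matching entry: this descriptor is not limited
        level = node["children"]
    return node.get("limit") if node else None


per_user = resolve_limit([("user_id", "u-42")])
per_user_payment = resolve_limit([("user_id", "u-42"), ("header_match", "payment")])
payment_only = resolve_limit([("header_match", "payment")])

print(per_user, per_user_payment, payment_only)
```

In this model a descriptor that carries only header_match finds no entry, because that key is nested under user_id. The per-user payment limit therefore applies only when the descriptor lists user_id first and header_match second.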

When to Use

  • Large microservices organizations (10+ services)
  • When you use Kubernetes with Istio or direct Envoy deployment
  • When rate limiting should be a platform concern, not a team concern
  • When you want zero rate-limiting code in your services

When NOT to Use

  • Small teams or few services (complexity isn't worth it)
  • When you need business-logic-aware limits (sidecar has no access to DB)
  • Monolithic applications

Pattern 10: Hybrid Approximate

Problem Context

At 100,000 RPS with Redis, the rate limiter itself becomes the bottleneck. Redis is receiving 100,000 INCR operations per second, consuming significant CPU. Network RTT adds 2ms to every single request. You need to dramatically reduce Redis load while keeping limits reasonably accurate.

Solution

Each application instance maintains a local token bucket. Periodically (every N seconds), it synchronizes with Redis: it reports its local usage, reads back the global count, and adjusts its local allocation. The local bucket absorbs burst traffic. Redis only handles inter-instance coordination.

Architecture Diagram

App Server 1    App Server 2    App Server 3
┌──────────┐    ┌──────────┐    ┌──────────┐
│ Local    │    │ Local    │    │ Local    │
│ Bucket   │    │ Bucket   │    │ Bucket   │
│ 33 tokens│    │ 33 tokens│    │ 33 tokens│
│ (1/3 of  │    │ (1/3 of  │    │ (1/3 of  │
│ 100/min) │    │ 100/min) │    │ 100/min) │
└────┬─────┘    └────┬─────┘    └────┬─────┘
     │               │               │
     │  Every 1 sec: sync with Redis │
     └───────────────┼───────────────┘
                     │
                ┌────▼─────┐
                │  Redis   │
                │ Global   │
                │ Counter  │
                └──────────┘

Redis receives:  3 syncs/sec instead of 100,000 ops/sec (99.997% reduction)
Accuracy tradeoff: allows up to 10-20% over limit during 1-second sync window
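
The reduction figure follows from simple arithmetic, sketched below for one rate-limited key. The function name and parameters are illustrative assumptions. The model counts one sync per instance per interval and ignores that each sync may pipeline more than one Redis command.

```python
def redis_sync_load(total_rps: float, instances: int, sync_interval_s: float) -> dict:
    """Compare Redis operations per second for one rate-limited key."""
    direct_ops = total_rps                      # direct mode: one INCR per request
    hybrid_syncs = instances / sync_interval_s  # hybrid mode: one sync per instance per interval
    reduction_pct = (1 - hybrid_syncs / direct_ops) * 100
    return {
        "direct_ops_per_sec": direct_ops,
        "hybrid_syncs_per_sec": hybrid_syncs,
        "reduction_pct": round(reduction_pct, 3),
    }


# The three-server diagram above: 100,000 RPS, 3 instances, 1-second sync interval.
load = redis_sync_load(total_rps=100_000, instances=3, sync_interval_s=1.0)
print(load)
```

Redis load in the hybrid scheme depends on the instance count and sync interval, not on request rate, so doubling traffic leaves the number of syncs unchanged.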

Implementation

import threading
import time
import redis
 
class HybridApproximateLimiter:
    """
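    High-throughput rate limiter using local buckets + periodic Redis sync.
    Redis traffic scales with (instances / sync_interval) per identifier
    instead of with the request rate.
 
    Default: sync every 1 second. With 10 instances, one identifier at 100K RPS
    generates ~10 Redis syncs/sec instead of 100,000 INCRs/sec.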
    """
 
    def __init__(
        self,
        redis_client: redis.Redis,
        global_limit: int,
        window_seconds: int = 60,
        sync_interval: float = 1.0,   # seconds between Redis syncs
        num_instances: int = 10,       # total application instances (approximate)
    ):
        self.redis = redis_client
        self.global_limit = global_limit
        self.window_seconds = window_seconds
        self.sync_interval = sync_interval
        self.local_quota = global_limit // num_instances  # allocation per instance
 
        # Per-user local state
        self._local_counts: dict[str, dict] = {}
        self._lock = threading.Lock()
 
        # Start background sync thread
        self._sync_thread = threading.Thread(target=self._sync_loop, daemon=True)
        self._sync_thread.start()
 
    def is_allowed(self, identifier: str) -> bool:
        with self._lock:
            now = time.time()
            window_key = int(now // self.window_seconds)
 
            if identifier not in self._local_counts:
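                self._local_counts[identifier] = {
                    "window": window_key,
                    "local_used": 0,
                    "local_quota": self.local_quota,
                    "global_used": 0,
                    "last_synced": 0
                }
 
            state = self._local_counts[identifier]
 
            # New window: reset everything, including last_synced. A stale
            # last_synced would make the next sync compute a negative delta
            # and skip the Redis update.
            if state["window"] != window_key:
                state["window"] = window_key
                state["local_used"] = 0
                state["local_quota"] = self.local_quota
                state["global_used"] = 0
                state["last_synced"] = 0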
 
            # Check local quota first (no Redis involved)
            if state["local_used"] >= state["local_quota"]:
                return False  # Local quota exhausted, wait for sync
 
            state["local_used"] += 1
            return True
 
    def _sync_loop(self):
        """Background thread: periodically sync local counts to Redis."""
        while True:
            time.sleep(self.sync_interval)
            self._sync_all()
 
    def _sync_all(self):
        with self._lock:
            now = time.time()
            window_key = int(now // self.window_seconds)
            identifiers = list(self._local_counts.keys())
 
        for identifier in identifiers:
            self._sync_one(identifier, window_key)
 
    def _sync_one(self, identifier: str, window_key: int):
        """
        Push local count to Redis, get global count back,
        recalculate local quota.
        """
        with self._lock:
            if identifier not in self._local_counts:
                return
            state = self._local_counts[identifier]
            if state["window"] != window_key:
                return
            local_delta = state["local_used"] - state.get("last_synced", 0)
            state["last_synced"] = state["local_used"]
 
        if local_delta <= 0:
            return
 
        redis_key = f"hybrid:{identifier}:{window_key}"
        try:
            pipe = self.redis.pipeline()
            pipe.incrby(redis_key, local_delta)
            pipe.expire(redis_key, self.window_seconds * 2)
            results = pipe.execute()
            global_count = results[0]
 
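            # Recalculate remaining quota for this instance
            remaining_global = max(0, self.global_limit - global_count)
            # Assume equal distribution: this instance's share of the remaining
            # budget is local_quota / global_limit (i.e. 1 / num_instances),
            # instead of a hard-coded instance count
            new_local_quota = max(
                1, remaining_global * self.local_quota // max(1, self.global_limit)
            )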
 
            with self._lock:
                if identifier in self._local_counts:
                    state = self._local_counts[identifier]
                    state["global_used"] = global_count
                    state["local_quota"] = min(new_local_quota, self.local_quota)
                    if global_count >= self.global_limit:
                        state["local_quota"] = 0  # Global limit reached, stop local
 
        except redis.RedisError:
            pass  # Continue with existing local quota on Redis error

When to Use

  • Extreme-throughput systems (>50K RPS per user class)
  • When Redis latency overhead is measurably impacting p99 response time
  • When approximate limits (±15%) are acceptable for the use case
  • Internal high-frequency APIs, analytics ingest, logging endpoints

When NOT to Use

  • Security-critical endpoints (±15% over-limit is unacceptable for payments/auth)
  • When exact quota accounting is required for billing
  • Low-RPS systems where Redis overhead isn't an issue

ADR Template Reference

Use this template for documenting your own rate limiting architecture decisions.

# ADR-NNN: [Title]
 
**Date:** YYYY-MM-DD
**Status:** [Proposed | Accepted | Deprecated | Superseded by ADR-NNN]
**Deciders:** [List of people involved in the decision]
 
## Context
 
[What is the situation, constraint, or problem that requires a decision?
Be specific about the scale, users, endpoints, and failure modes involved.]
 
## Decision Drivers
 
- [Driver 1: e.g., "Must support 100K RPS with <5ms overhead"]
- [Driver 2: e.g., "No Redis license budget - must use DynamoDB"]
- [Driver 3: e.g., "Team has no distributed systems experience"]
 
## Options Considered
 
### Option A: [Name]
 
- Pros: ...
- Cons: ...
 
### Option B: [Name]
 
- Pros: ...
- Cons: ...
 
## Decision
 
[We chose Option X because... Reference the decision drivers explicitly.]
 
## Consequences
 
**Positive:**
 
- [Consequence 1]
 
**Negative / Accepted Trade-offs:**
 
- [Trade-off 1 and why it's acceptable]
 
## Implementation Notes
 
[Key technical details, configuration values, or gotchas to be aware of.]
 
## Review Date
 
[When should this decision be revisited? e.g., "When traffic exceeds 1M RPS"]

Pattern Comparison Matrix

| Pattern | Best For | Complexity | Latency Overhead | Accuracy | Redis Dependency |
|---|---|---|---|---|---|
| Gateway Sentinel | Multi-service orgs | Low | Medium | High | Yes |
| Layered Defense | Public APIs, DDoS risk | Medium | Low (CDN/Nginx = 0ms) | High | Optional at each layer |
| Quota Cascade | B2B multi-tenant | High | Medium | Exact | Yes |
| Shadow Enforcement | Adding RL to live system | Medium | None (shadow mode) | N/A | Yes |
| Adaptive Throttle | High-traffic, failure-prone | High | Low | Proportional | Yes |
| Cost-Weighted Bucket | GraphQL, ML, complex ops | High | Medium | By cost model | Yes |
| Tenant-Isolated Pool | Multi-tenant SaaS | Medium | Low | High | Yes (per-tenant) |
| Idempotency Shield | Payment, mobile, retries | Medium | Low | Exact | Yes |
| Sidecar Enforcer | Microservices, k8s | High (infra) | Low | High | Yes |
| Hybrid Approximate | Extreme RPS (>50K) | High | Near-zero | Approximate (±15%) | Yes (sync only) |

Series Complete. All supplements: