Rate Limiting - Supplement 4: Architecture Patterns and Decision-Making
Ten named architecture patterns used by real production systems.
Each pattern includes: Problem → Solution → Architecture Diagram → Code →
ADR (Architecture Decision Record) → Trade-offs → When/When-Not to Use.
Table of Contents
- Pattern 1: Gateway Sentinel
- Pattern 2: Layered Defense
- Pattern 3: Quota Cascade
- Pattern 4: Shadow Enforcement
- Pattern 5: Adaptive Throttle
- Pattern 6: Cost-Weighted Bucket
- Pattern 7: Tenant-Isolated Pool
- Pattern 8: Idempotency Shield
- Pattern 9: Sidecar Enforcer
- Pattern 10: Hybrid Approximate
- ADR Template Reference
- Pattern Comparison Matrix
Pattern 1: Gateway Sentinel
Problem Context
Your organization has dozens of microservices. Each team is reinventing rate limiting in their service using different libraries, different Redis key formats, and inconsistent policies. Some services have rate limiting, some don't. There is no unified enforcement or visibility.
Solution
Centralize ALL rate limiting logic at the API Gateway layer. Every request must pass through the gateway before reaching any service. Rate limit decisions are made once, consistently, with full observability in one place.
Architecture Diagram
┌──────────────────────────────────────────┐
│ INTERNET / CLIENTS │
└───────────────────┬──────────────────────┘
│
┌───────────────────▼──────────────────────┐
│ API GATEWAY │
│ │
│ ┌────────────┐ ┌─────────────────┐ │
│ │ Rate Limit │───▶│ Redis Cluster │ │
│ │ Engine │ │ (shared state) │ │
│ └─────┬──────┘ └─────────────────┘ │
│ │ │
│ ┌─────▼──────┐ │
│ │ 429 or │ │
│ │ Route to │ │
│ │ Service │ │
└───┴─────┬──────┴─────────────────────────┘
│
┌────────────────┼──────────────────┐
│ │ │
┌─────────▼──────┐ ┌───────▼─────┐ ┌─────────▼──────┐
│ Service A │ │ Service B │ │ Service C │
│ (no rate limit │ │ (no rate │ │ (no rate │
│ needed here) │ │ limit) │ │ limit) │
└─────────────────┘ └─────────────┘ └─────────────────┘
Implementation (Kong Declarative)
# kong.yml - All rate limiting defined centrally
_format_version: "3.0"
consumers:
- username: free-tier
custom_id: tier_free
- username: pro-tier
custom_id: tier_pro
- username: enterprise-tier
custom_id: tier_enterprise
plugins:
# Global default - applies to ALL routes unless overridden
- name: rate-limiting-advanced
config:
limit: [60]
window_size: [60]
sync_rate: 5 # sync with Redis every 5 seconds
strategy: redis
redis:
host: redis-master
port: 6379
database: 0
hide_client_headers: false
error_message: "Rate limit exceeded. See https://docs.example.com/rate-limits"
# Service-specific override: authentication (stricter)
- name: rate-limiting-advanced
service: auth-service
config:
limit: [5, 20] # 5/min AND 20/hour
window_size: [60, 3600]
strategy: redis
# Consumer-specific: Pro tier gets 10x
- name: rate-limiting-advanced
consumer: pro-tier
config:
limit: [600]
window_size: [60]
strategy: redis
# Consumer-specific: Enterprise tier (very high)
- name: rate-limiting-advanced
consumer: enterprise-tier
config:
limit: [10000]
window_size: [60]
strategy: redis
ADR-001: Gateway Sentinel
Date: 2024-01-15
Status: Accepted
Context:
- 12 microservices, 8 teams, zero consistent rate limiting
- Security audit revealed 3 services with no limits on sensitive endpoints
- Engineers spending time reimplementing the same patterns
Decision:
Consolidate all rate limiting at the API Gateway (Kong) backed by shared Redis.
Individual services MUST NOT implement their own rate limiting (lint rule enforced).
Consequences — Positive:
- Single place to audit and change rate limit policies
- Consistent response format (HTTP 429 + Retry-After) across all services
- Ops team can change limits without service deployments
- End-to-end visibility in one dashboard
Consequences — Negative:
- Gateway becomes a critical single point of failure for rate limiting
- Cannot implement business-logic-aware limits (for example, a tier looked up from the database)
- Teams lose autonomy to tune their own endpoints
Mitigation:
- Gateway in active-active HA configuration
- Services still validate authentication (gateway only does auth extraction)
- Business-logic limits added as a second layer in 2 critical services
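The "consistent response format (HTTP 429 + Retry-After)" consequence can be illustrated with a small sketch. It assumes a fixed-window counter whose windows are aligned to multiples of the window size, and the `X-RateLimit-*` header names are a common convention chosen for illustration, not something the ADR specifies. The function name `rate_limit_response` is hypothetical.

```python
def rate_limit_response(
    limit: int, used: int, window_seconds: int, now: int
) -> tuple[int, dict[str, str]]:
    """Return (status, headers) for a fixed-window counter.

    `used` is the counter value after the current request was counted.
    Windows are assumed to start at multiples of `window_seconds`.
    """
    seconds_until_reset = window_seconds - (now % window_seconds)
    headers = {
        # Header names are illustrative; gateways differ in what they emit.
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, limit - used)),
    }
    if used > limit:
        # Tell the client how long to wait until the window rolls over.
        headers["Retry-After"] = str(seconds_until_reset)
        return 429, headers
    return 200, headers


# The 61st request in a 60/min window, 10 seconds into the window
status, headers = rate_limit_response(limit=60, used=61, window_seconds=60, now=130)
```

Because every service sits behind the same gateway, clients only ever need to handle this one response shape.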
Trade-Offs
| Pro | Con |
|---|---|
| Zero rate-limit code in services | No access to service business context |
| Centralized policy management | Gateway is a SPOF for RL logic |
| Consistent UX across all APIs | Must redeploy/reconfigure gateway for limit changes |
| One place to monitor and alert | Hard to test rate limits in local dev |
When to Use
- Multi-service organization with inconsistent RL implementations
- Developer-facing API product (developer portal + usage dashboards)
- Teams lack expertise to implement correct distributed rate limiting
When NOT to Use
- Your service has complex business logic requirements for limits
- You want zero latency overhead (gateway adds ~2ms)
- Internal-only microservices where enforcement is overkill
Pattern 2: Layered Defense
Problem Context
A single rate limiting layer can be bypassed or overwhelmed. An IP rotation attack defeats user-level limits. A DDoS at Layer 3 overwhelms application-level checks. You need defense in depth.
Solution
Implement independent rate limiting at multiple layers. Each layer catches a different class of abuse:
- CDN/Edge: Volumetric DDoS, geographic abuse, known bad IPs
- Load Balancer: IP-level flood protection, basic bot mitigation
- API Gateway: API key or consumer-level limits
- Application: User-tier-aware, endpoint-specific, business logic
If an attacker bypasses one layer, subsequent layers still protect the system.
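As a rough illustration of the layering described above, the sketch below evaluates layers from the outside in and reports the first one that rejects a request. The `Layer` type, the request fields, and every threshold are hypothetical values chosen for the example; in a real deployment each layer runs in a different system (CDN, load balancer, gateway, application), not in one process.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Layer:
    name: str
    allows: Callable[[dict], bool]  # True if the request may pass this layer


def first_blocking_layer(layers: list[Layer], request: dict) -> Optional[str]:
    """Evaluate layers outermost-first; return the first layer that rejects."""
    for layer in layers:
        if not layer.allows(request):
            return layer.name
    return None  # every layer let the request through


# Hypothetical rules, one per layer, for illustration only
layers = [
    Layer("cdn", lambda r: r["ip"] not in {"203.0.113.9"}),
    Layer("load_balancer", lambda r: r["ip_requests_last_minute"] <= 200),
    Layer("api_gateway", lambda r: r["key_requests_last_minute"] <= 600),
    Layer("application", lambda r: r["user_requests_last_minute"] <= 60),
]

request = {
    "ip": "198.51.100.7",
    "ip_requests_last_minute": 50,    # passes the per-IP check
    "key_requests_last_minute": 50,   # passes the per-key check
    "user_requests_last_minute": 75,  # exceeds the per-user limit
}
blocked_by = first_blocking_layer(layers, request)
```

Here the request looks harmless to the outer layers and is only caught by the application layer, which is the case the pattern is designed for.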
Architecture Diagram
INTERNET
│
▼
┌──────────────────────────────────────────┐
│ LAYER 1: CDN (Cloudflare) │
│ - DDoS mitigation │ ← Blocks: Volumetric floods,
│ - IP reputation blocking │ known bad IPs,
│ - Geographic restrictions │ geographic violations
│ - Rate: 1000 req/5min per IP │
└──────────────────┬───────────────────────┘
│ (attacks blocked above don't reach here)
▼
┌──────────────────────────────────────────┐
│ LAYER 2: Load Balancer (Nginx) │
│ - Per-IP: 200 req/min │ ← Blocks: IP floods that passed CDN,
│ - Per-IP burst: 50 │ layer 7 DDoS attempts
│ - Connection limit: 100 concurrent/IP │
└──────────────────┬───────────────────────┘
│ (only reasonable-rate IPs reach here)
▼
┌──────────────────────────────────────────┐
│ LAYER 3: Application Middleware │
│ - Per-user: tier-based limit │ ← Blocks: Authenticated abuse,
│ - Per-endpoint: cost-weighted │ quota exhaustion,
│ - Per-feature-flag: new features │ insider threats
│ - Business logic (user tier) │
└──────────────────┬───────────────────────┘
│ (only valid, non-rate-limited requests reach here)
▼
┌──────────────┐
│ SERVICE │
│ (business │
│ logic) │
└──────────────┘
Implementation
# Application layer (after CDN and Nginx have already filtered)
from dataclasses import dataclass
from typing import Optional
@dataclass
class RateLimitConfig:
identifier: str
limit: int
window_seconds: int
layer: str
class LayeredDefenseMiddleware:
"""
Application-layer component of a three-layer defense.
CDN and Nginx provide layers 1 and 2.
This provides the business-context-aware layer 3.
"""
def __init__(self, redis_client, user_service):
self.redis = redis_client
self.user_service = user_service
def get_applicable_limits(self, request) -> list[RateLimitConfig]:
limits = []
user_id = getattr(request.state, "user_id", None)
endpoint = request.url.path
method = request.method
if user_id:
user = self.user_service.get_cached(user_id)
tier_limit = {
"free": 60,
"starter": 300,
"pro": 1000,
"enterprise": 10000
}.get(user.tier, 60)
limits.append(RateLimitConfig(
identifier=f"user:{user_id}",
limit=tier_limit,
window_seconds=60,
layer="user_tier"
))
# Endpoint-specific limit (e.g., expensive ML endpoint)
if endpoint.startswith("/api/v1/analyze"):
limits.append(RateLimitConfig(
identifier=f"user:{user_id}:analyze",
limit=max(1, tier_limit // 10), # 10% of normal limit
window_seconds=60,
layer="endpoint_specific"
))
# Always check global system limit (backstop)
limits.append(RateLimitConfig(
identifier="global:system",
limit=500_000,
window_seconds=60,
layer="global_backstop"
))
return limits
async def __call__(self, request, call_next):
if not should_rate_limit(request):
return await call_next(request)
configs = self.get_applicable_limits(request)
for config in configs:
result = self._check_redis(config)
if not result["allowed"]:
return build_429_response(result, config.layer)
return await call_next(request)
ADR-002: Layered Defense
Date: 2024-03-10
Status: Accepted
Context:
- Single application-layer rate limiter was being overwhelmed by botnets
- IP rotation attacks with thousands of IPs bypassed per-IP limits
- Infrastructure cost spiked during attacks even though business logic blocked abusive users
Decision:
Add CDN-level and Nginx-level rate limiting in front of the application.
Each layer operates independently with its own Redis or local state.
Consequences — Positive:
- 95% of attack traffic blocked before hitting application servers
- CDN/Nginx layers add no latency inside the application, because rejected requests never reach it
- Reduces application compute cost during attacks
- Each layer can be tuned independently
Consequences — Negative:
- Three different configuration locations to manage
- Risk of inconsistency between layers
- Debugging requires checking all three layers
When to Use
- Any public-facing system that has experienced or anticipates DDoS/abuse
- APIs with expensive compute where even processing rate-limited requests costs money
- When application-layer rate limiting alone isn't fast enough to stop floods
When NOT to Use
- Purely internal services with trusted callers
- Simple intranet tools
- When a CDN is cost-prohibitive
Pattern 3: Quota Cascade
Problem Context
You have a multi-tenant SaaS with organizations containing teams containing users. An organization purchases 100,000 API calls/month. The admin needs to allocate portions to teams, and teams allocate to users. When a quota is exceeded at any level, requests are rejected.
Solution
Implement hierarchical quota management where limits cascade from organization → team → user. Each level has its own quota counter. A request must pass ALL levels. When a quota is refilled at the top, it flows down (but teams/users retain their sub-allocations).
Architecture Diagram
Organization: ACME Corp
├── Monthly Quota: 100,000 calls
│ │
│ ├── Team: Engineering [Allocated: 60,000]
│ │ ├── User: alice@acme.com [Allocated: 20,000]
│ │ ├── User: bob@acme.com [Allocated: 20,000]
│ │ └── User: carol@acme.com [Allocated: 20,000]
│ │
│ └── Team: Marketing [Allocated: 40,000]
│ ├── User: dave@acme.com [Allocated: 15,000]
│ └── User: eve@acme.com [Allocated: 25,000]
│
│ Request from alice:
│ CHECK 1: alice used 19,999 of 20,000 -> PASS (1 remaining)
│ CHECK 2: Engineering used 59,999 of 60,000 -> PASS (1 remaining)
│ CHECK 3: ACME used 99,999 of 100,000 -> PASS (1 remaining)
│ REQUEST ALLOWED. All 3 counters incremented.
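The worked example above can be modeled in a few lines of plain Python. This is only an in-memory sketch of the all-or-nothing rule (counters are incremented only when every level has headroom); the dictionary keys are hypothetical names, and a production version needs the check and the increments to happen atomically in the shared store.

```python
from typing import Optional


def check_and_consume(
    usage: dict[str, int], limits: dict[str, int], levels: list[str]
) -> tuple[bool, Optional[str]]:
    """Increment every level only if every level still has quota left."""
    for level in levels:
        if usage[level] >= limits[level]:
            return False, level  # blocked here; no counter is touched
    for level in levels:
        usage[level] += 1
    return True, None


limits = {"user:alice": 20_000, "team:engineering": 60_000, "org:acme": 100_000}
usage = {"user:alice": 19_999, "team:engineering": 59_999, "org:acme": 99_999}
levels = ["user:alice", "team:engineering", "org:acme"]

first = check_and_consume(usage, limits, levels)   # all three levels pass
second = check_and_consume(usage, limits, levels)  # alice has now used her allocation
```

The second call is rejected at the user level, and because nothing is incremented on rejection, the team and organization counters are not charged for a request that never ran.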
Implementation
import json
import time
from typing import Optional
import redis
class QuotaCascade:
"""
Multi-level quota enforcement: User -> Team -> Organization
Uses a Redis Lua script for atomic multi-level checking.
"""
def __init__(self, redis_client: redis.Redis, db):
self.redis = redis_client
self.db = db # database for allocation lookup
def get_quota_hierarchy(self, user_id: str, org_id: str, team_id: str) -> list[dict]:
"""Build the quota check hierarchy for this request."""
# Cache quota allocations to avoid DB hit per-request
cache_key = f"quota_alloc:{org_id}:{team_id}:{user_id}"
cached = self.redis.get(cache_key)
if not cached:
alloc = self.db.get_quota_allocations(org_id, team_id, user_id)
# Assumes alloc is a JSON-serializable dict
self.redis.setex(cache_key, 300, json.dumps(alloc)) # cache 5 minutes
else:
alloc = json.loads(cached) # JSON, not eval(): never execute cached strings as code
period = self.current_billing_period()
return [
{
"key": f"quota:{period}:org:{org_id}",
"limit": alloc["org_monthly"],
"level": "organization",
"entity": org_id,
},
{
"key": f"quota:{period}:team:{org_id}:{team_id}",
"limit": alloc["team_monthly"],
"level": "team",
"entity": team_id,
},
{
"key": f"quota:{period}:user:{user_id}",
"limit": alloc["user_monthly"],
"level": "user",
"entity": user_id,
},
]
# Lua script: check and increment all quota levels atomically
LUA_CASCADE_CHECK = """
local results = {}
for i = 1, #KEYS do
local current = tonumber(redis.call('GET', KEYS[i]) or 0)
local limit = tonumber(ARGV[i])
if current >= limit then
results[i] = 0 -- exceeded at this level
else
results[i] = 1 -- allowed at this level
end
end
-- Only increment if ALL levels pass
local all_pass = true
for i = 1, #results do
if results[i] == 0 then
all_pass = false
break
end
end
if all_pass then
for i = 1, #KEYS do
redis.call('INCR', KEYS[i])
-- Set TTL to end of billing period if key is new
if redis.call('TTL', KEYS[i]) == -1 then
redis.call('EXPIREAT', KEYS[i], tonumber(ARGV[#ARGV]))
end
end
end
return {all_pass and 1 or 0, results}
"""
def _get_cascade_script(self):
"""Register the Lua script on first use and reuse it afterwards."""
if not hasattr(self, "_cascade_script"):
self._cascade_script = self.redis.register_script(self.LUA_CASCADE_CHECK)
return self._cascade_script
def check_and_consume(self, user_id: str, org_id: str, team_id: str) -> dict:
hierarchy = self.get_quota_hierarchy(user_id, org_id, team_id)
keys = [h["key"] for h in hierarchy]
limits = [h["limit"] for h in hierarchy]
billing_period_end = self.billing_period_end_timestamp()
result = self._get_cascade_script()(
keys=keys,
args=limits + [billing_period_end]
)
allowed = bool(result[0])
level_results = result[1]
if not allowed:
# Find which level blocked
for i, level_result in enumerate(level_results):
if level_result == 0:
h = hierarchy[i]
return {
"allowed": False,
"blocked_at": h["level"],
"entity": h["entity"],
"message": f"{h['level'].capitalize()} quota exceeded"
}
return {"allowed": True, "remaining": self._get_minimums(keys, limits)}
def current_billing_period(self) -> str:
now = time.gmtime()
return f"{now.tm_year}-{now.tm_mon:02d}"
def billing_period_end_timestamp(self) -> int:
"""Unix timestamp (UTC) for the last second of the current billing month."""
import calendar
now = time.gmtime()
last_day = calendar.monthrange(now.tm_year, now.tm_mon)[1]
# calendar.timegm reads the tuple as UTC; time.mktime would read it as local time
return calendar.timegm((now.tm_year, now.tm_mon, last_day, 23, 59, 59, 0, 0, 0))
ADR-003: Quota Cascade
Date: 2024-05-20
Status: Accepted
Context:
- Enterprise customers buying annual API quotas for their entire organization
- No mechanism to prevent one power user from consuming the org's entire quota
- Customer success team receiving complaints about "quota not shared fairly"
Decision:
Implement three-level quota hierarchy: Organization → Team → User.
All three levels are checked atomically in Redis using a Lua script.
Quotas are monthly-resetting, allocated by org admins in a self-service portal.
Consequences — Positive:
- Org admins have full control over allocation to their teams
- One user cannot exhaust the organization quota
- Transparent: API response includes which level was exhausted
Consequences — Negative:
- Allocation misconfiguration causes confusion (admin must keep levels in sync)
- Complex Lua script for atomic multi-level check
- Billing period reset logic adds complexity
When to Use
- Multi-tenant B2B SaaS with enterprise customers
- Organizations that need to allocate quotas across teams/departments
- Situations where per-user limits alone are insufficient
When NOT to Use
- B2C consumer apps (users don't form organizational hierarchies)
- Simple per-user API products
- When billing logic is external (e.g., external quota management service)
Pattern 4: Shadow Enforcement
Problem Context
Your service has no rate limiting and you need to add it without causing incidents. If you set the wrong limit and enforce immediately, you'll break real users on day one. You need a way to validate your limit values against real production traffic before enforcing.
Solution
Implement rate limiting in three phases:
- Shadow mode: Count requests, record violations, NO blocking. Monitor who would have been blocked.
- Warn mode: Send 200 with the header "X-RateLimit-Warning: You would have been rate limited". Still no blocking.
- Enforce mode: Full enforcement with 429 responses.
Move from shadow → warn → enforce over 2-4 weeks. Only move forward when violation rates are acceptable.
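The three phases above reduce to a small decision table. The sketch below is one way to express it as a pure function; the names `Mode` and `decide` and the shape of the returned dictionary are assumptions made for the example.

```python
from enum import Enum


class Mode(Enum):
    SHADOW = "shadow"
    WARN = "warn"
    ENFORCE = "enforce"


def decide(mode: Mode, over_limit: bool) -> dict:
    """Map (rollout mode, limiter verdict) to what the middleware should do."""
    if not over_limit:
        return {"block": False, "log_violation": False, "warn_header": False}
    if mode is Mode.ENFORCE:
        # Only enforce mode ever rejects a request (HTTP 429).
        return {"block": True, "log_violation": True, "warn_header": False}
    # Shadow and warn both let the request through and record the violation;
    # warn additionally attaches the warning header to the response.
    return {"block": False, "log_violation": True, "warn_header": mode is Mode.WARN}


shadow = decide(Mode.SHADOW, over_limit=True)
warn = decide(Mode.WARN, over_limit=True)
enforce = decide(Mode.ENFORCE, over_limit=True)
```

Violations are logged in every mode, so the same dashboards stay meaningful as the rollout moves from one phase to the next.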
Architecture Diagram
Request comes in
│
▼
┌──────────────────────┐
│ Rate Limit Check │
│ (always runs) │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Would this be │
│ rate limited? │
└──────────┬───────────┘
│
┌──────┴──────┐
│ YES │ NO
▼ ▼
┌────────┐ ┌──────────┐
│ Check │ │ Allow │
│ Mode │ │ request │
└───┬────┘ └──────────┘
│
┌────┴──────┬──────────────┐
│ SHADOW │ WARN │ ENFORCE
│ │ │
│ Allow + │ Allow + │ Reject
│ Log only │ Warn header │ HTTP 429
└───────────┴──────────────┘
Implementation
from enum import Enum
import logging
class EnforcementMode(Enum):
SHADOW = "shadow" # Count but never block
WARN = "warn" # Allow but add warning header
ENFORCE = "enforce" # Full enforcement
class ShadowEnforcement:
"""
Progressive rate limit rollout.
Mode is controlled by a feature flag (Redis or config service).
"""
def __init__(self, limiter, mode_store, metrics):
self.limiter = limiter
self.mode_store = mode_store # Redis-backed feature flag store
self.metrics = metrics
self.logger = logging.getLogger(__name__)
def get_mode(self, endpoint: str) -> EnforcementMode:
"""Mode can be set globally or per-endpoint."""
mode_str = self.mode_store.get(f"rl_mode:{endpoint}") \
or self.mode_store.get("rl_mode:global") \
or "shadow"
return EnforcementMode(mode_str)
async def __call__(self, request, call_next):
if not should_rate_limit(request):
return await call_next(request)
endpoint = request.url.path
user_id = getattr(request.state, "user_id", "anonymous")
mode = self.get_mode(endpoint)
# Always compute the rate limit result
result = self.limiter.check(user_id, endpoint)
would_be_blocked = not result["allowed"]
# Track shadow metrics regardless of mode
if would_be_blocked:
self.metrics.increment(
"rate_limit.would_block",
tags={
"endpoint": endpoint,
"mode": mode.value,
"user_tier": getattr(request.state, "tier", "unknown")
}
)
self.logger.info(
"rate_limit_shadow",
extra={
"user_id": user_id,
"endpoint": endpoint,
"mode": mode.value,
"limit": result["limit"],
"current": result["current"]
}
)
# Behavior depends on mode
if mode == EnforcementMode.ENFORCE and would_be_blocked:
# Actually consume the token and return 429
self.limiter.consume(user_id, endpoint)
return build_429_response(result)
elif mode == EnforcementMode.WARN and would_be_blocked:
# Allow the request but warn
response = await call_next(request)
response.headers["X-RateLimit-Warning"] = (
f"You are exceeding your rate limit. "
f"Enforcement begins {self.get_enforce_date()}. "
f"Limit: {result['limit']}/min, Current: {result['current']}/min"
)
return response
# SHADOW mode or within limits: allow normally
return await call_next(request)
Ops runbook for progressive rollout:
# Step 1: Deploy with shadow mode (default)
redis-cli SET rl_mode:global shadow
# Step 2: After 1 week, review shadow metrics
redis-cli GET rl_mode:global
# Review Datadog dashboard: "Rate Limit Shadow Violations by User"
# If violation rate > 5% of users: adjust limits up
# If violation rate < 0.1%: limits may be too high (consider lowering)
# Step 3: Move to warn mode (users get header, not blocked)
redis-cli SET rl_mode:global warn
# Step 4: After 1 more week and comms to impacted users, enforce
redis-cli SET rl_mode:global enforce
# Per-endpoint override if one endpoint needs different timeline:
redis-cli SET rl_mode:/api/v1/search enforce # enforce search earlier
redis-cli SET rl_mode:/api/v1/reports warn # keep reports in warn
When to Use
- Adding rate limiting to an existing system with real users for the first time
- Changing rate limit values significantly (e.g., halving free tier limits)
- Launching a new endpoint where the right limit is unknown
- Before big limit changes that could trigger SLA violations for enterprise customers
When NOT to Use
- New greenfield system (enforce from day one)
- Security-critical endpoints (login, payment - enforce immediately)
- When you are under active attack (shadow mode lets the attack traffic through)
Pattern 5: Adaptive Throttle
Problem Context
Static rate limits don't account for system health. When your database is at 95% CPU, even requests within rate limits can cause cascading failures. You need rate limits that automatically tighten when the system is stressed.
Solution
Implement dynamic rate limits that respond to real-time system health signals. When health degrades, limits decrease automatically. When health recovers, limits restore. This prevents overload without requiring manual intervention.
Architecture Diagram
Health Signal Sources:
┌───────────────┐ ┌──────────────┐ ┌───────────────────┐
│ CPU / Memory │ │ Error Rate │ │ DB Connection │
│ Metrics │ │ (p99 latency)│ │ Pool Utilization │
└───────┬───────┘ └──────┬───────┘ └─────────┬─────────┘
│ │ │
└─────────────────┼─────────────────────┘
│
┌────────▼────────┐
│ Health Score │
│ Calculator │
│ (0.0 - 1.0) │
└────────┬────────┘
│
┌────────▼────────┐
│ Limit Adjuster │
│ │
│ base_limit x │
│ health_factor │
└────────┬────────┘
│
┌────────▼────────┐
│ Rate Limiter │
│ (dynamic limit)│
└─────────────────┘
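The "base_limit x health_factor" step in the diagram can be sketched as below, assuming a single health score between 0.0 and 1.0 and the score bands listed in the AdaptiveThrottle docstring in the implementation. The function names are hypothetical, and the implementation in this section derives its factor differently, from the worst individual signal.

```python
def health_multiplier(score: float) -> float:
    """Map a 0.0-1.0 health score to a fraction of the base limit."""
    if score >= 0.9:
        return 1.0   # normal operations
    if score >= 0.7:
        return 0.75  # minor degradation
    if score >= 0.5:
        return 0.50  # moderate stress
    if score >= 0.3:
        return 0.25  # high stress
    return 0.10      # near failure: protect the system


def effective_limit(base_limit: int, score: float) -> int:
    """Scale the base limit, but never drop below 1 request per window."""
    return max(1, int(base_limit * health_multiplier(score)))


healthy = effective_limit(1000, 0.95)
stressed = effective_limit(1000, 0.6)
failing = effective_limit(1000, 0.1)
```

The floor of 1 keeps every client able to make at least occasional progress, even when the system is shedding most of its load.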
Implementation
import time
import math
from dataclasses import dataclass
from typing import Protocol
@dataclass
class SystemHealth:
cpu_percent: float # 0-100
error_rate: float # 0.0-1.0 (fraction of requests erroring)
p99_latency_ms: float # milliseconds
db_pool_pct: float # 0-100 (percent of pool in use)
memory_percent: float # 0-100
class HealthCollector(Protocol):
def collect(self) -> SystemHealth: ...
class AdaptiveThrottle:
"""
Adjusts effective rate limits based on real-time system health.
Health Score (0.0 = catastrophic, 1.0 = perfect):
0.9 - 1.0: Full limit (normal operations)
0.7 - 0.9: 75% of limit (minor degradation)
0.5 - 0.7: 50% of limit (moderate stress)
0.3 - 0.5: 25% of limit (high stress)
0.0 - 0.3: 10% of limit (near failure - protect the system)
"""
# Tunable thresholds
THRESHOLDS = {
"cpu": {"warn": 70, "critical": 85},
"error_rate": {"warn": 0.01, "critical": 0.05},
"p99_latency_ms": {"warn": 500, "critical": 2000},
"db_pool_pct": {"warn": 70, "critical": 90},
}
def __init__(self, base_limiter, health_collector: HealthCollector, redis_client):
self.base_limiter = base_limiter
self.health_collector = health_collector
self.redis = redis_client
self._health_cache_ttl = 5 # recalculate health every 5 seconds
self._last_health_time = 0
self._cached_factor = 1.0
def compute_health_factor(self, health: SystemHealth) -> float:
"""
Returns a multiplier 0.1 - 1.0 for the base rate limit.
Components are independent - worst signal wins.
"""
factors = []
# CPU factor
cpu = health.cpu_percent
if cpu < self.THRESHOLDS["cpu"]["warn"]:
factors.append(1.0)
elif cpu < self.THRESHOLDS["cpu"]["critical"]:
# Linear interpolation: 70%->1.0, 85%->0.5
factors.append(1.0 - 0.5 * (cpu - 70) / 15)
else:
# 85%+ CPU: heavily throttle
factors.append(max(0.1, 0.5 - 0.4 * (cpu - 85) / 15))
# Error rate factor
err = health.error_rate
if err < self.THRESHOLDS["error_rate"]["warn"]:
factors.append(1.0)
elif err < self.THRESHOLDS["error_rate"]["critical"]:
factors.append(0.75)
else:
factors.append(0.25)
# P99 latency factor
lat = health.p99_latency_ms
if lat < self.THRESHOLDS["p99_latency_ms"]["warn"]:
factors.append(1.0)
elif lat < self.THRESHOLDS["p99_latency_ms"]["critical"]:
factors.append(0.6)
else:
factors.append(0.2)
# DB pool factor
pool = health.db_pool_pct
if pool < self.THRESHOLDS["db_pool_pct"]["warn"]:
factors.append(1.0)
elif pool < self.THRESHOLDS["db_pool_pct"]["critical"]:
factors.append(0.5)
else:
factors.append(0.1)
return min(factors) # Worst signal determines throttle level
def get_effective_limit(self, base_limit: int) -> int:
now = time.time()
if now - self._last_health_time > self._health_cache_ttl:
health = self.health_collector.collect()
self._cached_factor = self.compute_health_factor(health)
self._last_health_time = now
# Publish health factor for dashboards
self.redis.set("system:health_factor", str(self._cached_factor), ex=30)
effective = max(1, int(base_limit * self._cached_factor))
return effective
def is_allowed(self, identifier: str, base_limit: int) -> dict:
effective_limit = self.get_effective_limit(base_limit)
result = self.base_limiter.is_allowed(identifier, limit=effective_limit)
result["base_limit"] = base_limit
result["effective_limit"] = effective_limit
result["health_factor"] = self._cached_factor
return result
When to Use
- High-traffic services where cascading failure is a real risk
- Services with unpredictable traffic spikes that can overload downstream dependencies
- When you want automatic protection without manual intervention during incidents
When NOT to Use
- Low-traffic internal services where overload is unlikely
- Services with SLA guarantees that prohibit degradation below a certain request rate
- When health signals are unreliable or expensive to collect
Pattern 6: Cost-Weighted Bucket
Problem Context
Consider a GraphQL endpoint where query { user { id } } costs 1ms and query { allUsers { posts { comments { likes } } } } costs 5,000ms. Treating both as an equivalent "1 request" lets a single client consume 5,000x more resources while counting only 1 against their rate limit.
Solution
Assign a "cost" to each operation before execution. Deduct the cost from the token bucket instead of a flat count of 1. The bucket size represents compute units (e.g., 1,000 units/minute), not request count. Simple queries cost 1-5 units; complex queries cost 50-500 units.
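A minimal in-memory sketch of the idea, assuming a bucket of 1,000 units that refills evenly over one minute. The class name `CostBucket` and the example costs are illustrative, and the clock is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class CostBucket:
    capacity: float        # maximum compute units the bucket can hold
    refill_per_sec: float  # units added back per second
    tokens: float          # units currently available
    last_refill: float     # timestamp of the last refill, in seconds

    def try_consume(self, cost: float, now: float) -> bool:
        """Refill for the elapsed time, then deduct `cost` if it fits."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # not enough units; the bucket is left unchanged


bucket = CostBucket(capacity=1000, refill_per_sec=1000 / 60, tokens=1000, last_refill=0.0)
results = [
    bucket.try_consume(500, now=0.0),  # complex query: fits
    bucket.try_consume(500, now=0.0),  # second complex query: drains the bucket
    bucket.try_consume(5, now=0.0),    # even a cheap query is rejected now
    bucket.try_consume(5, now=6.0),    # six seconds of refill make room again
]
```

Two expensive queries exhaust the same budget that would otherwise cover hundreds of cheap ones, which is the asymmetry a flat request count cannot express.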
Implementation
import time # used by CostWeightedLimiter for refill timestamps
from typing import Any
from graphql import parse
class GraphQLCostAnalyzer:
"""
Computes an estimated cost for a GraphQL query before execution.
Based on field count, nesting depth, and list field multipliers.
"""
# Cost weights by field category
COSTS = {
"default_field": 1,
"list_field": 10, # multiplied by each nested level
"connection_field": 5, # cursor-based paginated lists
"mutation": 20, # mutations are always more expensive
"subscription": 50, # persistent connections are expensive
"search_field": 15, # elasticsearch / complex search
}
EXPENSIVE_FIELDS = {"search", "allUsers", "feed", "timeline", "recommendations"}
LIST_INDICATORS = {"list", "all", "feed", "results", "edges", "nodes", "items"}
def compute_cost(self, query_str: str, variables: dict = None) -> int:
try:
document = parse(query_str)
except Exception:
return self.COSTS["default_field"] # cannot parse = low cost
total_cost = 0
is_mutation = False
is_subscription = False
for definition in document.definitions:
op_type = getattr(definition, 'operation', None)
if op_type and op_type.value == 'mutation':
is_mutation = True
if op_type and op_type.value == 'subscription':
is_subscription = True
total_cost += self._analyze_selection_set(
definition.selection_set,
depth=0
)
if is_mutation:
total_cost += self.COSTS["mutation"]
if is_subscription:
total_cost += self.COSTS["subscription"]
return max(1, total_cost)
def _analyze_selection_set(self, selection_set, depth: int) -> int:
if not selection_set:
return 0
cost = 0
depth_multiplier = 1.5 ** depth # deeper nesting = exponentially more expensive
for selection in selection_set.selections:
field_name = getattr(selection, 'name', None)
field_name = field_name.value if field_name else ""
if any(indicator in field_name.lower() for indicator in self.LIST_INDICATORS):
cost += int(self.COSTS["list_field"] * depth_multiplier)
elif field_name in self.EXPENSIVE_FIELDS:
cost += int(self.COSTS["search_field"] * depth_multiplier)
else:
cost += int(self.COSTS["default_field"] * depth_multiplier)
# Recurse into nested selections
if hasattr(selection, 'selection_set') and selection.selection_set:
cost += self._analyze_selection_set(selection.selection_set, depth + 1)
return cost
class CostWeightedLimiter:
"""
Rate limiter where each request deducts its 'cost' from a token bucket.
Token bucket capacity = max compute units per window.
"""
LUA_COST_BUCKET = """
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2]) -- units per second
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])
local bucket = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(bucket[1]) or capacity
local last_refill = tonumber(bucket[2]) or now
-- Refill tokens based on elapsed time
local elapsed = now - last_refill
tokens = math.min(capacity, tokens + (elapsed * refill_rate))
if tokens >= cost then
tokens = tokens - cost
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 7200)
return {1, math.floor(tokens), capacity}
else
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 7200)
return {0, math.floor(tokens), capacity}
end
"""
def __init__(self, redis_client, analyzer: GraphQLCostAnalyzer):
self.redis = redis_client
self.analyzer = analyzer
self._script = redis_client.register_script(self.LUA_COST_BUCKET)
def check_graphql(
self,
user_id: str,
query: str,
tier: str = "free"
) -> dict:
# Cost limits per tier (units per minute)
TIER_CAPACITY = {
"free": 500,
"starter": 2000,
"pro": 10000,
"enterprise": 100000
}
capacity = TIER_CAPACITY.get(tier, 500)
refill_rate = capacity / 60 # units per second
cost = self.analyzer.compute_cost(query)
key = f"cost_bucket:{user_id}"
result = self._script(
keys=[key],
args=[capacity, refill_rate, time.time(), cost]
)
return {
"allowed": bool(result[0]),
"tokens_remaining": result[1],
"capacity": result[2],
"cost_charged": cost,
"cost_rejected_reason": None if result[0] else f"Query cost {cost} exceeds remaining {result[1]} units"
}
When to Use
- GraphQL APIs where query complexity varies wildly
- REST APIs with endpoints that have very different resource costs
- ML/AI APIs where different models or parameters have different costs
- Any API where "1 request" is not a meaningful unit of resource consumption
Pattern 7: Tenant-Isolated Pool
Problem Context
In a multi-tenant SaaS, a single tenant sending an unbounded volume of calls can saturate the shared Redis instance that backs rate limiting, causing OTHER tenants' rate limit checks to slow down or fail. One noisy neighbor affects everyone.
Solution
Each tenant gets a logically isolated Redis keyspace backed by a dedicated Redis connection pool. In extreme cases, high-value tenants get their own Redis instance. Rate limit operations for one tenant cannot impact another tenant's latency or availability.
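The isolation property can be sketched without Redis by treating each tenant's connection pool as a bounded set of slots. The class `TenantBulkheads`, the tenant names, and the slot counts are hypothetical; a semaphore stands in for a real connection pool here.

```python
import threading


class TenantBulkheads:
    """One bounded slot pool per tenant, standing in for a connection pool."""

    def __init__(self, slots_by_tenant: dict[str, int]):
        self._slots = {
            tenant: threading.BoundedSemaphore(count)
            for tenant, count in slots_by_tenant.items()
        }

    def try_acquire(self, tenant_id: str) -> bool:
        # Non-blocking: fail fast instead of queueing behind a busy tenant.
        return self._slots[tenant_id].acquire(blocking=False)

    def release(self, tenant_id: str) -> None:
        self._slots[tenant_id].release()


pools = TenantBulkheads({"tenant_a": 2, "tenant_b": 1})
a1 = pools.try_acquire("tenant_a")
a2 = pools.try_acquire("tenant_a")
a3 = pools.try_acquire("tenant_a")  # tenant_a has used both of its slots
b1 = pools.try_acquire("tenant_b")  # tenant_b is unaffected
```

Exhausting one tenant's slots only fails that tenant's next call. Other tenants keep their own capacity, which is the property the dedicated pools are meant to provide.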
Architecture Diagram
Tenant A (Free Tier) Tenant B (Enterprise) Tenant C (Pro)
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Pool A │ │ Pool B │ │ Pool C │
│ 5 conns │ │ 50 conns│ │ 20 conns│
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
▼ ▼ ▼
┌────────────┐ ┌───────────┐ ┌────────────┐
│Shared Redis│ │ Dedicated │ │Shared Redis│
│(free/pro) │ │ Redis │ │(free/pro) │
└────────────┘ │(enterprise│ └────────────┘
│ only) │
└───────────┘
Implementation
import time # used for the per-minute window in get_key
from dataclasses import dataclass
from typing import Dict
import redis
@dataclass
class TenantConfig:
tenant_id: str
tier: str
rate_limit: int # requests per minute
redis_url: str # can be shared or dedicated
pool_size: int # connection pool size
class TenantIsolatedLimiter:
"""
Tenant-per-pool rate limiting.
Prevents one tenant from impacting another's rate limit check latency.
"""
# Shared Redis instances by tier
TIER_REDIS = {
"free": "redis://shared-free:6379",
"starter": "redis://shared-standard:6379",
"pro": "redis://shared-premium:6379",
# enterprise gets dedicated Redis (loaded from DB per tenant)
}
TIER_POOL_SIZE = {
"free": 5,
"starter": 10,
"pro": 25,
"enterprise": 50,
}
TIER_KEY_PREFIX = {
"free": "t_free",
"starter": "t_std",
"pro": "t_prem",
"enterprise": "t_ent",
}
def __init__(self, tenant_config_store):
self.config_store = tenant_config_store # your DB/cache
self._pools: Dict[str, redis.ConnectionPool] = {}
self._clients: Dict[str, redis.Redis] = {}
def _get_client(self, tenant_id: str) -> redis.Redis:
if tenant_id not in self._clients:
config = self.config_store.get(tenant_id)
if not config:
raise ValueError(f"Unknown tenant: {tenant_id}")
# Determine Redis URL
if config.tier == "enterprise" and config.dedicated_redis_url:
redis_url = config.dedicated_redis_url
else:
redis_url = self.TIER_REDIS[config.tier]
pool_size = self.TIER_POOL_SIZE.get(config.tier, 5)
# Create isolated pool for this tenant
pool = redis.ConnectionPool.from_url(
redis_url,
max_connections=pool_size,
socket_keepalive=True,
socket_timeout=0.5, # fail fast if Redis is slow
retry_on_timeout=False
)
self._pools[tenant_id] = pool
self._clients[tenant_id] = redis.Redis(connection_pool=pool)
return self._clients[tenant_id]
def get_key(self, tenant_id: str, user_id: str, endpoint: str) -> str:
"""
Namespace keys by tenant to prevent cross-tenant key collisions.
Even on shared Redis instances, keys are isolated.
"""
prefix = self.TIER_KEY_PREFIX.get(
self.config_store.get(tenant_id).tier, "t"
)
window = int(time.time() // 60)
return f"{prefix}:{tenant_id}:{user_id}:{endpoint}:{window}"
def is_allowed(self, tenant_id: str, user_id: str, endpoint: str) -> dict:
try:
client = self._get_client(tenant_id)
config = self.config_store.get(tenant_id)
key = self.get_key(tenant_id, user_id, endpoint)
pipe = client.pipeline(transaction=False)
pipe.incr(key)
pipe.expire(key, 120)
count, _ = pipe.execute()
return {
"allowed": count <= config.rate_limit,
"count": count,
"limit": config.rate_limit,
"tenant": tenant_id
}
except redis.ConnectionError as e:
# Tenant-isolated failure: only this tenant is affected
return {"allowed": True, "mode": "fail_open", "error": str(e)}When to Use
- Multi-tenant SaaS where any enterprise customer exists
- Systems where you have had or fear "noisy neighbor" Redis saturation
- When per-tenant SLA guarantees are required
- When tenants are at meaningfully different tiers (free vs enterprise)
Pattern 8: Idempotency Shield
Problem Context
A mobile client makes an API call that times out after 3 seconds. The client retries 3 times. The server actually processed the original request but the response was lost in transit. Result: the operation runs 4 times AND the user is charged 4 rate limit tokens for what is logically 1 operation.
Solution
Combine rate limiting with idempotency keys. The first call with an idempotency key is rate-limited normally. Subsequent calls with the same idempotency key within the TTL window return the cached result WITHOUT consuming additional rate limit tokens.
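The effect on token accounting can be shown with a small in-memory model. This is a sketch under simplifying assumptions: `InMemoryIdempotentLimiter`, its fixed per-user budget, and the `charge_card` operation are hypothetical, and the TTL on cached results is omitted for brevity.

```python
class InMemoryIdempotentLimiter:
    """Fixed token budget per user plus a replay cache keyed by idempotency key."""

    def __init__(self, tokens_per_user: int):
        self.tokens_per_user = tokens_per_user
        self.used: dict[str, int] = {}
        self.replays: dict[tuple[str, str], object] = {}

    def handle(self, user_id, idempotency_key, operation):
        cache_key = (user_id, idempotency_key)
        # Replay: return the stored result without touching the token budget.
        if idempotency_key is not None and cache_key in self.replays:
            return {"response": self.replays[cache_key], "replay": True}
        used = self.used.get(user_id, 0)
        if used >= self.tokens_per_user:
            return {"rate_limited": True}
        self.used[user_id] = used + 1  # first attempt consumes one token
        response = operation()
        if idempotency_key is not None:
            self.replays[cache_key] = response
        return {"response": response, "replay": False}


limiter = InMemoryIdempotentLimiter(tokens_per_user=10)
charges = []


def charge_card():
    charges.append("charge")
    return {"status": "charged"}


# One original call plus three retries, all carrying the same key.
results = [
    limiter.handle("user-1", "order-12345-attempt", charge_card) for _ in range(4)
]

print(len(charges), limiter.used["user-1"], [r["replay"] for r in results])
# 1 1 [False, True, True, True]
```

Four attempts result in one execution and one consumed token, which is the behavior the pattern is designed to provide.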
Implementation
import json
import time


class IdempotencyShield:
    """
    Rate limiter that is idempotency-key aware.
    Retries of the same logical operation do not consume additional rate limit tokens.
    """

    IDEMPOTENCY_TTL = 86400  # Idempotency window: 24 hours

    def __init__(self, redis_client, base_limiter):
        self.redis = redis_client
        self.base_limiter = base_limiter

    def validate_idempotency_key(self, key: str) -> bool:
        """Prevent key injection attacks."""
        if not key:
            return False
        allowed_chars = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_.")
        return all(c in allowed_chars for c in key) and 8 <= len(key) <= 128

    def extract_idempotency_key(self, request) -> str | None:
        """
        Accept idempotency key from standard header locations.
        Keys that fail validation are ignored.
        """
        for header in ["Idempotency-Key", "X-Idempotency-Key", "X-Request-Id"]:
            key = request.headers.get(header)
            if key and self.validate_idempotency_key(key):
                return key
        return None

    def handle(self, request, user_id: str, handler_fn) -> dict:
        idempotency_key = self.extract_idempotency_key(request)
        stored_key = None
        if idempotency_key:
            # Check if we've seen this key before.
            # Note: this get-then-setex sequence is not atomic, so two
            # concurrent requests with the same key can both execute.
            stored_key = f"idempotency:{user_id}:{idempotency_key}"
            cached = self.redis.get(stored_key)
            if cached:
                # Return cached response WITHOUT consuming rate limit tokens
                cached_response = json.loads(cached)
                return {
                    **cached_response,
                    "allowed": True,
                    "idempotency": "replay",
                    "rate_limit_consumed": False,
                }

        # First time seeing this request (or no idempotency key):
        # consume rate limit token normally
        rl_result = self.base_limiter.check(user_id)
        if not rl_result["allowed"]:
            return {"allowed": False, "rate_limited": True, **rl_result}

        # Execute the actual handler. Exceptions propagate without being
        # cached, so a failed attempt can be retried with the same key.
        response = handler_fn(request)

        # Cache the response if idempotency key provided
        # (the response must be JSON-serializable)
        if stored_key:
            self.redis.setex(
                stored_key,
                self.IDEMPOTENCY_TTL,
                json.dumps({
                    "response": response,
                    "timestamp": time.time(),
                    "rate_limit_consumed": True,
                }),
            )
        return {"allowed": True, "response": response, "idempotency": "new"}

When to Use
- Payment, order placement, or any non-idempotent operations
- Mobile clients that retry on timeout (network is unreliable)
- Any endpoint where double-execution has real consequences
- APIs where you already use idempotency keys for correctness
Pattern 9: Sidecar Enforcer
Problem Context
You have 20 microservices. Adding rate limiting logic to each one creates maintenance sprawl, and an update to a shared rate limiting library can break all services simultaneously. You want rate limiting to be an infrastructure concern, not a code concern.
Solution
Deploy rate limiting as a sidecar proxy (Envoy or Nginx) that intercepts all traffic before it reaches the service container. The service has zero rate limiting code. The sidecar handles enforcement, observability, and routing to a global rate limit service.
Architecture Diagram
┌─────────────────────────────────────────────────────┐
│                      POD (k8s)                      │
│  ┌──────────────┐        ┌────────────────────┐     │
│  │    ENVOY     │        │ Service Container  │     │
│  │   SIDECAR    │──────▶ │ (no rate limit     │     │
│  │              │        │  code)             │     │
│  │  Intercepts  │        └────────────────────┘     │
│  │  all traffic │                                   │
│  └──────┬───────┘                                   │
└─────────┼───────────────────────────────────────────┘
          │ gRPC
          ▼
┌─────────────────────┐
│  Global Rate Limit  │
│       Service       │
│  (Lyft/Envoy impl)  │
│                     │
│  ┌───────────────┐  │
│  │ Redis Cluster │  │
│  └───────────────┘  │
└─────────────────────┘
Implementation (Envoy + Rate Limit Service)
# envoy.yaml - Sidecar configuration injected by Istio or manually
static_resources:
  clusters:
    # The local_service cluster targeted by the route below must also be defined here.
    - name: rate_limit_service
      type: STRICT_DNS
      connect_timeout: 0.5s
      http2_protocol_options: {}
      load_assignment:
        cluster_name: rate_limit_service
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: rate-limit-service.default.svc.cluster.local
                      port_value: 8081
  listeners:
    - name: inbound_listener
      address:
        socket_address:
          address: 0.0.0.0
          port_value: 8080
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress
                http_filters:
                  - name: envoy.filters.http.ratelimit
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.ratelimit.v3.RateLimit
                      domain: my-service
                      failure_mode_deny: false  # fail-open when rate limit service unavailable
                      rate_limit_service:
                        grpc_service:
                          envoy_grpc:
                            cluster_name: rate_limit_service
                        transport_api_version: V3
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                route_config:
                  virtual_hosts:
                    - name: local_service
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route:
                            cluster: local_service
                            rate_limits:
                              # Descriptor 1: (user_id)
                              - actions:
                                  - request_headers:
                                      header_name: "x-user-id"
                                      descriptor_key: "user_id"
                              # Descriptor 2: (user_id, header_match=payment).
                              # Both actions sit in one list so the descriptor
                              # matches the nested rule in the service config.
                              - actions:
                                  - request_headers:
                                      header_name: "x-user-id"
                                      descriptor_key: "user_id"
                                  - header_value_match:
                                      descriptor_value: "payment"
                                      headers:
                                        - name: ":path"
                                          string_match:
                                            prefix: "/api/v1/payment"

# rate-limit-service-config.yaml (Lyft/Envoy ratelimit service)
domain: my-service
descriptors:
  # A key may appear only once per level, so the per-user limit and the
  # nested payment limit live under the same user_id entry.
  - key: user_id
    # Per-user: 1000 requests/minute
    rate_limit:
      unit: MINUTE
      requests_per_unit: 1000
    descriptors:
      # Per-user on payment endpoints: 10 requests/minute
      - key: header_match
        value: payment
        rate_limit:
          unit: MINUTE
          requests_per_unit: 10

When to Use
- Large microservices organizations (10+ services)
- When you use Kubernetes with Istio or direct Envoy deployment
- When rate limiting should be a platform concern, not a team concern
- When you want zero rate-limiting code in your services
When NOT to Use
- Small teams or few services (complexity isn't worth it)
- When you need business-logic-aware limits (sidecar has no access to DB)
- Monolithic applications
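To make the descriptor mechanics concrete, the following sketch models how a nested descriptor configuration could be resolved to a limit. It is a simplified illustration, not the Lyft/Envoy ratelimit service's actual code: `CONFIG` and `find_limit` are hypothetical, and the lookup order (exact key and value first, then key only) is an assumption of this model.

```python
from typing import Optional

# Hypothetical in-memory mirror of a nested descriptor config:
# 1000 requests/minute per user, 10 requests/minute per user on payment paths.
CONFIG = [
    {
        "key": "user_id",
        "limit": 1000,
        "descriptors": [
            {"key": "header_match", "value": "payment", "limit": 10},
        ],
    },
]


def find_limit(descriptor: list[tuple[str, str]], config: list[dict]) -> Optional[int]:
    """Return requests/minute for a descriptor, or None if no rule covers it."""
    level = config
    limit = None
    for key, value in descriptor:
        # Prefer a rule that pins both key and value, then fall back to key only.
        exact = [d for d in level if d["key"] == key and d.get("value") == value]
        key_only = [d for d in level if d["key"] == key and "value" not in d]
        matches = exact or key_only
        if not matches:
            return None  # the descriptor goes deeper than any configured rule
        rule = matches[0]
        limit = rule.get("limit")
        level = rule.get("descriptors", [])
    return limit


print(find_limit([("user_id", "alice")], CONFIG))
print(find_limit([("user_id", "alice"), ("header_match", "payment")], CONFIG))
print(find_limit([("header_match", "payment")], CONFIG))
```

In this model a descriptor carrying only `header_match` finds no rule, because that key is defined only beneath `user_id`. The sidecar therefore has to emit both entries together for the payment limit to apply.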
Pattern 10: Hybrid Approximate
Problem Context
At 100,000 RPS with Redis, the rate limiter itself becomes the bottleneck. Redis is receiving 100,000 INCR operations per second, consuming significant CPU. Network RTT adds 2ms to every single request. You need to dramatically reduce Redis load while keeping limits reasonably accurate.
Solution
Each application instance maintains a local token bucket. Periodically (every N seconds), it synchronizes with Redis: it reports its local usage, reads the global count, and recomputes its local allocation. The local bucket absorbs burst traffic. Redis only handles inter-instance coordination.
Architecture Diagram
App Server 1      App Server 2      App Server 3
┌──────────┐      ┌──────────┐      ┌──────────┐
│  Local   │      │  Local   │      │  Local   │
│  Bucket  │      │  Bucket  │      │  Bucket  │
│ 33 tokens│      │ 33 tokens│      │ 33 tokens│
│ (1/3 of  │      │ (1/3 of  │      │ (1/3 of  │
│ 100/min) │      │ 100/min) │      │ 100/min) │
└────┬─────┘      └────┬─────┘      └────┬─────┘
     │                 │                 │
     │   Every 1 sec: sync with Redis    │
     └─────────────────┼─────────────────┘
                       │
                  ┌────▼─────┐
                  │  Redis   │
                  │  Global  │
                  │ Counter  │
                  └──────────┘
Redis receives: 3 syncs/sec instead of 100,000 ops/sec (99.997% reduction)
Accuracy tradeoff: allows up to 10-20% over limit during 1-second sync window
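The reduction figure follows from simple arithmetic, sketched below. The function name `redis_sync_load` is hypothetical, and the calculation assumes one sync per instance per interval for a single active identifier.

```python
def redis_sync_load(instances: int, sync_interval_s: float, request_rps: float) -> dict:
    """Compare periodic sync traffic with one Redis operation per request."""
    syncs_per_sec = instances / sync_interval_s
    reduction_pct = (1 - syncs_per_sec / request_rps) * 100
    return {
        "syncs_per_sec": syncs_per_sec,
        "reduction_pct": round(reduction_pct, 3),
    }


# Three app servers syncing every second while serving 100,000 RPS in total.
load = redis_sync_load(instances=3, sync_interval_s=1.0, request_rps=100_000)
print(load)
```

With many active identifiers the sync count scales with the number of identifiers each instance has seen in the interval, so the real saving depends on traffic shape.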
Implementation
import threading
import time

import redis


class HybridApproximateLimiter:
    """
    High-throughput rate limiter using local buckets + periodic Redis sync.
    Redis load depends on the sync interval rather than the request rate:
    each instance sends one sync per active identifier per interval.
    Default: sync every 1 second. With 10 instances that is about
    10 syncs/sec per identifier, even at 100K RPS.
    """

    def __init__(
        self,
        redis_client: redis.Redis,
        global_limit: int,
        window_seconds: int = 60,
        sync_interval: float = 1.0,  # seconds between Redis syncs
        num_instances: int = 10,     # total application instances (approximate)
    ):
        self.redis = redis_client
        self.global_limit = global_limit
        self.window_seconds = window_seconds
        self.sync_interval = sync_interval
        self.num_instances = max(1, num_instances)
        self.local_quota = global_limit // self.num_instances  # allocation per instance
        # Per-user local state
        self._local_counts: dict[str, dict] = {}
        self._lock = threading.Lock()
        # Start background sync thread
        self._sync_thread = threading.Thread(target=self._sync_loop, daemon=True)
        self._sync_thread.start()

    def is_allowed(self, identifier: str) -> bool:
        with self._lock:
            window_key = int(time.time() // self.window_seconds)
            state = self._local_counts.get(identifier)
            # First request or new window: reset all counters, including
            # last_synced, so the first delta of the window is not negative
            if state is None or state["window"] != window_key:
                state = {
                    "window": window_key,
                    "local_used": 0,
                    "last_synced": 0,
                    "local_quota": self.local_quota,
                    "global_used": 0,
                }
                self._local_counts[identifier] = state
            # Check local quota first (no Redis involved)
            if state["local_used"] >= state["local_quota"]:
                return False  # Local quota exhausted, wait for sync
            state["local_used"] += 1
            return True

    def _sync_loop(self):
        """Background thread: periodically sync local counts to Redis."""
        while True:
            time.sleep(self.sync_interval)
            self._sync_all()

    def _sync_all(self):
        window_key = int(time.time() // self.window_seconds)
        with self._lock:
            identifiers = list(self._local_counts.keys())
        # Sync outside the lock: _sync_one acquires it again and
        # threading.Lock is not reentrant
        for identifier in identifiers:
            self._sync_one(identifier, window_key)

    def _sync_one(self, identifier: str, window_key: int):
        """
        Push local count to Redis, get global count back,
        recalculate local quota.
        """
        with self._lock:
            state = self._local_counts.get(identifier)
            if state is None or state["window"] != window_key:
                return
            local_delta = state["local_used"] - state["last_synced"]
            if local_delta <= 0:
                return
            state["last_synced"] = state["local_used"]

        redis_key = f"hybrid:{identifier}:{window_key}"
        try:
            pipe = self.redis.pipeline()
            pipe.incrby(redis_key, local_delta)
            pipe.expire(redis_key, self.window_seconds * 2)
            global_count = pipe.execute()[0]
        except redis.RedisError:
            # Continue with existing local quota on Redis error, and put the
            # unsent delta back so the next sync retries it
            with self._lock:
                state = self._local_counts.get(identifier)
                if state is not None and state["window"] == window_key:
                    state["last_synced"] -= local_delta
            return

        # Recalculate remaining quota for this instance,
        # assuming equal distribution among instances (rough estimate)
        remaining_global = max(0, self.global_limit - global_count)
        share = max(1, remaining_global // self.num_instances) if remaining_global > 0 else 0
        with self._lock:
            state = self._local_counts.get(identifier)
            if state is None or state["window"] != window_key:
                return
            state["global_used"] = global_count
            # local_used is cumulative for the window, so the new ceiling is
            # what this instance already used plus its share of what is left.
            # When the global limit is reached, share is 0 and local admission stops.
            state["local_quota"] = state["local_used"] + share

When to Use
- Extreme-throughput systems (>50K RPS per user class)
- When Redis latency overhead is measurably impacting p99 response time
- When approximate limits (±15%) are acceptable for the use case
- Internal high-frequency APIs, analytics ingest, logging endpoints
When NOT to Use
- Security-critical endpoints (±15% over-limit is unacceptable for payments/auth)
- When exact quota accounting is required for billing
- Low-RPS systems where Redis overhead isn't an issue
ADR Template Reference
Use this template for documenting your own rate limiting architecture decisions.
# ADR-NNN: [Title]
**Date:** YYYY-MM-DD
**Status:** [Proposed | Accepted | Deprecated | Superseded by ADR-NNN]
**Deciders:** [List of people involved in the decision]
## Context
[What is the situation, constraint, or problem that requires a decision?
Be specific about the scale, users, endpoints, and failure modes involved.]
## Decision Drivers
- [Driver 1: e.g., "Must support 100K RPS with <5ms overhead"]
- [Driver 2: e.g., "No Redis license budget - must use DynamoDB"]
- [Driver 3: e.g., "Team has no distributed systems experience"]
## Options Considered
### Option A: [Name]
- Pros: ...
- Cons: ...
### Option B: [Name]
- Pros: ...
- Cons: ...
## Decision
[We chose Option X because... Reference the decision drivers explicitly.]
## Consequences
**Positive:**
- [Consequence 1]
**Negative / Accepted Trade-offs:**
- [Trade-off 1 and why it's acceptable]
## Implementation Notes
[Key technical details, configuration values, or gotchas to be aware of.]
## Review Date
[When should this decision be revisited? e.g., "When traffic exceeds 1M RPS"]
Pattern Comparison Matrix
| Pattern | Best For | Complexity | Latency Overhead | Accuracy | Redis Dependency |
|---|---|---|---|---|---|
| Gateway Sentinel | Multi-service orgs | Low | Medium | High | Yes |
| Layered Defense | Public APIs, DDoS risk | Medium | Low (CDN/Nginx = 0ms) | High | Optional at each layer |
| Quota Cascade | B2B multi-tenant | High | Medium | Exact | Yes |
| Shadow Enforcement | Adding RL to live system | Medium | None (shadow mode) | N/A | Yes |
| Adaptive Throttle | High-traffic, failure-prone | High | Low | Proportional | Yes |
| Cost-Weighted Bucket | GraphQL, ML, complex ops | High | Medium | By cost model | Yes |
| Tenant-Isolated Pool | Multi-tenant SaaS | Medium | Low | High | Yes (per-tenant) |
| Idempotency Shield | Payment, mobile, retries | Medium | Low | Exact | Yes |
| Sidecar Enforcer | Microservices, k8s | High (infra) | Low | High | Yes |
| Hybrid Approximate | Extreme RPS (>50K) | High | Near-zero | Approximate (±15%) | Yes (sync only) |
Series Complete. All supplements: