Rate Limiting - Supplement 3: Trade-Offs and Decision Guide
A comprehensive decision guide for every major rate limiting choice.
Each section answers "when should I choose A over B?" with concrete criteria,
trade-off tables, and real-world context.
Table of Contents
- When to Rate Limit (and When NOT to)
- Where to Place the Rate Limiter: Full Decision Matrix
- Algorithm Trade-Offs: Deep Comparison
- Storage Backend Trade-Offs
- Centralized vs Decentralized vs Hybrid
- Accuracy vs Memory vs Latency Triangle
- Hard Limit vs Soft Limit vs Throttle vs Queue
- Per-User vs Per-IP vs Per-Key vs Global
- Fail-Open vs Fail-Closed: The Full Analysis
- Rate Limiting Library vs Build Your Own
- Infrastructure Cost Comparison
- Trade-Off Decision Trees
1. When to Rate Limit (and When NOT to)
Always Rate Limit These
| Scenario | Reason | Minimum Recommendation |
|---|---|---|
| Any public-facing API | Unknown clients, potential abuse | Always |
| Authentication endpoints (login, password reset) | Brute force prevention | 5-10 per minute per IP+username |
| Payment / financial operations | Fraud prevention, cost control | 10-25 per minute per user |
| Expensive compute endpoints (ML, search, reports) | CPU/GPU cost protection | 2-10 per hour per user |
| File upload/download endpoints | Bandwidth cost control | 50-100 per hour per user |
| Webhook/notification sending | Downstream service protection | 10-100 per minute per destination |
| Email/SMS sending endpoints | Spam prevention + cost | 10-30 per hour per user |
| Third-party API calls you relay | Respect upstream limits | Match upstream limit |
| WebSocket connections | Connection pool protection | 5-10 concurrent per user |
Rate Limit Carefully (Context Dependent)
| Scenario | When to Limit | When to Exempt or Raise |
|---|---|---|
| Internal service-to-service calls | When called service is shared and expensive | When caller is the only consumer |
| Batch processing endpoints | Always, but with much higher limits | Never exempt entirely |
| Read-only cached endpoints | IP-based flood protection only | Authenticated reads can be very high |
| Admin endpoints | Yes - even admins can have bugs | Admins get higher limits, not none |
| Health check endpoints | Never - must always be accessible | Always exempt from rate limits |
| Metrics/monitoring endpoints | Never - monitoring must always work | Always exempt |
When NOT to Rate Limit (or Exempt Entirely)
DO NOT rate limit:
- /health, /ping, /ready, /live (k8s probes, load balancer health checks)
- /metrics (Prometheus scraping)
- Internal monitoring agents
- Your own CDN origin pull requests
- Pre-flight CORS OPTIONS requests (or use very high limit)
Why: These endpoints being rate limited causes:
- Load balancers removing healthy instances from rotation
- Kubernetes killing pods that fail liveness/readiness probes
- Monitoring going blind right when you need it most (during incidents)
- CDNs caching or serving error responses when origin pulls for cache refresh are rejected
Implementation:

    # Exemption list - checked BEFORE rate limiting
    EXEMPT_PATHS = frozenset([
        "/health",
        "/ping",
        "/ready",
        "/live",
        "/metrics",
        "/favicon.ico",
    ])

    EXEMPT_PATH_PREFIXES = frozenset([
        "/internal/monitoring/",
        "/actuator/",  # Spring Boot actuator
        "/_ah/",       # Google App Engine health
    ])

    def should_rate_limit(request) -> bool:
        path = request.path
        if path in EXEMPT_PATHS:
            return False
        if any(path.startswith(p) for p in EXEMPT_PATH_PREFIXES):
            return False
        return True

2. Where to Place the Rate Limiter: Full Decision Matrix
The Seven-Layer Model
Layer 1: DNS / GeoDNS - Geographic blocking, routing
Layer 2: CDN / Edge (Cloudflare, Akamai, Fastly) - DDoS, IP blocking
Layer 3: Load Balancer (Nginx, HAProxy, AWS ALB) - IP rate limiting
Layer 4: API Gateway (Kong, AWS API GW, Apigee) - API key / user limits
Layer 5: Application Middleware / Filter - Business logic limits
Layer 6: Service Mesh Sidecar (Envoy, Istio) - Service-to-service limits
Layer 7: Database / Queue / External Service - Resource-level limits
Layer-by-Layer Trade-Off Analysis
Layer 2: CDN / Edge
| Dimension | Value |
|---|---|
| Latency added | 0ms (happens before request reaches origin) |
| Context available | IP address, HTTP method, path, headers |
| Context NOT available | User identity, session, business logic |
| Best for | DDoS mitigation, geographic blocking, bot filtering, IP floods |
| Worst for | User-level limits, subscription tier enforcement |
| Cost to configure | Low (rules-based UI) |
| Cost if wrong | Medium (can block legitimate users) |
| Example: Cloudflare Rule | "Block IP if >1000 requests/5 minutes to /api/*" |
Decision: Use Layer 2 if...
- You have a public-facing API with unknown clients
- You are being DDoS'd or scraped at high volume
- You need to block a geographic region for compliance
- You want defense before traffic reaches your servers
Do NOT rely on Layer 2 alone if...
- You need per-user or per-API-key rate limiting
- You need business logic (user tier, endpoint cost)
- You have users behind shared IPs (corporate, mobile)
Layer 3: Load Balancer
| Dimension | Value |
|---|---|
| Latency added | 0-1ms |
| Context available | IP, port, HTTP method, path |
| Context NOT available | Auth tokens, user identity, business context |
| Best for | Per-IP rate limiting across all upstream servers |
| Worst for | User-aware limits |
| Configuration complexity | Low (Nginx config) |
| State storage | Nginx shared memory (local only) or Lua module (OpenResty) + Redis |

    # Nginx: Per-IP, per-endpoint rate limiting
    # (limit_req_zone directives belong in the http {} context)
    limit_req_zone $binary_remote_addr zone=login:10m rate=5r/m;
    limit_req_zone $binary_remote_addr zone=api:10m rate=100r/m;
    limit_req_zone $http_x_api_key zone=apikey:20m rate=1000r/m;

    location /api/auth/login {
        limit_req zone=login burst=3 nodelay;
        limit_req_status 429;
        proxy_pass http://backend;
    }

    location /api/ {
        limit_req zone=api burst=20 nodelay;
        limit_req_status 429;  # without this, rejected requests get the default 503
        proxy_pass http://backend;
    }

When Nginx rate limiting is sufficient (no Redis needed):
- IP-based limits only
- Simple DDoS prevention
- No per-user or per-key logic needed
- Single load balancer (no need to share state)
Layer 4: API Gateway
| Dimension | Value |
|---|---|
| Latency added | 1-10ms (extra hop) |
| Context available | API keys, routes, consumers, plans |
| Context NOT available | Business logic (user tier in your DB) |
| Best for | API product monetization, developer portals, standard SaaS API |
| Worst for | Complex business logic limits, per-endpoint cost analysis |
| State storage | Gateway's own Redis / database |
Choose API Gateway (Layer 4) when:
- You are building an API product (developers are your customers)
- You have multiple APIs and want one rate limit policy
- You want self-service developer portal with usage dashboards
- You are on a managed platform (AWS API Gateway, Azure APIM)
Choose Application layer (Layer 5) instead when:
- Your rate limits depend on business logic (tier from your DB)
- Rate limits vary per endpoint based on computational cost
- You want fine-grained control without vendor lock-in
- Your API is internal (not developer-facing)
Layer 5: Application Middleware
| Dimension | Value |
|---|---|
| Latency added | 1-20ms (Redis round trip) |
| Context available | Everything (user, tier, session, business state) |
| State storage | Redis (required for distributed) |
| Best for | Fine-grained business-logic-aware rate limiting |
| Worst for | Very high RPS where every ms matters |
| Flexibility | Highest - full code control |
| Maintenance | Your team owns it |

    # Full context available at Layer 5
    # Request and Response come from your web framework; get_limit_for,
    # calculate_endpoint_cost and check_and_respond are application helpers.
    from typing import Optional

    def rate_limit_middleware(request: Request) -> Optional[Response]:
        user = request.state.user
        endpoint = request.url.path
        http_method = request.method
        # Business context available:
        limit = get_limit_for(
            tier=user.subscription_tier,  # from your DB (cached in Redis)
            endpoint=endpoint,
            method=http_method,
            # request.json() is awaitable in some frameworks (e.g. Starlette)
            cost=calculate_endpoint_cost(endpoint, request.json()),
            is_premium=user.is_premium,
            trust_level=user.trust_level,
        )
        return check_and_respond(request, limit)

Layer 6: Service Mesh
| Dimension | Value |
|---|---|
| Latency added | 0-2ms (sidecar is local) |
| Context available | Service identity, request metadata |
| Context NOT available | Business logic |
| Best for | Inter-service rate limiting, protecting services from other services |
| Worst for | Per-user or per-consumer limits |
| Requires | Istio, Envoy, or Linkerd deployed |

    # Istio: Service A can only send 100 RPS to Service B
    # (abbreviated: a deployable filter also needs the typed_config "@type" and a stat_prefix)
    apiVersion: networking.istio.io/v1alpha3
    kind: EnvoyFilter
    metadata:
      name: service-a-to-b-ratelimit
    spec:
      configPatches:
        - applyTo: HTTP_FILTER
          match:
            context: SIDECAR_OUTBOUND
            cluster:
              service: service-b.default.svc.cluster.local
          patch:
            operation: INSERT_BEFORE
            value:
              name: envoy.filters.http.local_ratelimit
              typed_config:
                token_bucket:
                  max_tokens: 100
                  tokens_per_fill: 100
                  fill_interval: 1s

Where-to-Place Decision Matrix
| Requirement | Layer 2 CDN | Layer 3 LB | Layer 4 GW | Layer 5 App | Layer 6 Mesh |
|---|---|---|---|---|---|
| DDoS protection | Best | Good | Poor | Poor | N/A |
| IP-based rate limiting | Best | Best | Good | Good | N/A |
| User-based rate limiting | No | No | Partial | Best | No |
| API key rate limiting | Limited | Limited | Best | Best | No |
| Subscription tier limits | No | No | Partial | Best | No |
| Service-to-service limits | No | No | No | Good | Best |
| GraphQL cost limiting | No | No | No | Best | No |
| Custom business logic | No | No | No | Best | No |
| Latency overhead | None | Minimal | Low-Medium | Medium | Low |
| Vendor lock-in risk | High | Low | Medium-High | None | Medium |
Recommended combination for most production systems:
Layer 2 (CDN) : DDoS + IP flood protection
Layer 3 (Nginx) : Rate limit at 10x the app limit (backstop)
Layer 5 (App) : Per-user business logic limits with Redis
Skip Layer 4 (API Gateway) unless you are building a developer platform.
Skip Layer 6 (Service Mesh) unless you are in a mature microservices org.
3. Algorithm Trade-Offs: Deep Comparison
Fixed Window vs Sliding Window Counter
| Criterion | Fixed Window | Sliding Window Counter |
|---|---|---|
| Memory | 1 counter per window per user | 2 counters per user (current + previous) |
| Accuracy | Lower (burst at window boundary) | High (~0.1% error) |
| Implementation complexity | Very low | Low |
| Redis commands | INCR, EXPIRE | INCR x2, GET x2, EXPIRE x2 (Lua) |
| Max spike at boundary | 2x the limit | ~1.05x the limit |
| Appropriate for... | Internal APIs, quick protections | Most production APIs |
Choose Fixed Window when:
- Simplicity is paramount (prototype, quick fix)
- The window-boundary burst (up to 2x the limit across two adjacent windows) is acceptable (internal, trusted callers)
- Memory is extremely constrained
- The limit is loose enough that 2x burst is fine
Choose Sliding Window Counter when:
- External or public-facing API
- Accurate limit enforcement matters
- Building a developer platform where customers count exact requests
- Limit is tight (e.g., 5 requests/minute for a sensitive endpoint)
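The difference between the two approaches is easiest to see in code. The following is a minimal in-process sketch of the sliding window counter, not a production implementation: the class name, the injectable `clock` parameter, and the in-memory dictionary (standing in for the two Redis counters) are all assumptions made for illustration.

```python
import time
from collections import defaultdict


class SlidingWindowCounter:
    """In-memory sliding window counter (illustrative, single process)."""

    def __init__(self, limit: int, window_seconds: int, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        # key -> {window_id: count}; only current and previous windows are kept
        self.counts = defaultdict(dict)

    def allow(self, key: str) -> bool:
        now = self.clock()
        current_id = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window

        windows = self.counts[key]
        # Drop anything older than the previous window.
        for window_id in list(windows):
            if window_id < current_id - 1:
                del windows[window_id]

        current = windows.get(current_id, 0)
        previous = windows.get(current_id - 1, 0)

        # Weight the previous window by how much of it still overlaps
        # the sliding window that ends "now".
        estimated = previous * (1.0 - elapsed_fraction) + current
        if estimated >= self.limit:
            return False

        windows[current_id] = current + 1
        return True
```

With a limit of 10 per 60 seconds, a client that spends all 10 requests at the end of one window is still rejected at the start of the next, because the previous window's count carries over with full weight. A plain fixed window would reset to zero at that point and admit another 10.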
Token Bucket vs Leaky Bucket
| Criterion | Token Bucket | Leaky Bucket |
|---|---|---|
| Burst support | YES - full capacity burst | NO - always fixed rate |
| Output smoothness | Variable (bursty) | Perfectly smooth |
| Memory | O(1): tokens + last_refill | O(capacity): queue |
| Complexity | Low | Medium |
| Latency added | None (instant decision) | Adds latency (request waits in queue) |
| CPU impact | None | Queue management overhead |
| Appropriate for... | User-facing APIs | Traffic shaping, DB protection |
Choose Token Bucket when:
- Users naturally have bursty access patterns (open app, load feed = 20 requests at once)
- You want to allow short bursts while limiting sustained throughput
- Low-latency decisions are required (no queuing)
- This is a sensible default for most API rate limiting
Choose Leaky Bucket when:
- You are protecting a downstream service that MUST receive smooth traffic
(e.g., a payment gateway that fails on spikes, an ML inference service with strict SLA)
- You are doing network traffic shaping (not HTTP APIs)
- You want to convert bursty inbound traffic into smooth outbound traffic
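To make the O(1) state and the instant decision concrete, here is a small token bucket sketch. It is an illustration under stated assumptions rather than a reference implementation: the class name, the `cost` parameter, and the injectable `clock` are hypothetical, and the state lives in process memory instead of Redis.

```python
import time


class TokenBucket:
    """Token bucket holding only two values: tokens and last_refill."""

    def __init__(self, capacity: float, refill_rate: float, clock=time.monotonic):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.tokens = capacity          # start full so an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill lazily based on elapsed time; never exceed capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket with `capacity=20` and `refill_rate=2` admits the 20-request burst from the feed-loading example immediately, then settles at 2 requests per second. A leaky bucket would instead hold those 20 requests in a queue and release them at the fixed rate.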
Sliding Window Log vs Sliding Window Counter
| Criterion | Sliding Window Log | Sliding Window Counter |
|---|---|---|
| Memory | O(limit) per user | O(1) per user |
| Accuracy | Perfect | ~0.1% error |
| Redis structure | Sorted Set (ZADD) | String (GET/INCR) |
| Redis commands per request | ZREMRANGEBYSCORE, ZADD, ZCARD | GET x2, INCR, EXPIRE x2 (Lua) |
| Memory at scale (limit=1000, 1M users) | ~8 GB of raw 8-byte timestamps (more with sorted-set overhead) | ~30 MB |
| Appropriate for... | Low-limit, high-security endpoints | General API endpoints |
Choose Sliding Window Log when:
- Limit is very low (5-20 requests/minute) so memory cost is negligible
- Perfect accuracy is required (authentication, payment, compliance)
- You can afford the memory cost
Choose Sliding Window Counter when:
- Limit is high (100+ per minute) and memory matters
- You have many users (100K+)
- 0.1% accuracy margin is acceptable (it is for 99% of use cases)
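The O(limit) memory cost of the log comes from storing one timestamp per admitted request. A minimal in-process sketch, assuming a `deque` per key in place of the Redis sorted set (the class name and `clock` parameter are illustrative):

```python
import time
from collections import defaultdict, deque


class SlidingWindowLog:
    """Exact sliding window: keeps one timestamp per admitted request."""

    def __init__(self, limit: int, window_seconds: float, clock=time.time):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self.logs = defaultdict(deque)  # key -> timestamps, oldest first

    def allow(self, key: str) -> bool:
        now = self.clock()
        log = self.logs[key]

        # Evict timestamps that have left the window
        # (the in-memory analogue of ZREMRANGEBYSCORE).
        cutoff = now - self.window
        while log and log[0] <= cutoff:
            log.popleft()

        if len(log) >= self.limit:
            return False

        log.append(now)
        return True
```

Each key holds at most `limit` timestamps, which is why this approach stays cheap for a 5-per-minute login limit and becomes expensive for a 1000-per-minute API limit across many users.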
4. Storage Backend Trade-Offs
Redis vs DynamoDB vs In-Memory vs PostgreSQL
| Property | Redis | DynamoDB | In-Memory | PostgreSQL |
|---|---|---|---|---|
| Latency | 0.1-2ms | 1-10ms | <0.1ms | 2-20ms |
| Throughput | 100K-1M ops/sec | 40K-400K WCU/sec | Millions/sec | 10K-100K/sec |
| Consistency | Strong (single node) | Eventual (default) | Local only | Strong |
| Persistence | Configurable (RDB/AOF) | Always persistent | No (lost on restart) | Always |
| Auto-expiry | YES (EXPIRE command) | YES (TTL attribute) | Via code | Via cron |
| Distributed | Yes (Cluster) | Yes (managed) | No | Yes (but slow) |
| Atomic operations | Yes (INCR, Lua) | Conditional writes | Yes (locked) | Yes (transactions) |
| Cost at scale | Infrastructure cost | Per-operation cost | Cheapest | Infrastructure |
| Managed service | ElastiCache | DynamoDB | N/A | RDS |
Choose Redis when:
- Highest throughput requirement (>10K RPS rate limit checks)
- Low latency is critical (adds <2ms to request)
- You need Lua scripting for complex atomic operations
- You already have Redis in your infrastructure
Choose DynamoDB when:
- You are all-in on AWS serverless
- You want zero infrastructure management
- You can tolerate ~5ms rate limit check latency
- You prefer pay-per-use pricing (no Redis instance to manage)
DynamoDB rate limiter:

    import time

    import boto3

    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("rate_limits")

    def is_allowed_dynamodb(user_id: str, limit: int, window: int) -> bool:
        now = int(time.time())
        window_id = now // window
        expires_at = (window_id + 2) * window  # TTL for DynamoDB auto-cleanup
        try:
            table.update_item(
                Key={"pk": f"rl:{user_id}:{window_id}"},
                UpdateExpression="SET #cnt = if_not_exists(#cnt, :zero) + :one, #ttl = :ttl",
                # "count" and "ttl" are reserved words, so both go through name placeholders.
                # The condition is checked atomically with the increment.
                ConditionExpression="attribute_not_exists(#cnt) OR #cnt < :limit",
                ExpressionAttributeNames={"#cnt": "count", "#ttl": "ttl"},
                ExpressionAttributeValues={
                    ":zero": 0,
                    ":one": 1,
                    ":ttl": expires_at,
                    ":limit": limit,
                },
                ReturnValues="UPDATED_NEW",
            )
            return True
        except dynamodb.meta.client.exceptions.ConditionalCheckFailedException:
            return False  # limit exceeded

Choose In-Memory when:
- Single instance deployment
- Sticky sessions guaranteed
- Rate limit state loss on restart is acceptable
- Maximum possible throughput needed (local ML inference services, etc.)
5. Centralized vs Decentralized vs Hybrid
Full Trade-Off Analysis
| Property | Centralized (Redis) | Decentralized (Local) | Hybrid |
|---|---|---|---|
| Accuracy | Exact | Per-server (N instances = N*limit) | Approximate (configurable) |
| Latency overhead | +1-5ms (Redis RTT) | 0ms (in-process) | <1ms avg |
| Failure mode | Redis down = fail-open or fail-closed (policy decision) | Server down = unaffected | Degrades to local |
| Implementation | Moderate | Simple | Complex |
| Works with horizontal scale | Yes | No (each server is independent) | Yes |
| Works with serverless | Yes | Partially (per-container) | Yes |
| Memory efficiency | Centralized | Distributed (duplicated) | Centralized with local cache |
When to Use Each
Centralized:
Use when:
- Exact enforcement is required (payments, authentication, compliance)
- You have more than 2 application instances
- You need consistent limits across all servers
- You already have Redis infrastructure
Acceptable drawback:
- +1-5ms per request (this is the price of accuracy)
Decentralized (Local):
Use when:
- Single-instance service (dev, small internal tool)
- Rate limiting the SERVICE (not individual users) - "protect this server"
- Outbound rate limiting (your service calling external APIs)
- You cannot afford Redis infrastructure
Not acceptable when:
- Multiple instances (limits multiply by instance count)
- You need accurate per-user enforcement
Hybrid:
Use when:
- Very high RPS where Redis latency is a concern (>50K RPS)
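- You want to cut Redis load substantially (roughly 80-90% in the examples in this guide)
- Approximate limits are acceptable (expect 10-30% over-limit, depending on how much each server reserves locally)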
Example at 100K RPS, limit=100/min, 10 servers:
Decentralized: each server enforces 100/min -> effective = 1000/min (wrong!)
Centralized: 100K Redis ops/sec (Redis CPU: ~60%)
Hybrid: Each server reserves 30 locally, checks global every 1 second
-> 10K Redis sync ops/sec instead of 100K (90% Redis load reduction)
-> Accuracy: within 30% of limit (acceptable for non-critical)
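One way to picture the hybrid approach is as batch reservation: each server asks the shared store for a block of quota and spends it locally. The sketch below is a reservation variant chosen for illustration, not the periodic-sync scheme in the example above. `CentralQuota` is a hypothetical stand-in for Redis, the window reset is omitted, and all names are assumptions.

```python
class CentralQuota:
    """Stand-in for the shared store (Redis in the text); one window only."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0
        self.calls = 0  # how many round trips the servers made

    def reserve(self, n: int) -> int:
        """Grant up to n requests from whatever quota is left."""
        self.calls += 1
        granted = max(0, min(n, self.limit - self.used))
        self.used += granted
        return granted


class HybridLimiter:
    """Per-server limiter that spends a locally reserved batch first."""

    def __init__(self, central: CentralQuota, batch_size: int):
        self.central = central
        self.batch_size = batch_size
        self.local_tokens = 0
        self.exhausted = False  # set once the central quota is used up

    def allow(self) -> bool:
        # Only go to the shared store when the local batch is empty.
        if self.local_tokens == 0 and not self.exhausted:
            self.local_tokens = self.central.reserve(self.batch_size)
            self.exhausted = self.local_tokens == 0
        if self.local_tokens > 0:
            self.local_tokens -= 1
            return True
        return False
```

This variant never admits more than the global limit, but a server can hold reserved quota it never uses, so it errs toward rejecting early instead of over-admitting. Larger batches mean fewer round trips and more stranded quota, which is the same accuracy-versus-load dial described above.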
6. Accuracy vs Memory vs Latency Triangle
There is a fundamental triangle trade-off. You cannot optimize all three simultaneously.

                    ACCURACY
                 (exact limits)
                       /\
                      /  \
                     /    \
                    /      \
                   /        \
                  /   Pick   \
                 /   Any 2    \
                /______________\
          MEMORY                LATENCY
     (O(1) per user)      (minimal overhead)

The Trade-Off Explained
Accuracy + Low Memory, Sacrificing Latency:
- Algorithm: Centralized Redis Sliding Window Counter (Lua script)
- Memory: O(1) per user (good)
- Accuracy: ~0.1% error (good)
- Latency: +2-5ms per request (Redis round trip, cannot avoid)
Accuracy + Low Latency, Sacrificing Memory:
- Algorithm: In-process Sliding Window Log
- Memory: O(limit) per user (poor at scale)
- Accuracy: Perfect
- Latency: Sub-millisecond (in-process, no Redis)
Low Memory + Low Latency, Sacrificing Accuracy:
- Algorithm: In-process Token Bucket (no Redis)
- Memory: O(1) per user (good)
- Accuracy: Per-server only (poor for distributed)
- Latency: Sub-millisecond
The Sweet Spot for Most Production Systems:
Algorithm: Redis Sliding Window Counter (Lua script)
Accept: +2ms latency for Redis
Get: High accuracy (~0.1% error), O(1) memory, works distributed
This covers 95% of production use cases. Only optimize further when:
- Latency SLA is extremely tight (<10ms total response time)
- Scale is extreme (>100K RPS)
Quantifying the Trade-Offs
At 10,000 RPS with 100,000 users, limit=100/min:
Option 1: Redis Sliding Window Counter (Lua)
Memory: 100,000 x 2 keys x 30 bytes = 6 MB Redis
Latency: +2ms per request (Redis RTT)
Accuracy: 99.9% (0.1% error from weighting)
Cost: 1 Redis node, ~$50-100/month
Option 2: Redis Sliding Window Log (Sorted Set)
Memory: 100,000 x 100 entries x 50 bytes = 500 MB Redis
Latency: +3ms per request (more Redis commands)
Accuracy: 100% (perfect)
Cost: 1 Redis node with more RAM, ~$150-200/month
Option 3: In-Process Token Bucket
Memory: 100,000 x 16 bytes = 1.6 MB per server x 10 servers = 16 MB
Latency: +0.01ms (in-process)
Accuracy: Poor when distributed (each of the 10 servers enforces the full limit on its own, so the worst case is limit x 10)
Cost: $0 extra (no Redis needed)
Problem: Users can make 1000/min instead of 100/min
Option 4: Hybrid (Local + Redis)
Memory: 16 MB local + 6 MB Redis
Latency: +0.01ms for 90% of requests, +2ms for 10%
Accuracy: ~95% (allows up to 30% over-limit in edge cases)
Cost: 1 Redis node, ~$50/month
Best for: Non-critical endpoints at extreme scale
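The memory figures above are simple products, so they are easy to re-run for your own user counts. A small sketch that reproduces them; the per-entry byte sizes are the estimates used in this section, not measured Redis overheads, and the function names are illustrative.

```python
def estimate_memory_bytes(users: int, entries_per_user: int,
                          bytes_per_entry: int, replicas: int = 1) -> int:
    """Rough memory estimate: users x entries x bytes, times copies held."""
    return users * entries_per_user * bytes_per_entry * replicas


def worst_case_effective_limit(limit: int, servers: int) -> int:
    """With purely local limiters, every server admits the full limit."""
    return limit * servers


USERS = 100_000

counter_bytes = estimate_memory_bytes(USERS, 2, 30)           # Option 1
log_bytes = estimate_memory_bytes(USERS, 100, 50)             # Option 2
local_bytes = estimate_memory_bytes(USERS, 1, 16, replicas=10)  # Option 3

print(f"Sliding window counter: {counter_bytes / 1e6:.0f} MB")
print(f"Sliding window log:     {log_bytes / 1e6:.0f} MB")
print(f"In-process buckets:     {local_bytes / 1e6:.0f} MB across 10 servers")
print(f"In-process worst case:  {worst_case_effective_limit(100, 10)} requests/min")
```

Changing `entries_per_user` for the log option shows how quickly its footprint grows with the limit, while the counter option stays flat.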
7. Hard Limit vs Soft Limit vs Throttle vs Queue
When to Use Each Response Strategy
| Strategy | Behavior | Response | When to Use |
|---|---|---|---|
| Hard Limit | Reject immediately | 429 | Security-critical, payment APIs |
| Soft Limit | Allow with warning | 200 + warning header | Developer APIs, gradual enforcement |
| Throttle (delay) | Queue and delay | 200 (after wait) | Background jobs, batch processing |
| Queue (async) | Accept, process later | 202 Accepted | Long-running ops, webhook dispatch |
| Shed (degrade) | Return cached/degraded | 200 (partial) | High availability priority |
Decision Framework

    Question 1: Can the request be safely rejected?
      If NO (e.g., user is in a checkout flow, payment in flight):
        Use Soft Limit or Queue, not Hard Limit
      If YES:
        Question 2: Is this a security-critical endpoint?
          If YES (login, payment, delete): Hard Limit
          If NO: Question 3
    Question 3: Is the caller a human or a machine?
      Human: Use Soft Limit (warn before cutting off)
      Machine: Use Hard Limit or Throttle (machines should handle 429)
    Question 4: Can the request be deferred?
      YES (report generation, bulk export, email send): Use Queue (202 Accepted)
      NO (real-time query, user is waiting): Hard Limit

Implementing Degraded Response (Graceful Degradation)

    # Response and fetch_fresh_data come from your framework / application;
    # self._rl_headers(result) is assumed to build the X-RateLimit-* headers.
    class GracefulRateLimiter:
        """
        Instead of hard-rejecting, return stale or partial data when over limit.
        Use only for read endpoints where stale data is acceptable.
        """

        def __init__(self, limiter, cache):
            self.limiter = limiter
            self.cache = cache

        def handle_request(self, request, user_id: str) -> Response:
            result = self.limiter.check(user_id)
            if result["allowed"]:
                # Normal path: fresh data
                data = fetch_fresh_data(request)
                self.cache.set(request.path, data, ttl=60)
                return Response(data, headers=self._rl_headers(result))

            # Over limit: try cached/stale data
            stale = self.cache.get_stale(request.path)
            if stale:
                return Response(
                    stale,
                    headers={
                        **self._rl_headers(result),
                        "X-Cache": "STALE",
                        "X-RateLimit-Degraded": "true",
                        # 110 is the HTTP warn-code for a stale response
                        "Warning": '110 - "Response is Stale"',
                    },
                )

            # No stale data available: hard reject
            return Response(
                {"error": "rate_limit_exceeded"},
                status=429,
                headers=self._rl_headers(result),
            )

8. Per-User vs Per-IP vs Per-Key vs Global
When Each Is the Right Identifier
| Identifier | Granularity | Authentication Required | Best For | Pitfalls |
|---|---|---|---|---|
| Per-User ID | Individual | Yes | All authenticated APIs | Multi-account abuse (pair with per-IP limits) |
| Per-API Key | Per credential | Yes (key-based) | Developer APIs, M2M | Key leakage, shared keys |
| Per-IP | IP-level | No | Unauthenticated endpoints | NAT, CGN, proxies |
| Per-IP + User-Agent | Better than IP | No | Unauthenticated + bot detect | Easily spoofed |
| Per-IP Subnet (/24) | Subnet-level | No | CGN, corporate networks | Whole company impacted |
| Global | System-wide | No | Infrastructure protection | Unfair (one user affects all) |
The Layered Identifier Strategy

    def get_rate_limit_identifiers(request) -> list[tuple[str, int]]:
        """
        Return list of (identifier, limit) pairs.
        ALL must pass for the request to be allowed.
        Each provides a different layer of protection.
        """
        identifiers = []
        window = 60  # 1 minute window for all

        # Layer 1: Per-user (most specific, highest limit)
        user_id = extract_user_id(request)
        if user_id:
            tier_limit = get_tier_limit(user_id)
            identifiers.append((f"user:{user_id}", tier_limit))

        # Layer 2: Per-API-key (for machine clients)
        api_key = request.headers.get("X-API-Key")
        if api_key:
            identifiers.append((f"apikey:{hash_key(api_key)}", 5000))

        # Layer 3: Per-IP (coarse, protects against unauthenticated floods)
        ip = get_real_ip(request)
        ip_limit = 1000 if is_cgn_ip(ip) else 200  # higher for NAT
        identifiers.append((f"ip:{ip}", ip_limit))

        # Layer 4: Global (system-wide protection)
        identifiers.append(("global:system", 1_000_000))

        return identifiers

The Fallback Chain
Best case: Authenticated user -> rate limit by user ID
Good: API key present -> rate limit by key hash
Fair: No auth, regular IP -> rate limit by IP (lower limit)
Coarse: No auth, CGN/corporate IP -> rate limit by IP with higher limit
Emergency: DDoS detected -> temporary geographic block at CDN
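The layered identifier strategy says that every (identifier, limit) pair must pass. One detail worth getting right is to check all layers before incrementing any of them, so a request rejected by one layer does not consume quota in the others. A minimal sketch using in-memory fixed-window counters; `LayeredLimiter` and its parameters are hypothetical, and expired windows are never cleaned up here.

```python
import time
from collections import defaultdict


class LayeredLimiter:
    """Admits a request only if every (identifier, limit) layer has room."""

    def __init__(self, window_seconds: int = 60, clock=time.time):
        self.window = window_seconds
        self.clock = clock
        self.counts = defaultdict(int)  # (identifier, window_id) -> count

    def allow(self, identifiers: list[tuple[str, int]]) -> bool:
        window_id = int(self.clock() // self.window)
        keys = [((ident, window_id), limit) for ident, limit in identifiers]

        # Phase 1: check every layer without changing any counter.
        if any(self.counts[key] >= limit for key, limit in keys):
            return False

        # Phase 2: all layers passed, so charge each of them once.
        for key, _ in keys:
            self.counts[key] += 1
        return True
```

In a Redis-backed version the two phases would need to run inside a single Lua script or transaction to stay atomic across concurrent requests.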
9. Fail-Open vs Fail-Closed: The Full Analysis
Decision Matrix by Endpoint Type
| Endpoint Type | Risk of Fail-Open | Risk of Fail-Closed | Recommendation |
|---|---|---|---|
| Public read API | Low | Medium | Fail-Open |
| User dashboard | Low | Medium | Fail-Open |
| Search API | Low | Medium | Fail-Open |
| Payment/charge API | Very High | Low | Fail-Closed |
| Authentication/login | High | Low | Fail-Closed |
| Account creation | High | Medium | Fail-Closed |
| Admin operations | High | Low | Fail-Closed |
| Password reset | High | Low | Fail-Closed |
| File upload | Medium | Medium | Local fallback |
| Report generation | Low | High (expensive) | Fail-Open with quota |
| Internal health check | N/A | Very High (monitoring blind) | Always Open |
The Local Fallback Strategy (Best of Both)

    from enum import Enum
    import time

    class FailPolicy(Enum):
        OPEN = "open"      # Allow all requests when Redis down
        CLOSED = "closed"  # Deny all requests when Redis down
        LOCAL = "local"    # Use local rate limiter as fallback

    class ResilientRateLimiter:
        """
        Three-mode rate limiter:
        - HEALTHY: Use Redis (accurate, distributed)
        - DEGRADED: Use local limiter (approximate, single-instance)
        - FAILED: Fail-open or fail-closed based on policy
        """

        def __init__(
            self,
            redis_limiter,
            local_limiter,
            policy: FailPolicy = FailPolicy.LOCAL,
            failure_threshold: int = 3,
            recovery_timeout: int = 30,
        ):
            self.redis_limiter = redis_limiter
            self.local_limiter = local_limiter
            self.policy = policy
            self.failure_count = 0
            self.failure_threshold = failure_threshold
            self.recovery_timeout = recovery_timeout
            self.circuit_opened_at = None

        def is_allowed(self, identifier: str) -> dict:
            # Check if circuit is open (Redis failed recently)
            if self.circuit_opened_at:
                if time.time() - self.circuit_opened_at > self.recovery_timeout:
                    # Try to recover
                    self.circuit_opened_at = None
                    self.failure_count = 0
                else:
                    return self._handle_degraded(identifier)
            try:
                result = self.redis_limiter.is_allowed(identifier)
                self.failure_count = 0  # Reset on success
                return result
            except Exception:
                self.failure_count += 1
                if self.failure_count >= self.failure_threshold:
                    self.circuit_opened_at = time.time()
                    # Alert!
                return self._handle_degraded(identifier)

        def _handle_degraded(self, identifier: str) -> dict:
            if self.policy == FailPolicy.OPEN:
                return {"allowed": True, "mode": "fail_open", "reason": "redis_unavailable"}
            if self.policy == FailPolicy.CLOSED:
                return {"allowed": False, "mode": "fail_closed", "reason": "redis_unavailable"}
            # LOCAL fallback
            local_result = self.local_limiter.is_allowed(identifier)
            local_result["mode"] = "local_fallback"
            local_result["note"] = "approximate_limits"
            return local_result

10. Rate Limiting Library vs Build Your Own
Library Options by Language
| Language | Library | Algorithm | Redis-backed | Notes |
|---|---|---|---|---|
| Java | Bucket4j | Token Bucket | Yes (Lettuce/Jedis) | Best Java option, production-grade |
| Java | Resilience4j RateLimiter | Fixed Window | No | Composes with circuit breakers, annotation-driven |
| Java | Guava RateLimiter | Token Bucket | No | Google Guava, local only |
| Python | Flask-Limiter | Configurable | Yes (Redis) | Flask-specific |
| Python | slowapi | Configurable | Yes (Redis) | FastAPI-native |
| Python | limits | All algorithms | Yes | Framework-agnostic |
| Node.js | express-rate-limit | Fixed/Sliding | Yes (RedisStore) | Most popular |
| Go | golang.org/x/time/rate | Token Bucket | No | Go team's x/ package (not in the standard library), local only |
| Go | throttled | GCRA | Yes (Redis) | Similar to Stripe's approach |
| Ruby | rack-attack | Configurable | Yes (Redis) | Rack middleware |
Build vs Buy Decision
Use a library when:
- Standard use case (fixed window, token bucket, sliding window counter)
- One of the supported languages above
- Speed to production matters
- You want community support and battle-tested code
- Library supports your required storage backend
Build your own when:
- You need custom algorithm behavior not supported by any library
- You need extreme performance optimization for your specific access patterns
- You are integrating with a novel storage backend
- You have compliance requirements that restrict open-source dependency usage
- Custom rate limit logic (e.g., cost-based per-query complexity)
The real question: How much is custom logic?
Pure fixed window with Redis? -> Use library (Flask-Limiter / express-rate-limit)
Token bucket with Redis? -> Use Bucket4j or equivalent
Custom GraphQL cost analysis? -> Build the cost analyzer, use library for the bucket
Adaptive rate limiting with ML? -> Build it (no library does this)
Multi-dimensional with 5 limits? -> Build on top of library primitives
11. Infrastructure Cost Comparison
Scenario: 100,000 active users, 10,000 RPS peak, 100 requests/minute limit per user
Option A: Redis Standalone (AWS ElastiCache)
Redis instance: cache.m5.large ($0.10/hour = $72/month)
Memory needed: 100,000 users x 2 keys x 60 bytes = 12 MB
Redis CPU at 10K RPS: ~15% (well within limits)
Latency overhead: +2ms per request
Annual cost: ~$864
Option B: Redis Cluster (3 primary + 3 replica)
Redis cluster: 6x cache.t3.micro ($0.017/hour x 6 = $74/month)
Memory: 12 MB split across 3 primaries
CPU: 3% per shard
Latency: +2ms (same as standalone for this scale)
Annual cost: ~$888
Use when: Availability is critical, not when scale demands it at this size
Option C: DynamoDB On-Demand
Writes (INCR equivalent): 10,000 writes/sec x $0.0000012 = $0.012/second
= ~$1,037/day = ~$31,100/month if the 10K RPS peak were sustained
Reads (0 - we don't read separately): $0
Storage: 12 MB x $0.25/GB/month = $0.003/month
Total: up to ~$31,100/month at sustained peak (bills track average traffic, which is usually well below peak)
Annual cost: up to ~$373,000 (vs $864 for Redis)
Use when: Serverless architecture, no Redis allowed, cost is secondary
Option D: Nginx Local (no Redis, per-IP only)
Additional infrastructure: $0 (uses existing Nginx)
Limitation: Per-IP only, no per-user limits
Use when: IP-based limits are sufficient (public site protection)
Annual cost: $0 additional
Cost Summary:
| Option | Monthly Cost | Latency | Accuracy | Scales to 1M users? |
|---|---|---|---|---|
| Redis Standalone | $72 | +2ms | 99.9% | Yes (upgrade instance) |
| Redis Cluster | $74 | +2ms | 99.9% | Yes (add shards) |
| DynamoDB On-Demand | Up to ~$31,100 (sustained peak) | +5ms | 99.9% | Yes (auto) |
| Nginx Local | $0 | +0ms | IP-only | No (IP-only) |
| In-Process | $0 | +0ms | Per-instance | No (N*limit) |
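Per-hour and per-request pricing are easy to mix up, so it helps to multiply both out over the same 30-day month. The sketch below does that with the unit prices quoted in this section; treat those prices as this guide's inputs, not verified current list prices. The function names are illustrative.

```python
HOURS_PER_MONTH = 24 * 30                    # 720, the month length used above
SECONDS_PER_MONTH = HOURS_PER_MONTH * 3600   # 2,592,000


def instance_monthly_cost(hourly_rate: float, nodes: int = 1) -> float:
    """Cost of always-on instances billed per hour."""
    return hourly_rate * nodes * HOURS_PER_MONTH


def per_write_monthly_cost(avg_writes_per_second: float,
                           price_per_write: float) -> float:
    """Cost of a pay-per-request store at a given AVERAGE write rate."""
    return avg_writes_per_second * price_per_write * SECONDS_PER_MONTH


redis_standalone = instance_monthly_cost(0.10)
redis_cluster = instance_monthly_cost(0.017, nodes=6)
dynamodb_at_peak = per_write_monthly_cost(10_000, 0.0000012)

print(f"Redis standalone:           ${redis_standalone:,.2f}/month")
print(f"Redis cluster (6 nodes):    ${redis_cluster:,.2f}/month")
print(f"DynamoDB at sustained peak: ${dynamodb_at_peak:,.2f}/month")
```

The pay-per-request figure depends entirely on the average write rate you pass in, so substitute your real average instead of the 10K RPS peak before comparing options.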
12. Trade-Off Decision Trees
Decision Tree 1: Choosing an Algorithm
START: What is your primary constraint?
Accuracy first? (security, payment, compliance)
-> Low user count (<10K)? -> Sliding Window Log
-> High user count (>10K)? -> Sliding Window Counter + Redis Lua (99.9% accurate)
Memory first? (very large user base, many keys)
-> Need burst support? -> Token Bucket or GCRA
-> No burst needed? -> Fixed Window or Sliding Window Counter
Latency first? (every millisecond matters)
-> Can you accept per-server limits? -> In-process Token Bucket
-> Need distributed accuracy? -> Hybrid (local cache + Redis)
Burst support is required?
-> Need smooth output to downstream? -> Token Bucket + Leaky Bucket combination
-> Just need burst allowance? -> Token Bucket
Simplicity first? (prototype, internal tool)
-> Fixed Window Counter
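GCRA appears in the tree above as a memory-lean alternative to the token bucket. A minimal sketch of the idea, assuming the common formulation that stores one "theoretical arrival time" per key; the class name, parameters, and in-memory dictionary are illustrative, not taken from any specific library.

```python
import time


class GCRA:
    """Generic Cell Rate Algorithm: one stored timestamp per key."""

    def __init__(self, rate_per_second: float, burst: int, clock=time.monotonic):
        self.emission_interval = 1.0 / rate_per_second  # ideal gap between requests
        # How far ahead of schedule a client may run (burst - 1 extra requests).
        self.burst_tolerance = self.emission_interval * (burst - 1)
        self.clock = clock
        self.tat = {}  # key -> theoretical arrival time of the next request

    def allow(self, key: str) -> bool:
        now = self.clock()
        # An idle client's schedule catches up to the present.
        tat = max(self.tat.get(key, now), now)

        # The request is too early if it arrives before tat - burst_tolerance.
        if now < tat - self.burst_tolerance:
            return False

        # Admit and push the schedule forward by one emission interval.
        self.tat[key] = tat + self.emission_interval
        return True
```

With `rate_per_second=1` and `burst=3`, a client can send three requests at once and then one per second, which matches token bucket behaviour while storing a single number per key.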
Decision Tree 2: Choosing Where to Implement
START: Who are your clients?
Unknown / public / unauthenticated?
-> Add CDN/Edge rate limiting first (free with Cloudflare)
-> Add Nginx per-IP limits as backup
-> Add app-level per-user after they authenticate
Known developers / API customers?
-> API Gateway (Kong or AWS API Gateway)
-> Add app-level for business-logic limits
Internal services only?
-> Service mesh (Istio/Envoy) for service-to-service
-> App-level for user-facing
All of the above (typical production)?
-> Layer CDN + Nginx + App-level (defense in depth)
-> API Gateway only if developer portal is needed
Decision Tree 3: Fail Strategy
START: What happens if your rate limiter fails?
Is this endpoint security-critical?
(login, payment, account creation, admin operations)
-> Fail-Closed: better to be unavailable than exploited
-> Alert and page on-call immediately
Is this endpoint user-facing but not security-critical?
(dashboard, product pages, search, data queries)
-> Local fallback limiter
-> Fail-Open after local limiter exhausts
-> Alert but don't page
Is this endpoint a health check or monitoring endpoint?
-> Never rate limit at all
-> Always fail-open (or exempt from rate limiting entirely)
Is this endpoint internal or infrastructure?
-> Fail-Open with monitoring
-> Internal services calling each other can handle brief over-limits
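The fail-strategy tree can be captured as a lookup table so that the policy for each endpoint is an explicit, reviewable decision. A small sketch; the category names and the `FailMode` enum are hypothetical labels for the four branches above.

```python
from enum import Enum


class FailMode(Enum):
    CLOSED = "fail_closed"           # reject while the limiter is down
    LOCAL_THEN_OPEN = "local_open"   # local fallback first, then fail-open
    OPEN = "fail_open"               # allow while the limiter is down
    EXEMPT = "exempt"                # never rate limited at all


# One entry per branch of the decision tree.
FAIL_MODE_BY_CATEGORY = {
    "security_critical": FailMode.CLOSED,       # login, payment, admin
    "user_facing": FailMode.LOCAL_THEN_OPEN,    # dashboard, search
    "health_or_monitoring": FailMode.EXEMPT,    # probes, metrics
    "internal": FailMode.OPEN,                  # service-to-service
}


def fail_mode_for(category: str) -> FailMode:
    """Look up the failure behaviour; unknown categories must be classified."""
    try:
        return FAIL_MODE_BY_CATEGORY[category]
    except KeyError:
        raise ValueError(f"unclassified endpoint category: {category!r}") from None
```

Raising on an unknown category is a deliberate choice in this sketch: it forces a new endpoint to be classified rather than silently inheriting a default.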
Summary Reference Card
ALGORITHM SELECTION:
Default choice: Sliding Window Counter (accuracy + memory balance)
Need bursting: Token Bucket
Security-critical: Sliding Window Log (exact)
Traffic shaping: Leaky Bucket
Memory-minimal: GCRA (1 float per user)
PLACEMENT SELECTION:
Public API protection: CDN + Load Balancer
User-level limits: Application middleware + Redis
Developer platform: API Gateway
Service-to-service: Service Mesh
STORAGE SELECTION:
Standard: Redis (best balance)
AWS Serverless: DynamoDB
Local only: In-process (single instance only)
FAILURE STRATEGY:
Security endpoints: Fail-Closed
General endpoints: Local fallback -> Fail-Open
Health/monitoring: Always Open (exempt from limiting)
IDENTIFIER PRIORITY:
Best: Authenticated User ID
Good: API Key hash
Fair: IP address (lower limit for unauthenticated traffic)
Emergency: Subnet or ASN-level
Next Supplement: Supplement 4 - Architecture Patterns and Decision-Making