Rate Limiting - Supplement 3: Trade-Offs and Decision Guide

A comprehensive decision guide for every major rate limiting choice.
Each section answers "when should I choose A over B?" with concrete criteria,
trade-off tables, and real-world context.

When to Rate Limit (and When NOT to)
Where to Place the Rate Limiter: Full Decision Matrix
Algorithm Trade-Offs: Deep Comparison
Storage Backend Trade-Offs
Centralized vs Decentralized vs Hybrid
Accuracy vs Memory vs Latency Triangle
Hard Limit vs Soft Limit vs Throttle vs Queue
Per-User vs Per-IP vs Per-Key vs Global
Fail-Open vs Fail-Closed: The Full Analysis
Rate Limiting Library vs Build Your Own
Infrastructure Cost Comparison
Trade-Off Decision Trees

1. When to Rate Limit (and When NOT to)

Always Rate Limit These

Scenario	Reason	Minimum Recommendation
Any public-facing API	Unknown clients, potential abuse	Always
Authentication endpoints (login, password reset)	Brute force prevention	5-10 per minute per IP+username
Payment / financial operations	Fraud prevention, cost control	10-25 per minute per user
Expensive compute endpoints (ML, search, reports)	CPU/GPU cost protection	2-10 per hour per user
File upload/download endpoints	Bandwidth cost control	50-100 per hour per user
Webhook/notification sending	Downstream service protection	10-100 per minute per destination
Email/SMS sending endpoints	Spam prevention + cost	10-30 per hour per user
Third-party API calls you relay	Respect upstream limits	Match upstream limit
WebSocket connections	Connection pool protection	5-10 concurrent per user

Rate Limit Carefully (Context Dependent)

Scenario	When to Limit	When to Exempt or Raise
Internal service-to-service calls	When called service is shared and expensive	When caller is the only consumer
Batch processing endpoints	Always, but with much higher limits	Never exempt entirely
Read-only cached endpoints	IP-based flood protection only	Authenticated reads can be very high
Admin endpoints	Yes - even admins can have bugs	Admins get higher limits, not none
Health check endpoints	Never - must always be accessible	Always exempt from rate limits
Metrics/monitoring endpoints	Never - monitoring must always work	Always exempt

When NOT to Rate Limit (or Exempt Entirely)

DO NOT rate limit:
  - /health, /ping, /ready, /live  (k8s probes, load balancer health checks)
  - /metrics (Prometheus scraping)
  - Internal monitoring agents
  - Your own CDN origin pull requests
  - Pre-flight CORS OPTIONS requests (or use very high limit)

Why: These endpoints being rate limited causes:
  - Load balancers removing healthy instances from rotation
  - Kubernetes killing pods that fail liveness/readiness probes
  - Monitoring going blind right when you need it most (during incidents)
  - CDNs serving errors or stale content when throttled origin pulls cannot refresh the cache

Implementation:

# Exemption list - checked BEFORE rate limiting
EXEMPT_PATHS = frozenset([
    "/health",
    "/ping",
    "/ready",
    "/live",
    "/metrics",
    "/favicon.ico",
])
 
EXEMPT_PATH_PREFIXES = frozenset([
    "/internal/monitoring/",
    "/actuator/",          # Spring Boot actuator
    "/_ah/",               # Google App Engine health
])
 
def should_rate_limit(request) -> bool:
    path = request.path
    if path in EXEMPT_PATHS:
        return False
    if any(path.startswith(p) for p in EXEMPT_PATH_PREFIXES):
        return False
    return True

2. Where to Place the Rate Limiter: Full Decision Matrix

The Seven-Layer Model

Layer 1:  DNS / GeoDNS         - Geographic blocking, routing
Layer 2:  CDN / Edge (Cloudflare, Akamai, Fastly) - DDoS, IP blocking
Layer 3:  Load Balancer (Nginx, HAProxy, AWS ALB)  - IP rate limiting
Layer 4:  API Gateway (Kong, AWS API GW, Apigee)   - API key / user limits
Layer 5:  Application Middleware / Filter           - Business logic limits
Layer 6:  Service Mesh Sidecar (Envoy, Istio)      - Service-to-service limits
Layer 7:  Database / Queue / External Service       - Resource-level limits

Layer-by-Layer Trade-Off Analysis

Layer 2: CDN / Edge

Dimension	Value
Latency added	0ms (happens before request reaches origin)
Context available	IP address, HTTP method, path, headers
Context NOT available	User identity, session, business logic
Best for	DDoS mitigation, geographic blocking, bot filtering, IP floods
Worst for	User-level limits, subscription tier enforcement
Cost to configure	Low (rules-based UI)
Cost if wrong	Medium (can block legitimate users)
Example: Cloudflare Rule	"Block IP if >1000 requests/5 minutes to /api/*"

Decision: Use Layer 2 if...
  You have a public-facing API with unknown clients
  You are being DDoS'd or scraped at high volume
  You need to block a geographic region for compliance
  You want defense before traffic reaches your servers

Do NOT rely on Layer 2 alone if...
  You need per-user or per-API-key rate limiting
  You need business logic (user tier, endpoint cost)
  You have users behind shared IPs (corporate, mobile)

Layer 3: Load Balancer

Dimension	Value
Latency added	0-1ms
Context available	IP, port, HTTP method, path
Context NOT available	Auth tokens, user identity, business context
Best for	Per-IP rate limiting across all upstream servers
Worst for	User-aware limits
Configuration complexity	Low (Nginx config)
State storage	Nginx shared memory (local only) or Lua (OpenResty) + Redis

# Nginx: Per-IP, per-endpoint rate limiting
limit_req_zone $binary_remote_addr zone=login:10m rate=5r/m;
limit_req_zone $binary_remote_addr zone=api:10m rate=100r/m;
limit_req_zone $http_x_api_key zone=apikey:20m rate=1000r/m;
 
location /api/auth/login {
    limit_req zone=login burst=3 nodelay;
    limit_req_status 429;
    proxy_pass http://backend;
}
location /api/ {
    limit_req zone=api burst=20 nodelay;
    limit_req_status 429;  # without this, Nginx rejects with its default 503
    proxy_pass http://backend;
}

When Nginx rate limiting is sufficient (no Redis needed):

IP-based limits only
Simple DDoS prevention
No per-user or per-key logic needed
Single load balancer (no need to share state)

Layer 4: API Gateway

Dimension	Value
Latency added	1-10ms (extra hop)
Context available	API keys, routes, consumers, plans
Context NOT available	Business logic (user tier in your DB)
Best for	API product monetization, developer portals, standard SaaS API
Worst for	Complex business logic limits, per-endpoint cost analysis
State storage	Gateway's own Redis / database

Choose API Gateway (Layer 4) when:
  - You are building an API product (developers are your customers)
  - You have multiple APIs and want one rate limit policy
  - You want self-service developer portal with usage dashboards
  - You are on a managed platform (AWS API Gateway, Azure APIM)

Choose Application layer (Layer 5) instead when:
  - Your rate limits depend on business logic (tier from your DB)
  - Rate limits vary per endpoint based on computational cost
  - You want fine-grained control without vendor lock-in
  - Your API is internal (not developer-facing)

Layer 5: Application Middleware

Dimension	Value
Latency added	1-20ms (Redis round trip)
Context available	Everything (user, tier, session, business state)
State storage	Redis (required for distributed)
Best for	Fine-grained business-logic-aware rate limiting
Worst for	Very high RPS where every ms matters
Flexibility	Highest - full code control
Maintenance	Your team owns it

# Full context available at Layer 5
def rate_limit_middleware(request: Request) -> Optional[Response]:
    user = request.state.user
    endpoint = request.url.path
    http_method = request.method
 
    # Business context available:
    limit = get_limit_for(
        tier=user.subscription_tier,        # from your DB (cached in Redis)
        endpoint=endpoint,
        method=http_method,
        cost=calculate_endpoint_cost(endpoint, request.json()),
        is_premium=user.is_premium,
        trust_level=user.trust_level
    )
    return check_and_respond(request, limit)

Layer 6: Service Mesh

Dimension	Value
Latency added	0-2ms (sidecar is local)
Context available	Service identity, request metadata
Context NOT available	Business logic
Best for	Inter-service rate limiting, protecting services from other services
Worst for	Per-user or per-consumer limits
Requires	Istio, Envoy, or Linkerd deployed

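# Istio (abridged): cap Service A at 100 RPS to Service B.
# A deployable EnvoyFilter also needs a listener match, a typed_config "@type"/type_url,
# and a stat_prefix; the local token bucket is enforced per sidecar, not cluster-wide.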
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: service-a-to-b-ratelimit
spec:
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_OUTBOUND
        cluster:
          service: service-b.default.svc.cluster.local
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.local_ratelimit
          typed_config:
            token_bucket:
              max_tokens: 100
              tokens_per_fill: 100
              fill_interval: 1s

Where-to-Place Decision Matrix

Requirement	Layer 2 CDN	Layer 3 LB	Layer 4 GW	Layer 5 App	Layer 6 Mesh
DDoS protection	Best	Good	Poor	Poor	N/A
IP-based rate limiting	Best	Best	Good	Good	N/A
User-based rate limiting	No	No	Partial	Best	No
API key rate limiting	Limited	Limited	Best	Best	No
Subscription tier limits	No	No	Partial	Best	No
Service-to-service limits	No	No	No	Good	Best
GraphQL cost limiting	No	No	No	Best	No
Custom business logic	No	No	No	Best	No
Latency overhead	None	Minimal	Low-Medium	Medium	Low
Vendor lock-in risk	High	Low	Medium-High	None	Medium

Recommended combination for most production systems:

Layer 2 (CDN)   : DDoS + IP flood protection
Layer 3 (Nginx) : Rate limit at 10x the app limit (backstop)
Layer 5 (App)   : Per-user business logic limits with Redis

Skip Layer 4 (API Gateway) unless you are building a developer platform.
Skip Layer 6 (Service Mesh) unless you are in a mature microservices org.

3. Algorithm Trade-Offs: Deep Comparison

Fixed Window vs Sliding Window Counter

Criterion	Fixed Window	Sliding Window Counter
Memory	1 counter per window per user	2 counters per user (current + previous)
Accuracy	Lower (boundary bug)	High (~0.1% error)
Implementation complexity	Very low	Low
Redis commands	INCR, EXPIRE	INCR x2, GET x2, EXPIRE x2 (Lua)
Max spike at boundary	2x the limit	~1.05x the limit
Appropriate for...	Internal APIs, quick stopgap protection	Most production APIs

Choose Fixed Window when:

Simplicity is paramount (prototype, quick fix)
The boundary window attack is acceptable (internal, trusted callers)
Memory is extremely constrained
The limit is loose enough that 2x burst is fine

Choose Sliding Window Counter when:

External or public-facing API
Accurate limit enforcement matters
Building a developer platform where customers count exact requests
Limit is tight (e.g., 5 requests/minute for a sensitive endpoint)
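
The boundary behaviour that separates the two algorithms is easier to see in code. The following is a minimal in-memory sketch of a sliding window counter for a single key, not a production implementation. The class name SlidingWindowCounter and its methods are illustrative; a distributed version would keep the two counters in Redis behind a Lua script, as the table notes.

```python
class SlidingWindowCounter:
    """In-memory sliding window counter for ONE key (illustrative, not thread-safe)."""

    def __init__(self, limit: int, window: int):
        self.limit = limit
        self.window = window        # window length in seconds
        self.current_id = 0         # index of the current fixed window
        self.current_count = 0      # requests seen in the current window
        self.previous_count = 0     # requests seen in the window before it

    def _roll(self, now: float) -> None:
        window_id = int(now // self.window)
        if window_id == self.current_id:
            return
        # Advancing by exactly one window keeps the old count as "previous";
        # a longer gap means the previous window had no traffic at all.
        if window_id == self.current_id + 1:
            self.previous_count = self.current_count
        else:
            self.previous_count = 0
        self.current_count = 0
        self.current_id = window_id

    def estimate(self, now: float) -> float:
        """Weighted estimate of requests in the last `window` seconds."""
        self._roll(now)
        elapsed_fraction = (now % self.window) / self.window
        return self.previous_count * (1.0 - elapsed_fraction) + self.current_count

    def allow(self, now: float) -> bool:
        if self.estimate(now) >= self.limit:
            return False
        self.current_count += 1
        return True
```

With limit=10 and a 60-second window, ten requests at t=59 use up the budget, and a request at t=60 is still rejected because the previous window carries full weight at that instant. A fixed window would have reset at t=60 and admitted another ten, which is the 2x boundary spike listed in the table.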

Token Bucket vs Leaky Bucket

Criterion	Token Bucket	Leaky Bucket
Burst support	YES - full capacity burst	NO - always fixed rate
Output smoothness	Variable (bursty)	Perfectly smooth
Memory	O(1): tokens + last_refill	O(capacity): queue
Complexity	Low	Medium
Latency added	None (instant decision)	Adds latency (request waits in queue)
CPU impact	None	Queue management overhead
Appropriate for...	User-facing APIs	Traffic shaping, DB protection

Choose Token Bucket when:

Users naturally have bursty access patterns (open app, load feed = 20 requests at once)
You want to allow short bursts while limiting sustained throughput
Low-latency decisions are required (no queuing)
This is the correct choice for 90% of API rate limiting

Choose Leaky Bucket when:

You are protecting a downstream service that MUST receive smooth traffic
(e.g., a payment gateway that fails on spikes, an ML inference service with strict SLA)
You are doing network traffic shaping (not HTTP APIs)
You want to convert bursty inbound traffic into smooth outbound traffic
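
To make the burst behaviour concrete, here is a minimal single-key token bucket sketch. The names TokenBucket, capacity, and refill_rate are chosen for illustration, and the clock is passed in explicitly so the behaviour is easy to follow; a real deployment would typically keep tokens and last_refill in shared storage.

```python
class TokenBucket:
    """Single-key token bucket (illustrative, not thread-safe)."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity          # maximum burst size
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = capacity            # start full so an initial burst is allowed
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Lazily refill based on elapsed time; no background timer is needed.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = max(self.last_refill, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The state is just two numbers, which is the O(1) memory in the table. A bucket with capacity=20 and refill_rate=1 admits a 20-request burst immediately and then settles to one request per second, with no request ever waiting in a queue.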

Sliding Window Log vs Sliding Window Counter

Criterion	Sliding Window Log	Sliding Window Counter
Memory	O(limit) per user	O(1) per user
Accuracy	Perfect	~0.1% error
Redis structure	Sorted Set (ZADD)	String (GET/INCR)
Redis commands per request	ZREMRANGEBYSCORE, ZADD, ZCARD	GET x2, INCR, EXPIRE x2 (Lua)
Memory at scale (limit=1000, 1M users)	~8 GB of raw 8-byte timestamps (more with sorted-set overhead)	~30-60 MB
Appropriate for...	Low-limit, high-security endpoints	General API endpoints

Choose Sliding Window Log when:

Limit is very low (5-20 requests/minute) so memory cost is negligible
Perfect accuracy is required (authentication, payment, compliance)
You can afford the memory cost

Choose Sliding Window Counter when:

Limit is high (100+ per minute) and memory matters
You have many users (100K+)
0.1% accuracy margin is acceptable (it is for 99% of use cases)
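
The O(limit) memory cost of the log comes directly from storing one timestamp per admitted request. A minimal in-memory sketch follows, assuming a single key and an explicitly supplied clock; the class name SlidingWindowLog is illustrative, and the Redis version would use the sorted-set commands listed in the table instead of a deque.

```python
from collections import deque


class SlidingWindowLog:
    """Single-key sliding window log (illustrative, not thread-safe)."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window            # window length in seconds
        self.timestamps = deque()       # one entry per admitted request

    def allow(self, now: float) -> bool:
        # Drop entries that have aged out of the window.
        cutoff = now - self.window
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False                # rejected requests are not logged here
        self.timestamps.append(now)
        return True
```

The deque never holds more than limit entries per key, so a limit of 5 costs almost nothing while a limit of 1000 multiplied across many users becomes the dominant memory cost.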

4. Storage Backend Trade-Offs

Redis vs DynamoDB vs In-Memory vs PostgreSQL

Property	Redis	DynamoDB	In-Memory	PostgreSQL
Latency	0.1-2ms	1-10ms	<0.1ms	2-20ms
Throughput	100K-1M ops/sec	40K-400K WCU/sec	Millions/sec	10K-100K/sec
Consistency	Strong (single node)	Eventual (default)	Local only	Strong
Persistence	Configurable (RDB/AOF)	Always persistent	No (lost on restart)	Always
Auto-expiry	YES (EXPIRE command)	YES (TTL attribute)	Via code	Via cron
Distributed	Yes (Cluster)	Yes (managed)	No	Yes (but slow)
Atomic operations	Yes (INCR, Lua)	Conditional writes	Yes (locked)	Yes (transactions)
Cost at scale	Infrastructure cost	Per-operation cost	Cheapest	Infrastructure
Managed service	ElastiCache	DynamoDB	N/A	RDS

Choose Redis when:

Highest throughput requirement (>10K RPS rate limit checks)
Low latency is critical (adds <2ms to request)
You need Lua scripting for complex atomic operations
You already have Redis in your infrastructure

Choose DynamoDB when:

You are all-in on AWS serverless
You want zero infrastructure management
You can tolerate ~5ms rate limit check latency
You prefer pay-per-use pricing (no Redis instance to manage)

DynamoDB rate limiter:

import time

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("rate_limits")

def is_allowed_dynamodb(user_id: str, limit: int, window: int) -> bool:
    """Fixed window: atomically increment the counter unless it is already at the limit."""
    now = int(time.time())
    window_id = now // window
    expires_at = (window_id + 2) * window  # TTL for DynamoDB auto-cleanup

    try:
        table.update_item(
            Key={"pk": f"rl:{user_id}:{window_id}"},
            UpdateExpression="SET #cnt = if_not_exists(#cnt, :zero) + :one, #ttl = :ttl",
            # "count" and "ttl" are reserved words, so both are aliased.
            # The condition is a raw expression string so it can use the #cnt alias;
            # Attr("#cnt") would look for an attribute literally named "#cnt".
            ConditionExpression="attribute_not_exists(#cnt) OR #cnt < :limit",
            ExpressionAttributeNames={"#cnt": "count", "#ttl": "ttl"},
            ExpressionAttributeValues={
                ":zero": 0, ":one": 1, ":ttl": expires_at, ":limit": limit,
            },
        )
        return True
    except dynamodb.meta.client.exceptions.ConditionalCheckFailedException:
        return False  # limit exceeded

Choose In-Memory when:

Single instance deployment
Sticky sessions guaranteed
Rate limit state loss on restart is acceptable
Maximum possible throughput needed (local ML inference services, etc.)

5. Centralized vs Decentralized vs Hybrid

Full Trade-Off Analysis

Property	Centralized (Redis)	Decentralized (Local)	Hybrid
Accuracy	Exact	Per-server (N instances = N*limit)	Approximate (configurable)
Latency overhead	+1-5ms (Redis RTT)	0ms (in-process)	<1ms avg
Failure mode	Redis down = fail-open	Server down = unaffected	Degrades to local
Implementation	Moderate	Simple	Complex
Works with horizontal scale	Yes	No (each server is independent)	Yes
Works with serverless	Yes	Partially (per-container)	Yes
Memory efficiency	Centralized	Distributed (duplicated)	Centralized with local cache

When to Use Each

Centralized:

Use when:
- Exact enforcement is required (payments, authentication, compliance)
- You have more than 2 application instances
- You need consistent limits across all servers
- You already have Redis infrastructure

Acceptable drawback:
- +1-5ms per request (this is the price of accuracy)

Decentralized (Local):

Use when:
- Single-instance service (dev, small internal tool)
- Rate limiting the SERVICE (not individual users) - "protect this server"
- Outbound rate limiting (your service calling external APIs)
- You cannot afford Redis infrastructure

Not acceptable when:
- Multiple instances (limits multiply by instance count)
- You need accurate per-user enforcement

Hybrid:

Use when:
- Very high RPS where Redis latency is a concern (>50K RPS)
- You want to cut Redis load by roughly 80-90%
- Approximate limits are acceptable (overshoot of roughly 10-30% in edge cases)

Example at 100K RPS, limit=100/min, 10 servers:
  Decentralized: each server enforces 100/min -> effective = 1000/min (wrong!)
  Centralized: 100K Redis ops/sec (Redis CPU: ~60%)
  Hybrid: Each server reserves 30 locally, checks global every 1 second
    -> 10K Redis sync ops/sec instead of 100K (90% Redis load reduction)
    -> Accuracy: within 30% of limit (acceptable for non-critical)
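
One way to picture the hybrid idea is batch reservation: each server asks the shared store for a block of requests and then serves from that block locally. The sketch below is a variant of the scheme in the example above, not the exact "reserve locally, sync every second" design; GlobalQuota is a hypothetical in-memory stand-in for Redis, and window resets are omitted to keep it short.

```python
class GlobalQuota:
    """Hypothetical stand-in for the shared store (Redis in the text)."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0
        self.calls = 0      # round trips to the shared store, to show the load reduction

    def reserve(self, n: int) -> int:
        """Grant up to n requests from whatever remains of the global limit."""
        self.calls += 1
        granted = max(0, min(n, self.limit - self.used))
        self.used += granted
        return granted


class HybridLimiter:
    """Per-server limiter that serves from a locally held batch of permits."""

    def __init__(self, quota: GlobalQuota, batch_size: int):
        self.quota = quota
        self.batch_size = batch_size
        self.local_tokens = 0

    def allow(self) -> bool:
        if self.local_tokens == 0:
            # Only this branch touches the shared store.
            self.local_tokens = self.quota.reserve(self.batch_size)
        if self.local_tokens == 0:
            return False
        self.local_tokens -= 1
        return True
```

With a batch size of 10, one hundred admitted requests cost ten round trips to the shared store instead of one hundred. The trade-off in this variant is fairness rather than overshoot: permits reserved by an idle server are unavailable to busy ones until the window resets.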

6. Accuracy vs Memory vs Latency Triangle

There is a fundamental triangle trade-off. You cannot optimize all three simultaneously.

                    ACCURACY
                    (exact limits)
                    /\
                   /  \
                  /    \
                 /      \
                /        \
               /   Pick   \
              /    Any 2   \
             /______________\
      MEMORY                LATENCY
  (O(1) per user)      (minimal overhead)

The Trade-Off Explained

Accuracy + Low Memory, Sacrificing Latency:

Algorithm: Centralized Redis Sliding Window Counter (Lua script)
Memory: O(1) per user (good)
Accuracy: ~0.1% error (good)
Latency: +2-5ms per request (Redis round trip, cannot avoid)

Accuracy + Low Latency, Sacrificing Memory:

Algorithm: In-process Sliding Window Log
Memory: O(limit) per user (poor at scale)
Accuracy: Perfect
Latency: Sub-millisecond (in-process, no Redis)

Low Memory + Low Latency, Sacrificing Accuracy:

Algorithm: In-process Token Bucket (no Redis)
Memory: O(1) per user (good)
Accuracy: Per-server only (poor for distributed)
Latency: Sub-millisecond

The Sweet Spot for Most Production Systems:

Algorithm: Redis Sliding Window Counter (Lua script)
Accept: +2ms latency for Redis
Get:    High accuracy (~0.1% error), O(1) memory, works distributed

This covers 95% of production use cases. Only optimize further when:
  - Latency SLA is extremely tight (<10ms total response time)
  - Scale is extreme (>100K RPS)

Quantifying the Trade-Offs

At 10,000 RPS with 100,000 users, limit=100/min:

Option 1: Redis Sliding Window Counter (Lua)
  Memory:   100,000 x 2 keys x 30 bytes = 6 MB Redis
  Latency:  +2ms per request (Redis RTT)
  Accuracy: 99.9% (0.1% error from weighting)
  Cost:     1 Redis node, ~$50-100/month

Option 2: Redis Sliding Window Log (Sorted Set)
  Memory:   100,000 x 100 entries x 50 bytes = 500 MB Redis
  Latency:  +3ms per request (more Redis commands)
  Accuracy: 100% (perfect)
  Cost:     1 Redis node with more RAM, ~$150-200/month

Option 3: In-Process Token Bucket
  Memory:   100,000 x 16 bytes = 1.6 MB per server x 10 servers = 16 MB
  Latency:  +0.01ms (in-process)
  Accuracy: Poor when distributed (each of the 10 servers enforces the full limit, so the effective limit is up to limit*10)
  Cost:     $0 extra (no Redis needed)
  Problem:  Users can make 1000/min instead of 100/min

Option 4: Hybrid (Local + Redis)
  Memory:   16 MB local + 6 MB Redis
  Latency:  +0.01ms for 90% of requests, +2ms for 10%
  Accuracy: ~95% (allows up to 30% over-limit in edge cases)
  Cost:     1 Redis node, ~$50/month
  Best for: Non-critical endpoints at extreme scale
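
The memory figures above are simple products, so they are easy to re-derive for other user counts and limits. The helper functions below are a sketch for doing that; the function names are made up for this example, and the per-key and per-entry byte sizes are the rough estimates used in the options above rather than measured values.

```python
MB = 1_000_000


def counter_memory_bytes(users: int, keys_per_user: int = 2, bytes_per_key: int = 30) -> int:
    """Sliding window counter: a fixed number of small keys per user."""
    return users * keys_per_user * bytes_per_key


def log_memory_bytes(users: int, limit: int, bytes_per_entry: int = 50) -> int:
    """Sliding window log: worst case is one entry per allowed request per user."""
    return users * limit * bytes_per_entry


def in_process_memory_bytes(users: int, servers: int, bytes_per_bucket: int = 16) -> int:
    """In-process token bucket: every server may hold a bucket for every user."""
    return users * bytes_per_bucket * servers


def effective_limit(limit: int, servers: int) -> int:
    """Worst-case limit when each server enforces the full limit independently."""
    return limit * servers
```

For 100,000 users and limit=100 these reproduce the 6 MB, 500 MB, and 16 MB figures, and effective_limit(100, 10) gives the 1000/min problem noted under Option 3.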

7. Hard Limit vs Soft Limit vs Throttle vs Queue

When to Use Each Response Strategy

Strategy	Behavior	Response	When to Use
Hard Limit	Reject immediately	429	Security-critical, payment APIs
Soft Limit	Allow with warning	200 + warning header	Developer APIs, gradual enforcement
Throttle (delay)	Queue and delay	200 (after wait)	Background jobs, batch processing
Queue (async)	Accept, process later	202 Accepted	Long-running ops, webhook dispatch
Shed (degrade)	Return cached/degraded	200 (partial)	High availability priority

Decision Framework

Question 1: Can the request be safely rejected?
  If NO (e.g., user is in a checkout flow, payment in flight):
    Use Soft Limit or Queue, not Hard Limit
  If YES:
    Question 2: Is this a security-critical endpoint?
      If YES (login, payment, delete): Hard Limit
      If NO: Question 3

Question 3: Is the caller a human or a machine?
  Human: Use Soft Limit (warn before cutting off)
  Machine: Use Hard Limit or Throttle (machines should handle 429)

Question 4: Can the request be deferred?
  YES (report generation, bulk export, email send): Use Queue (202 Accepted)
  NO (real-time query, user is waiting): Hard Limit
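
The questions above can be encoded as a small function. This is one possible reading of the framework, taking the questions strictly in order and treating Question 4 as the tie-breaker for machine callers; the Strategy enum and the parameter names are assumptions made for this sketch.

```python
from enum import Enum


class Strategy(Enum):
    HARD_LIMIT = "reject with 429"
    SOFT_LIMIT = "allow with warning header"
    QUEUE = "accept with 202 and process later"


def choose_strategy(
    safe_to_reject: bool,
    security_critical: bool,
    caller_is_human: bool,
    deferrable: bool,
) -> Strategy:
    # Question 1: requests that cannot be safely rejected never get a hard limit.
    if not safe_to_reject:
        return Strategy.QUEUE if deferrable else Strategy.SOFT_LIMIT
    # Question 2: security-critical endpoints are always hard-limited.
    if security_critical:
        return Strategy.HARD_LIMIT
    # Question 3: humans get a warning before being cut off.
    if caller_is_human:
        return Strategy.SOFT_LIMIT
    # Question 4: machine callers are queued if the work can wait, else rejected.
    return Strategy.QUEUE if deferrable else Strategy.HARD_LIMIT
```

Throttling by delay is left out here because the framework offers it only as an alternative to a hard limit for machine callers; it could be added as a further branch.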

Implementing Degraded Response (Graceful Degradation)

class GracefulRateLimiter:
    """
    Instead of hard-rejecting, return stale or partial data when over limit.
    Use only for read endpoints where stale data is acceptable.
    """
 
    def __init__(self, limiter, cache):
        self.limiter = limiter
        self.cache = cache
 
    def handle_request(self, request, user_id: str) -> Response:
        result = self.limiter.check(user_id)
 
        if result["allowed"]:
            # Normal path: fresh data
            data = fetch_fresh_data(request)
            self.cache.set(request.path, data, ttl=60)
            return Response(data, headers=self._rl_headers(result))
 
        # Over limit: try cached/stale data
        stale = self.cache.get_stale(request.path)
        if stale:
            return Response(
                stale,
                headers={
                    **self._rl_headers(result),
                    "X-Cache": "STALE",
                    "X-RateLimit-Degraded": "true",
                    "Warning": '199 - "Response may be stale due to rate limiting"'
                }
            )
 
        # No stale data available: hard reject
        return Response(
            {"error": "rate_limit_exceeded"},
            status=429,
            headers=self._rl_headers(result)
        )

8. Per-User vs Per-IP vs Per-Key vs Global

When Each Is the Right Identifier

Identifier	Granularity	Authentication Required	Best For	Pitfalls
Per-User ID	Individual	Yes	All authenticated APIs	None (best option)
Per-API Key	Per credential	Yes (key-based)	Developer APIs, M2M	Key leakage, shared keys
Per-IP	IP-level	No	Unauthenticated endpoints	NAT, CGN, proxies
Per-IP + User-Agent	Better than IP	No	Unauthenticated + bot detect	Easily spoofed
Per-IP Subnet (/24)	Subnet-level	No	CGN, corporate networks	Whole company impacted
Global	System-wide	No	Infrastructure protection	Unfair (one user affects all)

The Layered Identifier Strategy

def get_rate_limit_identifiers(request) -> list[tuple[str, int]]:
    """
    Return list of (identifier, limit) pairs.
    ALL must pass for the request to be allowed.
    Each provides a different layer of protection.
    """
    identifiers = []
    window = 60  # 1 minute window for all
 
    # Layer 1: Per-user (most specific, highest limit)
    user_id = extract_user_id(request)
    if user_id:
        tier_limit = get_tier_limit(user_id)
        identifiers.append((f"user:{user_id}", tier_limit))
 
    # Layer 2: Per-API-key (for machine clients)
    api_key = request.headers.get("X-API-Key")
    if api_key:
        identifiers.append((f"apikey:{hash_key(api_key)}", 5000))
 
    # Layer 3: Per-IP (coarse, protects against unauthenticated floods)
    ip = get_real_ip(request)
    ip_limit = 1000 if is_cgn_ip(ip) else 200  # higher for NAT
    identifiers.append((f"ip:{ip}", ip_limit))
 
    # Layer 4: Global (system-wide protection)
    identifiers.append(("global:system", 1_000_000))
 
    return identifiers

The Fallback Chain

Best case: Authenticated user -> rate limit by user ID
Good:      API key present -> rate limit by key hash
Fair:      No auth, regular IP -> rate limit by IP (lower limit)
Coarse:    No auth, CGN/corporate IP -> rate limit by IP with higher limit
Emergency: DDoS detected -> temporary geographic block at CDN
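
The "ALL must pass" rule in the layered strategy has one subtlety: a request rejected by one layer should not consume budget in the others. A minimal sketch of a two-pass check follows, assuming in-memory fixed-window counters; MultiKeyLimiter is a hypothetical name, and a Redis-backed version would need a Lua script or transaction to keep both passes atomic.

```python
from typing import Optional


class MultiKeyLimiter:
    """Checks several (identifier, limit) pairs; illustrative, not thread-safe."""

    def __init__(self, window: int = 60):
        self.window = window
        self.counts: dict[tuple[str, int], int] = {}   # (identifier, window_id) -> count

    def allow(
        self, identifiers: list[tuple[str, int]], now: float
    ) -> tuple[bool, Optional[str]]:
        window_id = int(now // self.window)
        # Pass 1: check every layer without consuming anything.
        for identifier, limit in identifiers:
            if self.counts.get((identifier, window_id), 0) >= limit:
                return False, identifier          # report which layer blocked
        # Pass 2: every layer passed, so charge the request to each of them.
        for identifier, _ in identifiers:
            key = (identifier, window_id)
            self.counts[key] = self.counts.get(key, 0) + 1
        return True, None
```

Returning the blocking identifier makes it possible to log whether users are hitting their own limit or a shared IP limit. Counters for old windows are never evicted in this sketch, so a long-running process would need cleanup.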

9. Fail-Open vs Fail-Closed: The Full Analysis

Decision Matrix by Endpoint Type

Endpoint Type	Risk of Fail-Open	Risk of Fail-Closed	Recommendation
Public read API	Low	Medium	Fail-Open
User dashboard	Low	Medium	Fail-Open
Search API	Low	Medium	Fail-Open
Payment/charge API	Very High	Low	Fail-Closed
Authentication/login	High	Low	Fail-Closed
Account creation	High	Medium	Fail-Closed
Admin operations	High	Low	Fail-Closed
Password reset	High	Low	Fail-Closed
File upload	Medium	Medium	Local fallback
Report generation	Low	High (expensive)	Fail-Open with quota
Internal health check	N/A	Very High (monitoring blind)	Always Open

The Local Fallback Strategy (Best of Both)

from enum import Enum
import time
 
class FailPolicy(Enum):
    OPEN = "open"        # Allow all requests when Redis down
    CLOSED = "closed"    # Deny all requests when Redis down
    LOCAL = "local"      # Use local rate limiter as fallback
 
class ResilientRateLimiter:
    """
    Three-mode rate limiter:
    - HEALTHY: Use Redis (accurate, distributed)
    - DEGRADED: Use local limiter (approximate, single-instance)
    - FAILED: Fail-open or fail-closed based on policy
    """
 
    def __init__(
        self,
        redis_limiter,
        local_limiter,
        policy: FailPolicy = FailPolicy.LOCAL,
        failure_threshold: int = 3,
        recovery_timeout: int = 30
    ):
        self.redis_limiter = redis_limiter
        self.local_limiter = local_limiter
        self.policy = policy
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.circuit_opened_at = None
 
    def is_allowed(self, identifier: str) -> dict:
        # Check if circuit is open (Redis failed recently)
        if self.circuit_opened_at:
            if time.time() - self.circuit_opened_at > self.recovery_timeout:
                # Try to recover
                self.circuit_opened_at = None
                self.failure_count = 0
            else:
                return self._handle_degraded(identifier)
 
        try:
            result = self.redis_limiter.is_allowed(identifier)
            self.failure_count = 0  # Reset on success
            return result
 
        except Exception as e:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.circuit_opened_at = time.time()
                # Alert!
            return self._handle_degraded(identifier)
 
    def _handle_degraded(self, identifier: str) -> dict:
        if self.policy == FailPolicy.OPEN:
            return {"allowed": True, "mode": "fail_open", "reason": "redis_unavailable"}

        if self.policy == FailPolicy.CLOSED:
            return {"allowed": False, "mode": "fail_closed", "reason": "redis_unavailable"}
 
        # LOCAL fallback
        local_result = self.local_limiter.is_allowed(identifier)
        local_result["mode"] = "local_fallback"
        local_result["note"] = "approximate_limits"
        return local_result
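
The local_limiter passed to ResilientRateLimiter is not defined in this guide. One minimal possibility, assuming only that it exposes is_allowed(identifier) and returns a dict the wrapper can annotate, is an in-process fixed window counter. The class name, the clock parameter, and the remaining field are assumptions made for this sketch.

```python
import time
from typing import Callable


class LocalFixedWindowLimiter:
    """In-process fallback limiter (illustrative, not thread-safe)."""

    def __init__(self, limit: int, window: int = 60,
                 clock: Callable[[], float] = time.time):
        self.limit = limit
        self.window = window
        self.clock = clock                                # injectable for testing
        self._counts: dict[str, tuple[int, int]] = {}     # identifier -> (window_id, count)

    def is_allowed(self, identifier: str) -> dict:
        window_id = int(self.clock() // self.window)
        stored_id, count = self._counts.get(identifier, (window_id, 0))
        if stored_id != window_id:
            count = 0                                     # a new window has started
        allowed = count < self.limit
        if allowed:
            count += 1
        self._counts[identifier] = (window_id, count)
        return {"allowed": allowed, "remaining": max(0, self.limit - count)}
```

Because every instance enforces this limit independently while Redis is down, it may be worth setting the local limit to roughly the global limit divided by the instance count, so the combined fallback does not admit several times the intended traffic.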

10. Rate Limiting Library vs Build Your Own

Library Options by Language

Language	Library	Algorithm	Redis-backed	Notes
Java	Bucket4j	Token Bucket	Yes (Lettuce/Jedis)	Best Java option, production-grade
Java	Resilience4j RateLimiter	Fixed Window	No	Composes with its circuit breaker, annotation-driven
Java	Guava RateLimiter	Token Bucket	No	Google Guava, local only
Python	Flask-Limiter	Configurable	Yes (Redis)	Flask-specific
Python	slowapi	Configurable	Yes (Redis)	FastAPI-native
Python	limits	All algorithms	Yes	Framework-agnostic
Node.js	express-rate-limit	Fixed/Sliding	Yes (RedisStore)	Most popular
Go	golang.org/x/time/rate	Token Bucket	No	Official Go extended package (not stdlib), local only
Go	throttled	GCRA	Yes (Redis)	Similar to Stripe's approach
Ruby	rack-attack	Configurable	Yes (Redis)	Rack middleware

Build vs Buy Decision

Use a library when:

Standard use case (fixed window, token bucket, sliding window counter)
One of the supported languages above
Speed to production matters
You want community support and battle-tested code
Library supports your required storage backend

Build your own when:

You need custom algorithm behavior not supported by any library
You need extreme performance optimization for your specific access patterns
You are integrating with a novel storage backend
You have compliance requirements that restrict open-source dependency usage
Custom rate limit logic (e.g., cost-based per-query complexity)

The real question: How much is custom logic?

Pure fixed window with Redis? -> Use library (Flask-Limiter / express-rate-limit)
Token bucket with Redis? -> Use Bucket4j or equivalent
Custom GraphQL cost analysis? -> Build the cost analyzer, use library for the bucket
Adaptive rate limiting with ML? -> Build it (no library does this)
Multi-dimensional with 5 limits? -> Build on top of library primitives

11. Infrastructure Cost Comparison

Scenario: 100,000 active users, 10,000 RPS peak, 100 requests/minute limit per user

Option A: Redis Standalone (AWS ElastiCache)

Redis instance: cache.m5.large ($0.10/hour = $72/month)
Memory needed: 100,000 users x 2 keys x 60 bytes = 12 MB
Redis CPU at 10K RPS: ~15% (well within limits)
Latency overhead: +2ms per request
Annual cost: ~$864

Option B: Redis Cluster (3 primary + 3 replica)

Redis cluster: 6x cache.t3.micro ($0.017/hour x 6 = $74/month)
Memory: 12 MB split across 3 primaries
CPU: 3% per shard
Latency: +2ms (same as standalone for this scale)
Annual cost: ~$888
Use when: Availability is critical, not when scale demands it at this size

Option C: DynamoDB On-Demand

Writes (INCR equivalent): 10,000 writes/sec x $0.0000012 per write = $0.012/second
If the 10K RPS peak were sustained: $0.012 x 2,592,000 seconds (30 days) = ~$31,100/month
Reads (0 - we don't read separately): $0
Storage: 12 MB x $0.25/GB/month = $0.003/month
Total: up to ~$31,100/month at sustained peak (scales down with average traffic, and scales automatically)
Annual cost: up to ~$373,000 at sustained peak (vs $864 for Redis)
Use when: Serverless architecture, no Redis allowed, cost is secondary

Option D: Nginx Local (no Redis, per-IP only)

Additional infrastructure: $0 (uses existing Nginx)
Limitation: Per-IP only, no per-user limits
Use when: IP-based limits are sufficient (public site protection)
Annual cost: $0 additional

Cost Summary:

Option	Monthly Cost	Latency	Accuracy	Scales to 1M users?
Redis Standalone	$72	+2ms	99.9%	Yes (upgrade instance)
Redis Cluster	$74	+2ms	99.9%	Yes (add shards)
DynamoDB On-Demand	Up to ~$31,100 (sustained peak)	+5ms	99.9%	Yes (auto)
Nginx Local	$0	+0ms	IP-only	No (IP-only)
In-Process	$0	+0ms	Per-instance	No (N*limit)

12. Trade-Off Decision Trees

Decision Tree 1: Choosing an Algorithm

START: What is your primary constraint?

Accuracy first? (security, payment, compliance)
  -> Low user count (<10K)? -> Sliding Window Log
  -> High user count (>10K)? -> Sliding Window Counter + Redis Lua (99.9% accurate)

Memory first? (very large user base, many keys)
  -> Need burst support? -> Token Bucket or GCRA
  -> No burst needed? -> Fixed Window or Sliding Window Counter

Latency first? (every millisecond matters)
  -> Can you accept per-server limits? -> In-process Token Bucket
  -> Need distributed accuracy? -> Hybrid (local cache + Redis)

Burst support is required?
  -> Need smooth output to downstream? -> Token Bucket + Leaky Bucket combination
  -> Just need burst allowance? -> Token Bucket

Simplicity first? (prototype, internal tool)
  -> Fixed Window Counter

Decision Tree 2: Choosing Where to Implement

START: Who are your clients?

Unknown / public / unauthenticated?
  -> Add CDN/Edge rate limiting first (free with Cloudflare)
  -> Add Nginx per-IP limits as backup
  -> Add app-level per-user after they authenticate

Known developers / API customers?
  -> API Gateway (Kong or AWS API Gateway)
  -> Add app-level for business-logic limits

Internal services only?
  -> Service mesh (Istio/Envoy) for service-to-service
  -> App-level for user-facing

All of the above (typical production)?
  -> Layer CDN + Nginx + App-level (defense in depth)
  -> API Gateway only if developer portal is needed

Decision Tree 3: Fail Strategy

START: What happens if your rate limiter fails?

Is this endpoint security-critical?
  (login, payment, account creation, admin operations)
  -> Fail-Closed: better to be unavailable than exploited
  -> Alert and page on-call immediately

Is this endpoint user-facing but not security-critical?
  (dashboard, product pages, search, data queries)
  -> Local fallback limiter
  -> Fail-Open after local limiter exhausts
  -> Alert but don't page

Is this endpoint a health check or monitoring endpoint?
  -> Never rate limit at all
  -> Always fail-open (or exempt from rate limiting entirely)

Is this endpoint internal or infrastructure?
  -> Fail-Open with monitoring
  -> Internal services calling each other can handle brief over-limits

Summary Reference Card

ALGORITHM SELECTION:
  Default choice:   Sliding Window Counter (accuracy + memory balance)
  Need bursting:    Token Bucket
  Security-critical: Sliding Window Log (exact)
  Traffic shaping:  Leaky Bucket
  Memory-minimal:   GCRA (1 float per user)

PLACEMENT SELECTION:
  Public API protection:  CDN + Load Balancer
  User-level limits:      Application middleware + Redis
  Developer platform:     API Gateway
  Service-to-service:     Service Mesh

STORAGE SELECTION:
  Standard:        Redis (best balance)
  AWS Serverless:  DynamoDB
  Local only:      In-process (single instance only)

FAILURE STRATEGY:
  Security endpoints: Fail-Closed
  General endpoints:  Local fallback -> Fail-Open
  Health/monitoring:  Always Open (exempt from limiting)

IDENTIFIER PRIORITY:
  Best:      Authenticated User ID
  Good:      API Key hash
  Fair:      IP address (limit adjusted for CGN/NAT ranges)
  Emergency: Subnet or ASN-level

Next Supplement: Supplement 4 - Architecture Patterns and Decision-Making

Series: Rate Limiting Demystified