6/6/2026 | Admin Post

Rate Limiting - Supplement 3: Trade-Offs and Decision Guide

Series Navigation:
Main Index |
Supplement 1 - Anti-Patterns Extended |
Supplement 2 - Production Challenges |
Supplement 4 - Architecture Patterns

A comprehensive decision guide for every major rate limiting choice.
Each section answers "when should I choose A over B?" with concrete criteria,
trade-off tables, and real-world context.


Table of Contents

  1. When to Rate Limit (and When NOT to)
  2. Where to Place the Rate Limiter: Full Decision Matrix
  3. Algorithm Trade-Offs: Deep Comparison
  4. Storage Backend Trade-Offs
  5. Centralized vs Decentralized vs Hybrid
  6. Accuracy vs Memory vs Latency Triangle
  7. Hard Limit vs Soft Limit vs Throttle vs Queue
  8. Per-User vs Per-IP vs Per-Key vs Global
  9. Fail-Open vs Fail-Closed: The Full Analysis
  10. Rate Limiting Library vs Build Your Own
  11. Infrastructure Cost Comparison
  12. Trade-Off Decision Trees

1. When to Rate Limit (and When NOT to)

Always Rate Limit These

| Scenario | Reason | Minimum Recommendation |
| --- | --- | --- |
| Any public-facing API | Unknown clients, potential abuse | Always |
| Authentication endpoints (login, password reset) | Brute force prevention | 5-10 per minute per IP+username |
| Payment / financial operations | Fraud prevention, cost control | 10-25 per minute per user |
| Expensive compute endpoints (ML, search, reports) | CPU/GPU cost protection | 2-10 per hour per user |
| File upload/download endpoints | Bandwidth cost control | 50-100 per hour per user |
| Webhook/notification sending | Downstream service protection | 10-100 per minute per destination |
| Email/SMS sending endpoints | Spam prevention + cost | 10-30 per hour per user |
| Third-party API calls you relay | Respect upstream limits | Match upstream limit |
| WebSocket connections | Connection pool protection | 5-10 concurrent per user |

Rate Limit Carefully (Context Dependent)

| Scenario | When to Limit | When to Exempt or Raise |
| --- | --- | --- |
| Internal service-to-service calls | When called service is shared and expensive | When caller is the only consumer |
| Batch processing endpoints | Always, but with much higher limits | Never exempt entirely |
| Read-only cached endpoints | IP-based flood protection only | Authenticated reads can be very high |
| Admin endpoints | Yes - even admins can have bugs | Admins get higher limits, not none |
| Health check endpoints | Never - must always be accessible | Always exempt from rate limits |
| Metrics/monitoring endpoints | Never - monitoring must always work | Always exempt |

When NOT to Rate Limit (or Exempt Entirely)

DO NOT rate limit:
  - /health, /ping, /ready, /live  (k8s probes, load balancer health checks)
  - /metrics (Prometheus scraping)
  - Internal monitoring agents
  - Your own CDN origin pull requests
  - Pre-flight CORS OPTIONS requests (or use very high limit)

Why: These endpoints being rate limited causes:
  - Load balancers removing healthy instances from rotation
  - Kubernetes killing pods that fail liveness/readiness probes
  - Monitoring going blind right when you need it most (during incidents)
  - CDNs caching or serving error responses when origin pulls are rejected with 429

Implementation:

# Exemption list - checked BEFORE rate limiting
EXEMPT_PATHS = frozenset([
    "/health",
    "/ping",
    "/ready",
    "/live",
    "/metrics",
    "/favicon.ico",
])
 
EXEMPT_PATH_PREFIXES = frozenset([
    "/internal/monitoring/",
    "/actuator/",          # Spring Boot actuator
    "/_ah/",               # Google App Engine health
])
 
def should_rate_limit(request) -> bool:
    path = request.path
    if path in EXEMPT_PATHS:
        return False
    if any(path.startswith(p) for p in EXEMPT_PATH_PREFIXES):
        return False
    return True

2. Where to Place the Rate Limiter: Full Decision Matrix

The Seven-Layer Model

Layer 1:  DNS / GeoDNS         - Geographic blocking, routing
Layer 2:  CDN / Edge (Cloudflare, Akamai, Fastly) - DDoS, IP blocking
Layer 3:  Load Balancer (Nginx, HAProxy, AWS ALB)  - IP rate limiting
Layer 4:  API Gateway (Kong, AWS API GW, Apigee)   - API key / user limits
Layer 5:  Application Middleware / Filter           - Business logic limits
Layer 6:  Service Mesh Sidecar (Envoy, Istio)      - Service-to-service limits
Layer 7:  Database / Queue / External Service       - Resource-level limits

Layer-by-Layer Trade-Off Analysis

Layer 2: CDN / Edge

| Dimension | Value |
| --- | --- |
| Latency added | 0ms (happens before request reaches origin) |
| Context available | IP address, HTTP method, path, headers |
| Context NOT available | User identity, session, business logic |
| Best for | DDoS mitigation, geographic blocking, bot filtering, IP floods |
| Worst for | User-level limits, subscription tier enforcement |
| Cost to configure | Low (rules-based UI) |
| Cost if wrong | Medium (can block legitimate users) |
| Example: Cloudflare Rule | "Block IP if >1000 requests/5 minutes to /api/*" |

Decision: Use Layer 2 if...
  You have a public-facing API with unknown clients
  You are being DDoS'd or scraped at high volume
  You need to block a geographic region for compliance
  You want defense before traffic reaches your servers

Do NOT rely on Layer 2 alone if...
  You need per-user or per-API-key rate limiting
  You need business logic (user tier, endpoint cost)
  You have users behind shared IPs (corporate, mobile)

Layer 3: Load Balancer

| Dimension | Value |
| --- | --- |
| Latency added | 0-1ms |
| Context available | IP, port, HTTP method, path |
| Context NOT available | Auth tokens, user identity, business context |
| Best for | Per-IP rate limiting across all upstream servers |
| Worst for | User-aware limits |
| Configuration complexity | Low (Nginx config) |
| State storage | Nginx shared memory (local only) or Lua + Redis |

# Nginx: Per-IP, per-endpoint rate limiting
# (limit_req_zone directives belong in the http block; the location blocks go inside a server block)
limit_req_zone $binary_remote_addr zone=login:10m rate=5r/m;
limit_req_zone $binary_remote_addr zone=api:10m rate=100r/m;
limit_req_zone $http_x_api_key zone=apikey:20m rate=1000r/m;
 
location /api/auth/login {
    limit_req zone=login burst=3 nodelay;
    limit_req_status 429;
    proxy_pass http://backend;
}
location /api/ {
    limit_req zone=api burst=20 nodelay;
    proxy_pass http://backend;
}

When Nginx rate limiting is sufficient (no Redis needed):

  • IP-based limits only
  • Simple DDoS prevention
  • No per-user or per-key logic needed
  • Single load balancer (no need to share state)

Layer 4: API Gateway

| Dimension | Value |
| --- | --- |
| Latency added | 1-10ms (extra hop) |
| Context available | API keys, routes, consumers, plans |
| Context NOT available | Business logic (user tier in your DB) |
| Best for | API product monetization, developer portals, standard SaaS API |
| Worst for | Complex business logic limits, per-endpoint cost analysis |
| State storage | Gateway's own Redis / database |

Choose API Gateway (Layer 4) when:
  - You are building an API product (developers are your customers)
  - You have multiple APIs and want one rate limit policy
  - You want self-service developer portal with usage dashboards
  - You are on a managed platform (AWS API Gateway, Azure APIM)

Choose Application layer (Layer 5) instead when:
  - Your rate limits depend on business logic (tier from your DB)
  - Rate limits vary per endpoint based on computational cost
  - You want fine-grained control without vendor lock-in
  - Your API is internal (not developer-facing)

Layer 5: Application Middleware

| Dimension | Value |
| --- | --- |
| Latency added | 1-20ms (Redis round trip) |
| Context available | Everything (user, tier, session, business state) |
| State storage | Redis (required for distributed) |
| Best for | Fine-grained business-logic-aware rate limiting |
| Worst for | Very high RPS where every ms matters |
| Flexibility | Highest - full code control |
| Maintenance | Your team owns it |

# Full context available at Layer 5
# Assumes a Starlette/FastAPI-style request object. get_limit_for,
# calculate_endpoint_cost, and check_and_respond are application helpers.
from typing import Optional

from starlette.requests import Request
from starlette.responses import Response

async def rate_limit_middleware(request: Request) -> Optional[Response]:
    user = request.state.user
    endpoint = request.url.path
    http_method = request.method

    # Business context available:
    limit = get_limit_for(
        tier=user.subscription_tier,        # from your DB (cached in Redis)
        endpoint=endpoint,
        method=http_method,
        # request.json() is a coroutine in Starlette, so it must be awaited
        cost=calculate_endpoint_cost(endpoint, await request.json()),
        is_premium=user.is_premium,
        trust_level=user.trust_level
    )
    return check_and_respond(request, limit)

Layer 6: Service Mesh

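| Dimension | Value |
| --- | --- |
| Latency added | 0-2ms (sidecar is local) |
| Context available | Service identity, request metadata |
| Context NOT available | Business logic |
| Best for | Inter-service rate limiting, protecting services from other services |
| Worst for | Per-user or per-consumer limits |
| Requires | Istio, Envoy, or Linkerd deployed |

# Istio (simplified sketch): limit Service A to 100 RPS toward Service B.
# Local rate limiting is enforced per sidecar, so this is 100 RPS per Service A pod.
# A deployable EnvoyFilter typically also needs a listener match and a typed_config "@type".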
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: service-a-to-b-ratelimit
spec:
  configPatches:
    - applyTo: HTTP_FILTER
      match:
        context: SIDECAR_OUTBOUND
        cluster:
          service: service-b.default.svc.cluster.local
      patch:
        operation: INSERT_BEFORE
        value:
          name: envoy.filters.http.local_ratelimit
          typed_config:
            token_bucket:
              max_tokens: 100
              tokens_per_fill: 100
              fill_interval: 1s

Where-to-Place Decision Matrix

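| Requirement | Layer 2 CDN | Layer 3 LB | Layer 4 GW | Layer 5 App | Layer 6 Mesh |
| --- | --- | --- | --- | --- | --- |
| DDoS protection | Best | Good | Poor | Poor | N/A |
| IP-based rate limiting | Best | Best | Good | Good | N/A |
| User-based rate limiting | No | No | Partial | Best | No |
| API key rate limiting | Limited | Limited | Best | Best | No |
| Subscription tier limits | No | No | Partial | Best | No |
| Service-to-service limits | No | No | No | Good | Best |
| GraphQL cost limiting | No | No | No | Best | No |
| Custom business logic | No | No | No | Best | No |
| Latency overhead | None | Minimal | Low-Medium | Medium | Low |
| Vendor lock-in risk | High | Low | Medium-High | None | Medium |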

Recommended combination for most production systems:

Layer 2 (CDN)   : DDoS + IP flood protection
Layer 3 (Nginx) : Rate limit at 10x the app limit (backstop)
Layer 5 (App)   : Per-user business logic limits with Redis

Skip Layer 4 (API Gateway) unless you are building a developer platform.
Skip Layer 6 (Service Mesh) unless you are in a mature microservices org.

3. Algorithm Trade-Offs: Deep Comparison

Fixed Window vs Sliding Window Counter

| Criterion | Fixed Window | Sliding Window Counter |
| --- | --- | --- |
| Memory | 1 counter per window per user | 2 counters per user (current + previous) |
| Accuracy | Lower (window-boundary burst) | High (~0.1% error) |
| Implementation complexity | Very low | Low |
| Redis commands | INCR, EXPIRE | INCR x2, GET x2, EXPIRE x2 (Lua) |
| Max spike at boundary | 2x the limit | ~1.05x the limit |
| Appropriate for... | Internal APIs, quick protection | Most production APIs |

Choose Fixed Window when:

  • Simplicity is paramount (prototype, quick fix)
  • The boundary window attack is acceptable (internal, trusted callers)
  • Memory is extremely constrained
  • The limit is loose enough that 2x burst is fine

Choose Sliding Window Counter when:

  • External or public-facing API
  • Accurate limit enforcement matters
  • Building a developer platform where customers count exact requests
  • Limit is tight (e.g., 5 requests/minute for a sensitive endpoint)
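
The difference between the two is easiest to see in code. The following is a minimal in-memory sketch of a sliding window counter, assuming a single process and an explicit `now` argument so the behavior is reproducible. The class and method names are illustrative and not taken from any library; a production version would keep the two counters in Redis and update them atomically.

```python
from dataclasses import dataclass, field


@dataclass
class SlidingWindowCounter:
    """Approximate sliding window built from two fixed-window counters per key."""
    limit: int
    window: int  # window length in seconds
    counts: dict = field(default_factory=dict)  # (key, window_id) -> count

    def allow(self, key: str, now: float) -> bool:
        window_id = int(now // self.window)
        elapsed = (now % self.window) / self.window  # fraction of current window used
        current = self.counts.get((key, window_id), 0)
        previous = self.counts.get((key, window_id - 1), 0)
        # Weight the previous window by how much of it still overlaps
        # the sliding window that ends at `now`.
        estimated = previous * (1.0 - elapsed) + current
        if estimated >= self.limit:
            return False
        # Old windows are never evicted here; Redis would use EXPIRE for that.
        self.counts[(key, window_id)] = current + 1
        return True


limiter = SlidingWindowCounter(limit=10, window=60)
burst = [limiter.allow("u1", 59) for _ in range(10)]  # fills the first window
blocked = limiter.allow("u1", 60)  # a fixed window would start from zero here
```

With a fixed window, the request at second 60 would be accepted because the counter has just reset; here the previous window still carries full weight, so it is rejected.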

Token Bucket vs Leaky Bucket

| Criterion | Token Bucket | Leaky Bucket |
| --- | --- | --- |
| Burst support | YES - full capacity burst | NO - always fixed rate |
| Output smoothness | Variable (bursty) | Perfectly smooth |
| Memory | O(1): tokens + last_refill | O(capacity): queue |
| Complexity | Low | Medium |
| Latency added | None (instant decision) | Adds latency (request waits in queue) |
| CPU impact | None | Queue management overhead |
| Appropriate for... | User-facing APIs | Traffic shaping, DB protection |

Choose Token Bucket when:

  • Users naturally have bursty access patterns (open app, load feed = 20 requests at once)
  • You want to allow short bursts while limiting sustained throughput
  • Low-latency decisions are required (no queuing)
  • This is the correct choice for 90% of API rate limiting
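
A minimal sketch of the token bucket described above, assuming a single process and caller-supplied timestamps. The names are illustrative; the state is exactly the two values listed in the table (tokens and last refill time).

```python
class TokenBucket:
    """Allows bursts up to `capacity`, then sustains `refill_rate` requests/sec."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0):
        self.capacity = capacity        # maximum burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity          # start full so the first burst succeeds
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill lazily based on elapsed time; no background timer is needed.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(capacity=20, refill_rate=1.0)
feed_load = [bucket.allow(now=0.0) for _ in range(21)]  # 20 pass, the 21st is rejected
```

The decision is instant in both directions: a request either finds a token or is rejected, which is why no queuing latency is added.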

Choose Leaky Bucket when:

  • You are protecting a downstream service that MUST receive smooth traffic
    (e.g., a payment gateway that fails on spikes, an ML inference service with strict SLA)
  • You are doing network traffic shaping (not HTTP APIs)
  • You want to convert bursty inbound traffic into smooth outbound traffic
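
For contrast, here is a minimal sketch of a leaky bucket used as a queue, again with illustrative names and explicit timestamps. It assumes some worker calls `drain` periodically and forwards the released requests downstream.

```python
from collections import deque


class LeakyBucket:
    """Queues bursty input and releases it at a fixed rate."""

    def __init__(self, capacity: int, leak_rate: float, now: float = 0.0):
        self.capacity = capacity    # maximum number of queued requests
        self.leak_rate = leak_rate  # requests released per second
        self.queue = deque()
        self.last_leak = now

    def submit(self, request) -> bool:
        """Queue a request; return False if the bucket is full (overflow)."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self, now: float) -> list:
        """Release the requests that are due at `now`, oldest first."""
        due = int((now - self.last_leak) * self.leak_rate)
        released = [self.queue.popleft() for _ in range(min(due, len(self.queue)))]
        if due > 0:
            # Advance by whole leak intervals so fractional time is not lost,
            # and so idle periods do not build up burst credit.
            self.last_leak += due / self.leak_rate
        return released
```

Unlike the token bucket, accepted requests wait in the queue, so the downstream service sees at most `leak_rate` requests per second regardless of how bursty the input is.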

Sliding Window Log vs Sliding Window Counter

| Criterion | Sliding Window Log | Sliding Window Counter |
| --- | --- | --- |
| Memory | O(limit) per user | O(1) per user |
| Accuracy | Perfect | ~0.1% error |
| Redis structure | Sorted Set (ZADD) | String (GET/INCR) |
| Redis commands per request | ZREMRANGEBYSCORE, ZADD, ZCARD | GET x2, INCR, EXPIRE x2 (Lua) |
| Worst-case memory at scale (limit=1000, 1M users) | ~50 GB (1M x 1000 entries x ~50 bytes) | ~60 MB (1M x 2 keys x ~30 bytes) |
| Appropriate for... | Low-limit, high-security endpoints | General API endpoints |

Choose Sliding Window Log when:

  • Limit is very low (5-20 requests/minute) so memory cost is negligible
  • Perfect accuracy is required (authentication, payment, compliance)
  • You can afford the memory cost

Choose Sliding Window Counter when:

  • Limit is high (100+ per minute) and memory matters
  • You have many users (100K+)
  • 0.1% accuracy margin is acceptable (it is for 99% of use cases)
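
The O(limit) memory cost of the log comes from storing one timestamp per accepted request. A minimal in-memory sketch, with illustrative names, where a deque stands in for the Redis sorted set:

```python
from collections import defaultdict, deque


class SlidingWindowLog:
    """Exact sliding window: keeps the timestamp of every accepted request."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.logs = defaultdict(deque)  # key -> timestamps, oldest first

    def allow(self, key: str, now: float) -> bool:
        log = self.logs[key]
        # Drop timestamps that have left the window (the ZREMRANGEBYSCORE step).
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) >= self.limit:  # the ZCARD step
            return False
        log.append(now)             # the ZADD step
        return True
```

Each key holds at most `limit` timestamps, which is negligible for a limit of 5 and significant for a limit of 1000 across many users.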

4. Storage Backend Trade-Offs

Redis vs DynamoDB vs In-Memory vs PostgreSQL

| Property | Redis | DynamoDB | In-Memory | PostgreSQL |
| --- | --- | --- | --- | --- |
| Latency | 0.1-2ms | 1-10ms | <0.1ms | 2-20ms |
| Throughput | 100K-1M ops/sec | 40K-400K WCU/sec | Millions/sec | 10K-100K/sec |
| Consistency | Strong (single node) | Eventual (default) | Local only | Strong |
| Persistence | Configurable (RDB/AOF) | Always persistent | No (lost on restart) | Always |
| Auto-expiry | YES (EXPIRE command) | YES (TTL attribute) | Via code | Via cron |
| Distributed | Yes (Cluster) | Yes (managed) | No | Yes (but slow) |
| Atomic operations | Yes (INCR, Lua) | Conditional writes | Yes (locked) | Yes (transactions) |
| Cost at scale | Infrastructure cost | Per-operation cost | Cheapest | Infrastructure |
| Managed service | ElastiCache | DynamoDB | N/A | RDS |

Choose Redis when:

  • Highest throughput requirement (>10K RPS rate limit checks)
  • Low latency is critical (adds <2ms to request)
  • You need Lua scripting for complex atomic operations
  • You already have Redis in your infrastructure

Choose DynamoDB when:

  • You are all-in on AWS serverless
  • You want zero infrastructure management
  • You can tolerate ~5ms rate limit check latency
  • You prefer pay-per-use pricing (no Redis instance to manage)

DynamoDB rate limiter:

import time

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("rate_limits")

def is_allowed_dynamodb(user_id: str, limit: int, window: int) -> bool:
    """Fixed-window check: atomically increment the counter unless it is at the limit."""
    now = int(time.time())
    window_id = now // window
    expires_at = (window_id + 2) * window  # TTL for DynamoDB auto-cleanup

    try:
        table.update_item(
            Key={"pk": f"rl:{user_id}:{window_id}"},
            UpdateExpression="SET #cnt = if_not_exists(#cnt, :zero) + :one, #ttl = :ttl",
            # String condition so the #cnt placeholder resolves to "count".
            # Attr("#cnt") would target an attribute literally named "#cnt",
            # which never exists, so the limit would never be enforced.
            ConditionExpression="attribute_not_exists(#cnt) OR #cnt < :limit",
            ExpressionAttributeNames={"#cnt": "count", "#ttl": "ttl"},
            ExpressionAttributeValues={
                ":zero": 0, ":one": 1, ":ttl": expires_at, ":limit": limit
            },
            ReturnValues="UPDATED_NEW"
        )
        return True
    except dynamodb.meta.client.exceptions.ConditionalCheckFailedException:
        return False  # limit exceeded

Choose In-Memory when:

  • Single instance deployment
  • Sticky sessions guaranteed
  • Rate limit state loss on restart is acceptable
  • Maximum possible throughput needed (local ML inference services, etc.)

5. Centralized vs Decentralized vs Hybrid

Full Trade-Off Analysis

| Property | Centralized (Redis) | Decentralized (Local) | Hybrid |
| --- | --- | --- | --- |
| Accuracy | Exact | Per-server (N instances = N*limit) | Approximate (configurable) |
| Latency overhead | +1-5ms (Redis RTT) | 0ms (in-process) | <1ms avg |
| Failure mode | Redis down = fail-open | Server down = unaffected | Degrades to local |
| Implementation | Moderate | Simple | Complex |
| Works with horizontal scale | Yes | No (each server is independent) | Yes |
| Works with serverless | Yes | Partially (per-container) | Yes |
| Memory efficiency | Centralized | Distributed (duplicated) | Centralized with local cache |

When to Use Each

Centralized:

Use when:
- Exact enforcement is required (payments, authentication, compliance)
- You have more than 2 application instances
- You need consistent limits across all servers
- You already have Redis infrastructure

Acceptable drawback:
- +1-5ms per request (this is the price of accuracy)

Decentralized (Local):

Use when:
- Single-instance service (dev, small internal tool)
- Rate limiting the SERVICE (not individual users) - "protect this server"
- Outbound rate limiting (your service calling external APIs)
- You cannot afford Redis infrastructure

Not acceptable when:
- Multiple instances (limits multiply by instance count)
- You need accurate per-user enforcement

Hybrid:

Use when:
- Very high RPS where Redis latency is a concern (>50K RPS)
- You want to cut Redis load substantially (roughly 80-90%, depending on batch size and sync interval)
- Approximate limits are acceptable (typically ~10% over-limit, up to ~30% in edge cases)

Example at 100K RPS, limit=100/min, 10 servers:
  Decentralized: each server enforces 100/min -> effective = 1000/min (wrong!)
  Centralized: 100K Redis ops/sec (Redis CPU: ~60%)
  Hybrid: Each server reserves 30 locally, checks global every 1 second
    -> 10K Redis sync ops/sec instead of 100K (90% Redis load reduction)
    -> Accuracy: within 30% of limit (acceptable for non-critical)
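
One way to build a hybrid limiter is batch reservation: each server reserves a block of quota from the shared store and spends it locally. The sketch below is one possible variant, not the only design. `GlobalCounter` is a hypothetical in-memory stand-in for Redis, window resets are omitted, and a single key is assumed.

```python
class GlobalCounter:
    """In-memory stand-in for a shared store such as Redis (hypothetical)."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0
        self.calls = 0  # round trips to the shared store, for illustration

    def reserve(self, n: int) -> int:
        """Grant up to n units of the remaining global quota."""
        self.calls += 1
        granted = max(0, min(n, self.limit - self.used))
        self.used += granted
        return granted


class HybridLimiter:
    """Spends a locally reserved batch and contacts the store only when it is empty."""

    def __init__(self, store: GlobalCounter, batch: int):
        self.store = store
        self.batch = batch
        self.local_tokens = 0

    def allow(self) -> bool:
        if self.local_tokens == 0:
            # A real implementation would also cache the "exhausted" state
            # until the window resets instead of asking on every request.
            self.local_tokens = self.store.reserve(self.batch)
        if self.local_tokens > 0:
            self.local_tokens -= 1
            return True
        return False
```

This variant never exceeds the global limit, but quota reserved by an idle server is stranded until the window ends, so it errs toward under-serving. Designs that sync counts periodically instead err toward over-serving, which is where the over-limit percentages above come from.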

6. Accuracy vs Memory vs Latency Triangle

There is a fundamental triangle trade-off. You cannot optimize all three simultaneously.

                    ACCURACY
                    (exact limits)
                    /\
                   /  \
                  /    \
                 /      \
                /        \
               /   Pick   \
              /    Any 2   \
             /______________\
      MEMORY                LATENCY
  (O(1) per user)      (minimal overhead)

The Trade-Off Explained

Accuracy + Low Memory, Sacrificing Latency:

  • Algorithm: Centralized Redis Sliding Window Counter (Lua script)
  • Memory: O(1) per user (good)
  • Accuracy: ~0.1% error (good)
  • Latency: +2-5ms per request (Redis round trip, cannot avoid)

Accuracy + Low Latency, Sacrificing Memory:

  • Algorithm: In-process Sliding Window Log
  • Memory: O(limit) per user (poor at scale)
  • Accuracy: Perfect
  • Latency: Sub-millisecond (in-process, no Redis)

Low Memory + Low Latency, Sacrificing Accuracy:

  • Algorithm: In-process Token Bucket (no Redis)
  • Memory: O(1) per user (good)
  • Accuracy: Per-server only (poor for distributed)
  • Latency: Sub-millisecond

The Sweet Spot for Most Production Systems:

Algorithm: Redis Sliding Window Counter (Lua script)
Accept: +2ms latency for Redis
Get:    High accuracy (~0.1% error), O(1) memory, works distributed

This covers 95% of production use cases. Only optimize further when:
  - Latency SLA is extremely tight (<10ms total response time)
  - Scale is extreme (>100K RPS)

Quantifying the Trade-Offs

At 10,000 RPS with 100,000 users, limit=100/min:

Option 1: Redis Sliding Window Counter (Lua)
  Memory:   100,000 x 2 keys x 30 bytes = 6 MB Redis
  Latency:  +2ms per request (Redis RTT)
  Accuracy: 99.9% (0.1% error from weighting)
  Cost:     1 Redis node, ~$50-100/month

Option 2: Redis Sliding Window Log (Sorted Set)
  Memory:   100,000 x 100 entries x 50 bytes = 500 MB Redis
  Latency:  +3ms per request (more Redis commands)
  Accuracy: 100% (perfect)
  Cost:     1 Redis node with more RAM, ~$150-200/month

Option 3: In-Process Token Bucket
  Memory:   100,000 x 16 bytes = 1.6 MB per server x 10 servers = 16 MB
  Latency:  +0.01ms (in-process)
  Accuracy: Poor (each of the 10 servers enforces the full limit, so the effective limit is up to 10x)
  Cost:     $0 extra (no Redis needed)
  Problem:  Users can make 1000/min instead of 100/min

Option 4: Hybrid (Local + Redis)
  Memory:   16 MB local + 6 MB Redis
  Latency:  +0.01ms for 90% of requests, +2ms for 10%
  Accuracy: ~95% (allows up to 30% over-limit in edge cases)
  Cost:     1 Redis node, ~$50/month
  Best for: Non-critical endpoints at extreme scale
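
The memory figures above are simple products, so they are easy to recompute for other workloads. The helper names below are illustrative, and the per-key and per-entry byte sizes are this article's rough estimates rather than measured values.

```python
def counter_memory_bytes(users: int, bytes_per_key: int = 30) -> int:
    """Sliding window counter: two keys (current + previous window) per user."""
    return users * 2 * bytes_per_key


def log_memory_bytes(users: int, limit: int, bytes_per_entry: int = 50) -> int:
    """Sliding window log: worst case is `limit` stored timestamps per user."""
    return users * limit * bytes_per_entry


def token_bucket_memory_bytes(users: int, servers: int, bytes_per_bucket: int = 16) -> int:
    """In-process token bucket: every server may hold a bucket for every user."""
    return users * bytes_per_bucket * servers


MB = 1_000_000
option_1 = counter_memory_bytes(100_000) / MB           # 6.0 MB
option_2 = log_memory_bytes(100_000, limit=100) / MB    # 500.0 MB
option_3 = token_bucket_memory_bytes(100_000, 10) / MB  # 16.0 MB
```

Raising the limit from 100 to 1000 multiplies only the log figure, which is why the log is reserved for low-limit endpoints.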

7. Hard Limit vs Soft Limit vs Throttle vs Queue

When to Use Each Response Strategy

| Strategy | Behavior | Response | When to Use |
| --- | --- | --- | --- |
| Hard Limit | Reject immediately | 429 | Security-critical, payment APIs |
| Soft Limit | Allow with warning | 200 + warning header | Developer APIs, gradual enforcement |
| Throttle (delay) | Queue and delay | 200 (after wait) | Background jobs, batch processing |
| Queue (async) | Accept, process later | 202 Accepted | Long-running ops, webhook dispatch |
| Shed (degrade) | Return cached/degraded | 200 (partial) | High availability priority |

Decision Framework

Question 1: Can the request be safely rejected?
  If NO (e.g., user is in a checkout flow, payment in flight):
    Use Soft Limit or Queue, not Hard Limit
  If YES:
    Question 2: Is this a security-critical endpoint?
      If YES (login, payment, delete): Hard Limit
      If NO: Question 3

Question 3: Is the caller a human or a machine?
  Human: Use Soft Limit (warn before cutting off)
  Machine: Use Hard Limit or Throttle (machines should handle 429)

Question 4: Can the request be deferred?
  YES (report generation, bulk export, email send): Use Queue (202 Accepted)
  NO (real-time query, user is waiting): Hard Limit
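
The questions above can be encoded as a small function. This is one possible reading, not a definitive ordering: it assumes the deferral question is asked before the human/machine question, and the `Strategy` enum and parameter names are illustrative.

```python
from enum import Enum


class Strategy(Enum):
    HARD_LIMIT = "429"
    SOFT_LIMIT = "200 + warning header"
    QUEUE = "202 Accepted"


def choose_strategy(
    safe_to_reject: bool,
    security_critical: bool,
    deferrable: bool,
    caller_is_human: bool,
) -> Strategy:
    # Question 1: requests that cannot be rejected are queued or soft-limited.
    if not safe_to_reject:
        return Strategy.QUEUE if deferrable else Strategy.SOFT_LIMIT
    # Question 2: security-critical endpoints always get a hard limit.
    if security_critical:
        return Strategy.HARD_LIMIT
    # Question 4 (asked before Question 3 in this reading): defer if possible.
    if deferrable:
        return Strategy.QUEUE
    # Question 3: warn humans, reject machines (they should handle 429).
    return Strategy.SOFT_LIMIT if caller_is_human else Strategy.HARD_LIMIT
```

For example, a login endpoint maps to a hard limit, a checkout step in flight to a soft limit, and report generation to a queue.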

Implementing Degraded Response (Graceful Degradation)

class GracefulRateLimiter:
    """
    Instead of hard-rejecting, return stale or partial data when over limit.
    Use only for read endpoints where stale data is acceptable.
    """
 
    def __init__(self, limiter, cache):
        self.limiter = limiter
        self.cache = cache
 
    def handle_request(self, request, user_id: str) -> Response:
        result = self.limiter.check(user_id)
 
        if result["allowed"]:
            # Normal path: fresh data
            data = fetch_fresh_data(request)
            self.cache.set(request.path, data, ttl=60)
            return Response(data, headers=self._rl_headers(result))
 
        # Over limit: try cached/stale data
        stale = self.cache.get_stale(request.path)
        if stale:
            return Response(
                stale,
                headers={
                    **self._rl_headers(result),
                    "X-Cache": "STALE",
                    "X-RateLimit-Degraded": "true",
                    # Warning header syntax: warn-code, warn-agent, quoted warn-text
                    "Warning": '199 - "Response may be stale due to rate limiting"'
                }
            )
 
        # No stale data available: hard reject
        return Response(
            {"error": "rate_limit_exceeded"},
            status=429,
            headers=self._rl_headers(result)
        )

8. Per-User vs Per-IP vs Per-Key vs Global

When Each Is the Right Identifier

| Identifier | Granularity | Authentication Required | Best For | Pitfalls |
| --- | --- | --- | --- | --- |
| Per-User ID | Individual | Yes | All authenticated APIs | None (best option) |
| Per-API Key | Per credential | Yes (key-based) | Developer APIs, M2M | Key leakage, shared keys |
| Per-IP | IP-level | No | Unauthenticated endpoints | NAT, CGN, proxies |
| Per-IP + User-Agent | Better than IP | No | Unauthenticated + bot detect | Easily spoofed |
| Per-IP Subnet (/24) | Subnet-level | No | CGN, corporate networks | Whole company impacted |
| Global | System-wide | No | Infrastructure protection | Unfair (one user affects all) |

The Layered Identifier Strategy

def get_rate_limit_identifiers(request) -> list[tuple[str, int]]:
    """
    Return list of (identifier, limit) pairs.
    ALL must pass for the request to be allowed.
    Each provides a different layer of protection.
    """
    identifiers = []
    window = 60  # 1 minute window for all
 
    # Layer 1: Per-user (most specific, highest limit)
    user_id = extract_user_id(request)
    if user_id:
        tier_limit = get_tier_limit(user_id)
        identifiers.append((f"user:{user_id}", tier_limit))
 
    # Layer 2: Per-API-key (for machine clients)
    api_key = request.headers.get("X-API-Key")
    if api_key:
        identifiers.append((f"apikey:{hash_key(api_key)}", 5000))
 
    # Layer 3: Per-IP (coarse, protects against unauthenticated floods)
    ip = get_real_ip(request)
    ip_limit = 1000 if is_cgn_ip(ip) else 200  # higher for NAT
    identifiers.append((f"ip:{ip}", ip_limit))
 
    # Layer 4: Global (system-wide protection)
    identifiers.append(("global:system", 1_000_000))
 
    return identifiers
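
The list returned above still has to be enforced so that ALL layers pass. A minimal sketch of that check follows, assuming a fixed one-minute window. `FixedWindowStore` and `check_all` are hypothetical names, and the in-memory store stands in for Redis, where the check and the increments would need to run atomically (for example in a Lua script).

```python
from collections import defaultdict


class FixedWindowStore:
    """Minimal in-memory counter store, a stand-in for Redis (hypothetical)."""

    def __init__(self, window: int = 60):
        self.window = window
        self.counts = defaultdict(int)  # (key, window_id) -> count

    def count(self, key: str, now: float) -> int:
        return self.counts[(key, int(now // self.window))]

    def increment(self, key: str, now: float) -> None:
        self.counts[(key, int(now // self.window))] += 1


def check_all(store, identifiers, now):
    """Allow only if every (identifier, limit) pair is under its limit."""
    for key, limit in identifiers:
        if store.count(key, now) >= limit:
            return False, key  # report which layer rejected the request
    # Count the request against every layer only once it has been accepted,
    # so a rejection by one layer does not burn quota in the others.
    for key, _ in identifiers:
        store.increment(key, now)
    return True, None
```

Returning the rejecting identifier makes it possible to log which layer fired and to choose the right Retry-After value for the response.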

The Fallback Chain

Best case: Authenticated user -> rate limit by user ID
Good:      API key present -> rate limit by key hash
Fair:      No auth, regular IP -> rate limit by IP (lower limit)
Coarse:    No auth, CGN/corporate IP -> rate limit by IP with higher limit
Emergency: DDoS detected -> temporary geographic block at CDN

9. Fail-Open vs Fail-Closed: The Full Analysis

Decision Matrix by Endpoint Type

| Endpoint Type | Risk of Fail-Open | Risk of Fail-Closed | Recommendation |
| --- | --- | --- | --- |
| Public read API | Low | Medium | Fail-Open |
| User dashboard | Low | Medium | Fail-Open |
| Search API | Low | Medium | Fail-Open |
| Payment/charge API | Very High | Low | Fail-Closed |
| Authentication/login | High | Low | Fail-Closed |
| Account creation | High | Medium | Fail-Closed |
| Admin operations | High | Low | Fail-Closed |
| Password reset | High | Low | Fail-Closed |
| File upload | Medium | Medium | Local fallback |
| Report generation | Low | High (expensive) | Fail-Open with quota |
| Internal health check | N/A | Very High (monitoring blind) | Always Open |

The Local Fallback Strategy (Best of Both)

from enum import Enum
import time
 
class FailPolicy(Enum):
    OPEN = "open"        # Allow all requests when Redis down
    CLOSED = "closed"    # Deny all requests when Redis down
    LOCAL = "local"      # Use local rate limiter as fallback
 
class ResilientRateLimiter:
    """
    Three-mode rate limiter:
    - HEALTHY: Use Redis (accurate, distributed)
    - DEGRADED: Use local limiter (approximate, single-instance)
    - FAILED: Fail-open or fail-closed based on policy
    """
 
    def __init__(
        self,
        redis_limiter,
        local_limiter,
        policy: FailPolicy = FailPolicy.LOCAL,
        failure_threshold: int = 3,
        recovery_timeout: int = 30
    ):
        self.redis_limiter = redis_limiter
        self.local_limiter = local_limiter
        self.policy = policy
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.circuit_opened_at = None
 
    def is_allowed(self, identifier: str) -> dict:
        # Check if circuit is open (Redis failed recently)
        if self.circuit_opened_at:
            if time.time() - self.circuit_opened_at > self.recovery_timeout:
                # Try to recover
                self.circuit_opened_at = None
                self.failure_count = 0
            else:
                return self._handle_degraded(identifier)
 
        try:
            result = self.redis_limiter.is_allowed(identifier)
            self.failure_count = 0  # Reset on success
            return result
 
        except Exception as e:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.circuit_opened_at = time.time()
                # Alert!
            return self._handle_degraded(identifier)
 
    def _handle_degraded(self, identifier: str) -> dict:
        if self.policy == FailPolicy.OPEN:
            return {"allowed": True, "mode": "fail_open", "reason": "redis_unavailable"}

        if self.policy == FailPolicy.CLOSED:
            return {"allowed": False, "mode": "fail_closed", "reason": "redis_unavailable"}
 
        # LOCAL fallback
        local_result = self.local_limiter.is_allowed(identifier)
        local_result["mode"] = "local_fallback"
        local_result["note"] = "approximate_limits"
        return local_result

10. Rate Limiting Library vs Build Your Own

Library Options by Language

| Language | Library | Algorithm | Redis-backed | Notes |
| --- | --- | --- | --- | --- |
| Java | Bucket4j | Token Bucket | Yes (Lettuce/Jedis) | Best Java option, production-grade |
| Java | Resilience4j RateLimiter | Fixed Window | No | Works with its circuit breaker, annotation-driven |
| Java | Guava RateLimiter | Token Bucket | No | Google Guava, local only |
| Python | Flask-Limiter | Configurable | Yes (Redis) | Flask-specific |
| Python | slowapi | Configurable | Yes (Redis) | FastAPI-native |
| Python | limits | Fixed and moving window strategies | Yes | Framework-agnostic |
| Node.js | express-rate-limit | Fixed/Sliding | Yes (RedisStore) | Most popular |
| Go | golang.org/x/time/rate | Token Bucket | No | Official golang.org/x package (not in the standard library), local only |
| Go | throttled | GCRA | Yes (Redis) | Similar to Stripe's approach |
| Ruby | rack-attack | Configurable | Yes (Redis) | Rack middleware |

Build vs Buy Decision

Use a library when:

  • Standard use case (fixed window, token bucket, sliding window counter)
  • One of the supported languages above
  • Speed to production matters
  • You want community support and battle-tested code
  • Library supports your required storage backend

Build your own when:

  • You need custom algorithm behavior not supported by any library
  • You need extreme performance optimization for your specific access patterns
  • You are integrating with a novel storage backend
  • You have compliance requirements that restrict open-source dependency usage
  • Custom rate limit logic (e.g., cost-based per-query complexity)

The real question: How much is custom logic?

Pure fixed window with Redis? -> Use library (Flask-Limiter / express-rate-limit)
Token bucket with Redis? -> Use Bucket4j or equivalent
Custom GraphQL cost analysis? -> Build the cost analyzer, use library for the bucket
Adaptive rate limiting with ML? -> Build it (no library does this)
Multi-dimensional with 5 limits? -> Build on top of library primitives
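
As an illustration of "build the cost analyzer, use a library for the bucket", here is a toy sketch. The cost model (one point per field, children multiplied by the requested page size) and the nested-dict query shape are assumptions made for this example and are not how any particular GraphQL server represents queries. `CostBudget` is a stand-in for whatever bucket a library provides.

```python
def query_cost(selection: dict) -> int:
    """Toy cost model: 1 point per field, children multiplied by the page size."""
    cost = 0
    for _field, spec in selection.items():
        cost += 1
        if isinstance(spec, dict):
            page = spec.get("first", 1)
            cost += page * query_cost(spec.get("fields", {}))
    return cost


class CostBudget:
    """Per-window point budget; stand-in for a library-provided bucket."""

    def __init__(self, points: int):
        self.remaining = points

    def try_consume(self, cost: int) -> bool:
        if cost > self.remaining:
            return False
        self.remaining -= cost
        return True


# Hypothetical query: viewer { name, repos(first: 10) { name, stars } }
query = {
    "viewer": {
        "fields": {
            "name": None,
            "repos": {"first": 10, "fields": {"name": None, "stars": None}},
        }
    }
}
budget = CostBudget(points=50)
results = [budget.try_consume(query_cost(query)) for _ in range(3)]
```

Only `query_cost` is custom here. The budget could be replaced by any library bucket that supports consuming more than one token per request.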

11. Infrastructure Cost Comparison

Scenario: 100,000 active users, 10,000 RPS peak, 100 requests/minute limit per user

Option A: Redis Standalone (AWS ElastiCache)

Redis instance: cache.m5.large ($0.10/hour = $72/month)
Memory needed: 100,000 users x 2 keys x 60 bytes = 12 MB
Redis CPU at 10K RPS: ~15% (well within limits)
Latency overhead: +2ms per request
Annual cost: ~$864

Option B: Redis Cluster (3 primary + 3 replica)

Redis cluster: 6x cache.t3.micro ($0.017/hour x 6 = $74/month)
Memory: 12 MB split across 3 primaries
CPU: 3% per shard
Latency: +2ms (same as standalone for this scale)
Annual cost: ~$888
Use when: Availability is critical, not when scale demands it at this size

Option C: DynamoDB On-Demand

Writes (INCR equivalent): 10,000 write request units/sec x $0.0000012 = $0.012/second
At sustained peak: $0.012 x 2,592,000 seconds/month = ~$31,100/month
Reads (0 - we don't read separately): $0
Storage: 12 MB x $0.25/GB/month = $0.003/month
Total: up to ~$31,100/month if peak load is sustained (average load is usually far lower, and the bill scales with it)
Annual cost: up to ~$373,000 at sustained peak (vs $864 for Redis)
Use when: Serverless architecture, no Redis allowed, cost is secondary

Option D: Nginx Local (no Redis, per-IP only)

Additional infrastructure: $0 (uses existing Nginx)
Limitation: Per-IP only, no per-user limits
Use when: IP-based limits are sufficient (public site protection)
Annual cost: $0 additional

Cost Summary:

| Option | Monthly Cost | Latency | Accuracy | Scales to 1M users? |
| --- | --- | --- | --- | --- |
| Redis Standalone | $72 | +2ms | 99.9% | Yes (upgrade instance) |
| Redis Cluster | $74 | +2ms | 99.9% | Yes (add shards) |
| DynamoDB On-Demand | Up to ~$31,100 (sustained peak) | +5ms | 99.9% | Yes (auto) |
| Nginx Local | $0 | +0ms | IP-only | No (IP-only) |
| In-Process | $0 | +0ms | Per-instance | No (N*limit) |

12. Trade-Off Decision Trees

Decision Tree 1: Choosing an Algorithm

START: What is your primary constraint?

Accuracy first? (security, payment, compliance)
  -> Low user count (<10K)? -> Sliding Window Log
  -> High user count (>10K)? -> Sliding Window Counter + Redis Lua (99.9% accurate)

Memory first? (very large user base, many keys)
  -> Need burst support? -> Token Bucket or GCRA
  -> No burst needed? -> Fixed Window or Sliding Window Counter

Latency first? (every millisecond matters)
  -> Can you accept per-server limits? -> In-process Token Bucket
  -> Need distributed accuracy? -> Hybrid (local cache + Redis)

Burst support is required?
  -> Need smooth output to downstream? -> Token Bucket + Leaky Bucket combination
  -> Just need burst allowance? -> Token Bucket

Simplicity first? (prototype, internal tool)
  -> Fixed Window Counter

Decision Tree 2: Choosing Where to Implement

START: Who are your clients?

Unknown / public / unauthenticated?
  -> Add CDN/Edge rate limiting first (free with Cloudflare)
  -> Add Nginx per-IP limits as backup
  -> Add app-level per-user after they authenticate

Known developers / API customers?
  -> API Gateway (Kong or AWS API Gateway)
  -> Add app-level for business-logic limits

Internal services only?
  -> Service mesh (Istio/Envoy) for service-to-service
  -> App-level for user-facing

All of the above (typical production)?
  -> Layer CDN + Nginx + App-level (defense in depth)
  -> API Gateway only if developer portal is needed

Decision Tree 3: Fail Strategy

START: What happens if your rate limiter fails?

Is this endpoint security-critical?
  (login, payment, account creation, admin operations)
  -> Fail-Closed: better to be unavailable than exploited
  -> Alert and page on-call immediately

Is this endpoint user-facing but not security-critical?
  (dashboard, product pages, search, data queries)
  -> Local fallback limiter
  -> Fail-Open after local limiter exhausts
  -> Alert but don't page

Is this endpoint a health check or monitoring endpoint?
  -> Never rate limit at all
  -> Always fail-open (or exempt from rate limiting entirely)

Is this endpoint internal or infrastructure?
  -> Fail-Open with monitoring
  -> Internal services calling each other can handle brief over-limits
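
Decision Tree 3 can be captured as a small lookup so the policy is chosen in one place rather than per handler. The category names and the `EXEMPT` member below are illustrative assumptions; a real service would map its own route groups.

```python
from enum import Enum


class FailPolicy(Enum):
    OPEN = "open"      # allow requests when the limiter backend is down
    CLOSED = "closed"  # deny requests when the limiter backend is down
    LOCAL = "local"    # fall back to a local limiter, then fail open
    EXEMPT = "exempt"  # never rate limited in the first place


SECURITY_CRITICAL = {"login", "payment", "account_creation", "admin"}
MONITORING = {"health", "metrics"}


def fail_policy_for(category: str, internal: bool = False) -> FailPolicy:
    """Map an endpoint category to the failure strategy from the tree above."""
    if category in MONITORING:
        return FailPolicy.EXEMPT
    if category in SECURITY_CRITICAL:
        return FailPolicy.CLOSED
    if internal:
        return FailPolicy.OPEN
    # User-facing but not security-critical: local fallback first.
    return FailPolicy.LOCAL
```

Checking the monitoring and security sets first means an internal flag can never downgrade a payment endpoint to fail-open.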

Summary Reference Card

ALGORITHM SELECTION:
  Default choice:   Sliding Window Counter (accuracy + memory balance)
  Need bursting:    Token Bucket
  Security-critical: Sliding Window Log (exact)
  Traffic shaping:  Leaky Bucket
  Memory-minimal:   GCRA (1 float per user)

PLACEMENT SELECTION:
  Public API protection:  CDN + Load Balancer
  User-level limits:      Application middleware + Redis
  Developer platform:     API Gateway
  Service-to-service:     Service Mesh

STORAGE SELECTION:
  Standard:        Redis (best balance)
  AWS Serverless:  DynamoDB
  Local only:      In-process (single instance only)

FAILURE STRATEGY:
  Security endpoints: Fail-Closed
  General endpoints:  Local fallback -> Fail-Open
  Health/monitoring:  Always Open (exempt from limiting)

IDENTIFIER PRIORITY:
  Best:      Authenticated User ID
  Good:      API Key hash
  Fair:      IP address (NAT/CGN-aware limit)
  Emergency: Subnet or ASN-level

Next Supplement: Supplement 4 - Architecture Patterns and Decision-Making