Rate Limiting Demystified - Part 6: Interview Questions

How to Use This Section
Section 1: Conceptual Questions (Most Frequently Asked)
Section 2: Algorithm Questions
Section 3: System Design Questions
Section 4: Coding Questions
Section 5: Tricky and Advanced Questions (2024-2026)
Cheat Sheet: Key Numbers and Formulas

How to Use This Section

Questions are ordered:

Section 1-2: Most frequently asked. Expect these in almost every interview.
Section 3: System design. Asked in senior/staff roles and dedicated SD rounds.
Section 4: Coding questions. Asked in coding rounds at FAANG-level companies.
Section 5: Tricky and advanced. Seen in 2024-2026 top-tier interviews.

For each question, the answer format is:

Core Answer: The must-know response. Never give less than this.
Go Deeper: Additional detail that separates a good answer from a great one.
Follow-up Questions: What the interviewer will likely ask next.

Section 1: Conceptual Questions (Q1-Q20)

Q1: What is rate limiting and why do we need it?

Core Answer:
Rate limiting controls how many requests a client (user, IP, API key) can make to a
service within a defined time window. Requests that exceed the limit are rejected with
HTTP 429.

We need it for five reasons:

Abuse prevention: Stop malicious actors from flooding the system
Fair usage: Prevent one heavy user from degrading others' experience
Cost control: Uncontrolled API calls can produce enormous cloud bills
SLA enforcement: Keep response times within promised bounds
Monetization: Enforce tiered access (free vs paid plans)

Go Deeper:
Rate limiting also protects downstream dependencies. If your API calls a database, a
rate limit at the API boundary prevents the database from being overwhelmed too.
Distinguish inbound rate limiting (protecting your service) from outbound rate limiting
(respecting third-party limits like Stripe or GitHub).

Follow-up Questions:

"What is the difference between rate limiting and throttling?"
"How does rate limiting differ from a circuit breaker?"
"What HTTP status code should a rate-limited response use?"

Q2: What HTTP status code should be returned when a rate limit is exceeded?

Core Answer:
429 Too Many Requests (defined in RFC 6585).

The response should include:

Retry-After header: how many seconds to wait before retrying
X-RateLimit-Limit: total requests allowed
X-RateLimit-Remaining: requests remaining (0 when 429)
X-RateLimit-Reset: Unix timestamp when the window resets

Go Deeper:
503 Service Unavailable is sometimes used for global rate limits or when the entire
service is overwhelmed, but 429 is semantically correct for per-client rate limiting.
The distinction: 429 = "you specifically are sending too many requests", 503 = "we as
a service cannot handle any requests right now."

Follow-up Questions:

"What is the difference between 429 and 503?"
"What should be in the response body?"
"How should a client handle a 429 response?"

Q3: Explain the difference between rate limiting, throttling, and circuit breaker.

Core Answer:

Concept	Action on Excess	Direction	Example
Rate Limiting	Reject (429)	Inbound requests	"You can send max 100 RPM"
Throttling	Delay/slow down	Inbound requests	"Extra requests wait in queue"
Circuit Breaker	Stop outgoing calls	Outbound calls	"DB failing 50% -> stop calling it"

Rate Limiting: Hard stop. Request 101 gets rejected immediately.
Throttling: Soft stop. Request 101 waits in a queue, gets processed when capacity allows.
Circuit Breaker: Monitors downstream failure rate. Opens circuit to prevent cascading failures.

Go Deeper:
The circuit breaker has three states: CLOSED (normal), OPEN (all calls blocked), HALF-OPEN
(testing recovery). Rate limiting is stateless per window; circuit breaker is stateful.

Follow-up Questions:

"When would you use throttling instead of rate limiting?"
"Can you have both a rate limiter and a circuit breaker in the same system?"
"What is the difference between circuit breaker and retry?"

Q4: What is the token bucket algorithm?

Core Answer:
Token bucket is a rate limiting algorithm where:

A bucket holds up to N tokens (the capacity)
Tokens are added at a constant rate R (e.g., 10 tokens/second)
Each request consumes 1 token (or more for expensive operations)
If enough tokens exist: request is allowed and tokens consumed
If not enough tokens: request is rejected
Tokens accumulate when traffic is low (allows burst)

Key properties:

Allows burst: if bucket is full (100 tokens), all 100 can be used instantly
Enforces average rate: sustained traffic cannot exceed the refill rate
Memory: O(1) per user (store tokens + last_refill_time)

Go Deeper:
The burst capacity and the rate are independently configurable. This separation is
powerful: capacity=100, rate=10/s allows users to burst 100 requests, but sustained
throughput is limited to 10 RPS. Used by AWS API Gateway, Stripe, and Nginx (burst param).

Follow-up Questions:

"How is token bucket different from leaky bucket?"
"How do you implement token bucket in Redis?"
"Why do you need a Lua script for atomic token bucket in Redis?"
"How would you implement cost-based rate limiting with token bucket?"

Q5: How is token bucket different from leaky bucket?

Core Answer:

Property	Token Bucket	Leaky Bucket
Burst support	YES - tokens accumulate and can be spent instantly	NO - output is always smooth
Output pattern	Bursty (depends on available tokens)	Smooth, constant rate
Use case	User-facing APIs (burst is natural)	Traffic shaping, protecting fragile DBs
When request arrives	Allowed immediately if token available	Added to queue, processed at fixed rate

Token bucket: "Do you have enough credit? Yes -> go now."
Leaky bucket: "Join the queue, you will be processed at our steady rate."

Go Deeper:
The leaky bucket's smooth output is ideal for protecting downstream services from
spikes. If your service calls a database that can handle exactly 100 QPS, a leaky
bucket ensures you never send more than 100 QPS even if your API receives 1,000 RPS.
Token bucket would relay the 1,000 RPS burst if the bucket was full.

Q6: What is the sliding window log algorithm and what is its main drawback?

Core Answer:
Sliding window log keeps a timestamp log of every request. When a new request arrives:

Remove all timestamps older than now - window_size
Count remaining timestamps
If count < limit: allow and add current timestamp
Otherwise: reject

Main drawback: Memory usage is O(limit) per user. With limit=1,000 and 1M users,
that is 8KB x 1M = 8GB just for rate limit logs. This is why it is rarely used at scale.

Go Deeper:
For very low-limit, high-accuracy use cases (like login brute force: limit=5), the
memory cost is negligible and the perfect accuracy is worth it. For general APIs with
limits of 1,000+, use the sliding window counter (hybrid approach) instead.

Q7: Explain the fixed window counter and its boundary problem.

Core Answer:
Fixed window divides time into discrete non-overlapping windows (e.g., each 60-second
block). A counter per user per window is incremented on each request. If counter > limit,
reject.

Boundary problem: A user can send 2x the limit in a short time by straddling two windows.

Example (limit = 100/minute):

Send 100 requests at 11:59:59 (last second of window 1 - all allowed)
Send 100 requests at 12:00:01 (first second of window 2 - all allowed)
Result: 200 requests in 2 seconds. Effective rate = 100x the intended 1 per second.

Go Deeper:
The boundary problem is real and exploitable in security contexts. Attackers who know
your window boundaries can time attacks to get 2x the limit. For security-sensitive
endpoints, use sliding window or token bucket instead.

Q8: What is the sliding window counter and how does it address the fixed window problem?

Core Answer:
The sliding window counter (hybrid approach) combines fixed window efficiency with
sliding window accuracy. It stores only two values: current window count and previous
window count. It uses a weighted formula to estimate the count in the true sliding window:

estimated_count = prev_count * (1 - elapsed/window_size) + curr_count

Where elapsed is how far we are into the current window.

This eliminates the boundary spike of fixed windows while using only O(1) memory.

Go Deeper:
The error margin is at most 1/limit (for a limit of 1,000, the error is 0.1%). This
is acceptable for general API rate limiting. For security-critical endpoints, the sliding
window log is more accurate. Cloudflare uses sliding window counter for its production
rate limiting infrastructure.

Follow-up Question: "What is the maximum error rate of the sliding window counter?"
Answer: At most (limit-1)/limit * 1/window_count. In practice, the error is negligible.

Q9: What are the standard rate limiting headers?

Core Answer:

X-RateLimit-Limit:     Total requests allowed in the window
X-RateLimit-Remaining: Requests remaining in the current window
X-RateLimit-Reset:     Unix timestamp when the window resets
Retry-After:           Seconds to wait before retrying (only on 429)

Newer IETF draft standard (gaining adoption):

RateLimit-Limit:     Total requests
RateLimit-Remaining: Remaining requests
RateLimit-Reset:     Seconds until reset

Go Deeper:
Always return these headers on EVERY response (200, 201, etc.), not just on 429.
Clients use them to proactively throttle before hitting the limit. Not returning them
on success responses means clients only discover their usage after they exceed it.

Q10: How do you choose the rate limit key (what to rate limit by)?

Core Answer:
Choose the most specific stable identifier available, in this preference order:

API Key (for machine-to-machine APIs) - most specific
Authenticated User ID (for user-facing APIs) - survives IP changes
IP Address (for unauthenticated endpoints) - coarser, can be spoofed/NATted
IP + User-Agent (for unauthenticated with more context) - slightly better than IP alone
Global (for overall system protection) - last resort

Go Deeper:
Avoid IP-only for user-facing authenticated APIs. Mobile users change IPs constantly.
Corporate offices can have thousands of employees behind one NAT IP. Rate limiting by IP
on an authenticated endpoint means an entire company gets blocked if one employee hits
the limit.

Q11: How would you handle rate limiting in a microservices architecture?

Core Answer:
Multiple options, often used in combination:

API Gateway (Kong, AWS API Gateway): Rate limit at the entry point before requests
reach any microservice. Best for external clients.
Service Mesh (Istio/Envoy): Rate limit inter-service calls. Each service can protect
itself from other services.
Shared Redis: All microservices reference the same Redis rate limit counters. The
user's limit is consumed regardless of which service handles the request.
Sidecar proxy: Each service has an Envoy sidecar that handles rate limiting.

Go Deeper:
In microservices, you must propagate the rate limit context (user ID, current count,
remaining budget) via request headers from service to service. Otherwise, each service
applies its own independent limit, multiplying the total allowed throughput.

Q12: What is the thundering herd problem in rate limiting?

Core Answer:
When a rate limit window resets (e.g., at the top of every minute), all clients who
were blocked simultaneously discover they can make requests again. They all retry at
the same moment, creating a synchronized traffic spike that can overwhelm the system.

Solution: Add jitter (randomization) to:

The retry timing on the client side (exponential backoff + random delay)
The window boundaries on the server side (stagger reset times per user using a hash)

Go Deeper:
The "full jitter" strategy is: wait = random(0, min(cap, 2^attempt * base_delay)).
This spreads retries across a time range rather than all hitting at the same moment.
AWS wrote an excellent blog post on this: "Exponential Backoff and Jitter" (2015).

Q13: What is the difference between a hard limit and a soft limit?

Core Answer:

Hard limit: Requests over the limit are ALWAYS rejected (HTTP 429). No exceptions.
Soft limit: Requests over the soft limit but under a secondary threshold are allowed
with a warning header. Only requests exceeding the hard limit are rejected.

Example: Soft limit = 80 RPM, Hard limit = 100 RPM.

Requests 81-100: Allowed but response includes X-RateLimit-Warning: Approaching limit
Requests 101+: Rejected with 429

Go Deeper:
Soft limits improve user experience by warning before cutting off. Useful for developer
APIs where the developer needs time to optimize their code. Hard limits are necessary
for security (login brute force) and resource protection.

Q14: What is an idempotency key and why does it matter for rate limiting?

Core Answer:
An idempotency key is a unique identifier sent by the client with a request to signal
"this is the same logical operation, even if you receive it multiple times."

When a request is rate limited (429), the client should retry. Without idempotency keys,
a retry of a non-idempotent operation (like a payment) would double-charge. With an
idempotency key, the server can detect it is a retry and return the same result without
re-executing.

Go Deeper:
Stripe's idempotency key is stored for 24 hours. If you retry a payment within 24 hours
using the same key, Stripe returns the original result. This makes it safe to retry on
any error including 429 without risk of double-charging.

Q15: How do you implement rate limiting for unauthenticated endpoints?

Core Answer:
For unauthenticated endpoints, the common approach is IP-based rate limiting:

Extract real client IP from X-Forwarded-For (first IP in the chain)
Use as the rate limit key: rl:ip:{ip_address}:{window}
Set conservative limits (e.g., 30-60 RPM)
Return standard 429 headers

Complications to address:

NAT: Multiple users sharing one IP. Set limits high enough not to penalize shared IPs.
IPv6: Many unique IPs. Some attackers rotate IPv6 addresses. Consider limiting by /64 subnet.
Proxies: X-Forwarded-For can be spoofed. Trust only from known proxy IPs.

Q16: What headers should you send in a 200 OK response (not just 429)?

Core Answer:
Always send rate limit headers on EVERY response:

HTTP/1.1 200 OK
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1735689660

Never send Retry-After on a 200 (only on 429).

Why this matters: Without headers on success responses, clients have no way to
proactively throttle themselves. They will discover their limit only when they get a 429.
Good clients read remaining tokens on every response and slow down when it gets low.

Q17: What is outbound rate limiting and why is it often overlooked?

Core Answer:
Outbound rate limiting controls how many requests YOUR service makes to EXTERNAL services.
This is distinct from inbound rate limiting (protecting your own service).

Why it matters: If your code calls Stripe at 500 RPS and Stripe allows 100 RPS, Stripe
will rate limit you (429). Your API will then fail (or return errors) because it cannot
complete the Stripe calls it depends on.

Implementation: Use a token bucket (Bucket4j in Java, or a simple local counter) that
controls the rate of outgoing requests to each external dependency.

Go Deeper:
In microservices, every inter-service call should also have outbound rate limiting. Service
A calling Service B without limits means Service B must rely solely on its own inbound
protection. Defense in depth: both caller and callee should have limits.

Q18: Explain fail-open vs fail-closed for rate limiting.

Core Answer:
When the rate limiter (Redis) becomes unavailable, what should happen to incoming requests?

Fail-Open: Allow all requests. Service stays available. Rate limits are temporarily
unenforced. Risk: abuse window during outage.
Fail-Closed: Deny all requests. Service becomes unavailable. Rate limits enforced.
Risk: legitimate users blocked due to infrastructure failure.

Recommendation: Fail-open for most APIs with local fallback rate limiting.
Fail-closed for security-critical endpoints (authentication, payment).

Go Deeper:
Best practice is to implement a circuit breaker around the Redis rate limiter. When Redis
fails, switch to a local in-memory rate limiter as a fallback. This provides approximate
rate limiting (not globally accurate, but better than unlimited) without taking the service
down.

Q19: What is the difference between per-request rate limiting and per-connection rate limiting?

Core Answer:

Per-request: Counts each HTTP request. Standard for REST APIs.
Per-connection: Limits simultaneous open TCP connections per client.

Per-connection limits are used for:

WebSocket connections (long-lived, can't be request-counted)
File downloads (limit concurrent download slots)
Long-polling endpoints

Example (Nginx):

limit_req_zone $ip zone=req_limit:10m rate=100r/m;   # Per-request
limit_conn_zone $ip zone=conn_limit:10m;               # Per-connection
limit_conn conn_limit 10;                               # Max 10 concurrent connections

Q20: How do rate limits interact with caching?

Core Answer:
Cached responses pose a question: should they count against the rate limit?

Options:

Count all requests (cached or not): Simple. Fair in terms of HTTP semantics.
Exempt cached responses: More fair - the user is not actually costing you resources.
Partial credit: Cached responses cost 0 tokens; fresh responses cost 1 token.

Recommendation: For proxy/CDN-level caching (like Cloudflare), exempt cache hits from
rate limits because they never reach your origin. For application-level caching (Redis cache
of DB results), it is simpler to count all requests because the cost of checking the cache
is non-zero.

Section 2: Algorithm Questions (Q21-Q35)

Q21: What algorithm does Stripe use for rate limiting?

Answer: GCRA (Generic Cell Rate Algorithm), via their open-source "Throttled" library.
GCRA uses a single timestamp per user (Theoretical Arrival Time - TAT) and provides
equivalent behavior to leaky bucket with minimum memory overhead.

Q22: What is the maximum error rate of the sliding window counter?

Answer: The sliding window counter (hybrid) has a maximum error of:

error <= (1 - overlap) * prev_count

In the worst case (at window transition with overlap near 0), the error approaches
1/limit (less than 0.1% for limit >= 1000). In practice, the error averages much lower.
This is why it is acceptable for most production APIs.

Q23: How much memory does each algorithm use per user?

Answer:

Algorithm	Memory Per User
Fixed Window Counter	O(1) - one integer
Sliding Window Log	O(limit) - up to `limit` timestamps
Sliding Window Counter	O(1) - two integers + one float
Token Bucket	O(1) - one float (tokens) + one float (last_refill)
Leaky Bucket (queue)	O(capacity) - up to capacity queued requests
GCRA	O(1) - one float (TAT)

Q24: Why must a Redis token bucket be implemented with a Lua script?

Answer: Without Lua, the token bucket requires multiple Redis commands:

HMGET (read current tokens and last_refill)
Calculate new_tokens (client-side)
HMSET (write new tokens)

Between steps 1 and 3, another request can read the same token count and both be allowed.
Lua scripts execute atomically on the Redis server - no other command can run between
statements. This eliminates the race condition entirely.

Q25: What is the boundary problem in fixed window rate limiting?

(Already covered in Q7 - if asked again, deliver the full answer with the 2-window example.)

Q26: How does GCRA calculate Theoretical Arrival Time (TAT)?

Answer:

emission_interval = 1 / rate  (time between perfectly spaced requests)
new_TAT = max(now, previous_TAT) + emission_interval

Request is allowed if: new_TAT <= now + burst_tolerance
where: burst_tolerance = burst_size * emission_interval

For rate=10 RPS and burst=5:

emission_interval = 100ms
burst_tolerance = 500ms
Up to 5 requests can arrive "early" (before their scheduled time)
6th rapid request: TAT=600ms, allowed_at=100ms, 100 > 0+500? No -> rejected

Q27: When would you use a leaky bucket instead of a token bucket?

Answer: Use leaky bucket when the OUTPUT rate must be perfectly smooth, not just
the average. Examples:

Your service calls a fragile downstream that can handle exactly 100 QPS (no spikes)
You are shaping traffic for network QoS
You are protecting a rate-limited payment processor from micro-bursts

Use token bucket when users should be allowed to burst occasionally (most user-facing APIs).

Q28: How do sorted sets (ZADD/ZREMRANGEBYSCORE) implement the sliding window log?

Answer:

ZADD key timestamp_as_score unique_request_id
ZREMRANGEBYSCORE key 0 (now - window_size)  -- remove old entries
ZCARD key                                    -- count current entries
EXPIRE key (window_size * 2)                -- auto-cleanup

Each request gets a unique member (UUID) with the current timestamp as its score.
Removing by score range evicts old entries. ZCARD gives the current count.

Memory: O(limit) per user (each entry stores ~50-100 bytes in sorted set).

Q29: What is the difference between tumbling window and sliding window?

Answer:

Tumbling (Fixed) Window: Non-overlapping, discrete time blocks. Resets at fixed intervals.
Example: "requests in minute 12:00-12:01". At 12:01, counter resets to 0.
Sliding Window: Continuous, overlapping window. Always looks back exactly N seconds.
Example: "requests in the last 60 seconds" from the current moment.

Tumbling windows are simpler but have the boundary problem. Sliding windows are more
accurate but more complex/memory-intensive.

Q30: How does Nginx implement rate limiting internally?

Answer: Nginx uses the Token Bucket algorithm for its limit_req module. The rate
parameter sets the refill rate. The burst parameter sets the bucket capacity. The
nodelay flag controls whether burst requests are processed immediately or spread over time.

limit_req_zone $ip zone=api:10m rate=10r/s;
location /api/ {
    limit_req zone=api burst=20 nodelay;
    # burst=20: up to 20 extra requests held
    # nodelay: process burst immediately (not spread over 2 seconds)
}

Q31: How do you implement rate limiting for WebSocket connections?

Answer: WebSocket connections are long-lived, so per-request counting doesn't apply
the same way. Options:

Connection-level limit: Max N concurrent WebSocket connections per user (limit_conn)
Message-level limit: Count messages sent over the WebSocket, apply token bucket
Handshake-level limit: Rate limit the initial HTTP upgrade request (before WS opens)

# Message-level rate limiting over WebSocket
class WebSocketRateLimiter:
    def __init__(self, messages_per_second: int):
        self.bucket = TokenBucket(
            capacity=messages_per_second * 10,  # 10 second burst
            refill_rate=messages_per_second
        )
 
    async def handle_message(self, user_id: str, message: str) -> bool:
        if not self.bucket.consume(user_id):
            await send_error(user_id, "Message rate limit exceeded")
            return False
        await process_message(user_id, message)
        return True

Q32: How would you rate limit GraphQL queries?

Answer: GraphQL has two specific challenges:

A single query can fetch massive amounts of data (N+1 problem)
Query depth can make a "single request" extremely expensive

Solutions:

Query cost analysis: Calculate query complexity points, deduct from token budget
Depth limiting: Reject queries deeper than N levels
Timeout-based limiting: Queries running longer than Xs are killed
Per-query field counting: Count resolved fields, not just queries

GitHub's approach: each query has a cost based on the number of connections requested.
Budget: 5,000 points/hour. The X-RateLimit-Cost header tells you what each query cost.

Q33: What is the difference between rate limiting and quota?

Answer:

Rate Limit: Controls the SPEED of requests (requests per unit of time)
Example: "100 requests per minute"
Quota: Controls the TOTAL VOLUME over a longer period
Example: "10,000 requests per month"

They are complementary:

Rate limit prevents short-term floods
Quota prevents accumulated over-usage

A free tier might have: rate limit = 30 RPM AND quota = 1,000/day AND quota = 10,000/month.
A request must pass ALL constraints.

Q34: How do you handle rate limiting for bulk operations?

Answer: Bulk operations (where one API call processes N items) should cost N tokens,
not 1 token.

@app.post("/api/bulk/users")
async def create_users_bulk(request: Request, users: list[UserCreate]):
    # Cost = number of users to create
    cost = len(users)
    result = rate_limiter.is_allowed(get_user_id(request), cost=cost)
    if not result["allowed"]:
        return 429, {"error": "Rate limit exceeded", "cost": cost}
    # Create users...

This prevents users from circumventing rate limits by batching requests.

Q35: What is GCRA and how does it compare to token bucket?

Answer:
GCRA (Generic Cell Rate Algorithm) is equivalent to token bucket but uses a single
stored value (TAT = Theoretical Arrival Time) instead of two (tokens + last_refill).

Token Bucket:          GCRA:
tokens: 7              TAT: 300ms
last_refill: 100ms     (TAT means "next request should arrive at 300ms to be on time")

Mathematically equivalent. GCRA uses 50% less storage per user. Less intuitive to
explain to stakeholders. Used by Stripe (via Throttled library).

Section 3: System Design Questions (Q36-Q50)

Q36: Design a rate limiter for a public REST API with 1 million users.

Answer Framework (use this structure in interviews):

Step 1: Clarify Requirements

Single region or multi-region?
Limits: per-user, per-IP, or both?
Algorithm preference or let me choose?
Latency budget: how much overhead is acceptable?
Consistency: exact or approximate?
Scale: 1M users, peak QPS?

Step 2: Component Choices

Algorithm: Sliding window counter (accuracy + O(1) memory per user)
Storage: Redis Cluster (shared state, horizontal scale, fast)
Implementation: Lua script for atomicity

Step 3: Data Model

Key: "rl:sw:{user_id}:{current_window_id}"
Key: "rl:sw:{user_id}:{previous_window_id}"
TTL: window_size * 2

Memory per user: 2 keys x ~70 bytes = ~140 bytes
Total for 1M users: 140MB (very manageable)

Step 4: Request Flow

Request -> API Gateway -> Extract user_id from JWT
        -> Execute Lua script in Redis
        -> If allowed: route to backend
        -> If denied: return 429 with headers

Step 5: Header Response
Always return X-RateLimit-Limit, Remaining, Reset on every response.

Step 6: Failure Handling
Redis sentinel for HA. Circuit breaker with local fallback if Redis fails.
Policy: fail-open (allow requests) when Redis is unreachable.

Scale Calculations:
1M users x average 10 RPS = 10M RPS at peak.
Redis can handle 100K-1M operations/second.
Need Redis Cluster with 10-100 nodes for this scale.
Consider pipelining to reduce round trips.

Q37: Design a distributed rate limiter for a system running on 100 servers.

Answer:
Key insight: in-memory counters per server won't work (each server only sees 1% of traffic).
You need shared state.

Design:

Centralized Redis Cluster as shared rate limit state
All 100 application servers connect to same Redis Cluster
Use hash tags {user_id} to ensure all keys for a user land on the same Redis shard
Use Lua scripts for atomic operations
Use Redis Sentinel or Cluster for high availability

Optimization: Hybrid Local + Global
At 100 servers x 10,000 RPS each = 1M RPS total:

Pure centralized: 1M Redis ops/second (feasible with cluster)
Hybrid: Each server reserves N/num_servers of the limit locally
- 90% of requests checked locally (no Redis call)
- 10% overflow checked globally in Redis
- Reduces Redis load by 90%

Tradeoff: Hybrid is slightly less accurate (could allow up to 2x limit briefly during
server startup/sync). Acceptable for general APIs.

Q38: How would you design rate limiting for a payment API?

Answer:
Payment APIs require stricter guarantees than general APIs:

Exact accuracy required: Use sliding window log (not sliding window counter).
The small error margin of sliding window counter is unacceptable for payments.
Lua script mandatory: All Redis operations must be atomic. No approximations.
Fail-closed for security: If Redis is unavailable, deny payment requests.
Never allow unlimited payments due to a rate limiter failure.
Idempotency keys: Clients must send idempotency keys for safe retries.
Store idempotency key results in Redis for 24 hours.
Per-operation limits: Different limits for different operations:
- GET /charges: 1,000/min (cheap read)
- POST /charges: 25/sec (expensive, fraud-risk)
- POST /refunds: 10/min (very sensitive)
Multi-dimensional limits:
- Per API key: 25 charges/second
- Per customer being charged: 3/minute (prevent charging same customer repeatedly)
- Global: 10,000 charges/second system-wide

Q39: How would you implement rate limiting in a serverless architecture (AWS Lambda)?

Answer:
Serverless creates unique challenges: no persistent memory between invocations.

Options:

API Gateway rate limiting: AWS API Gateway has built-in rate limiting (usage plans).
This is the first line of defense and doesn't require any Lambda code.
Redis (ElastiCache): Lambda functions connect to ElastiCache Redis for shared state.
Each invocation checks Redis. Watch out for connection pool exhaustion (Lambda can
spawn many instances, each needing a connection).
DynamoDB rate limiting: Use conditional writes for atomic counters.
```
UpdateItem with condition: count < limit
Increment count if condition passes, return result
```
Higher latency than Redis (~5-15ms vs ~0.5-2ms).
Lambda Power Tools Rate Limiter: AWS Lambda Powertools (Python/Java) includes
utilities that use DynamoDB or ElastiCache.

Connection Pool Issue in Lambda:
Lambda can have 1,000+ concurrent instances. Each holding a Redis connection = 1,000+
connections. Redis default max = 10,000 connections.
Solution: Use Redis connection pooling with low max_connections per Lambda, or use
ElastiCache Serverless which handles connection pooling automatically.

Q40: Design rate limiting for a streaming service like Netflix or YouTube.

Answer:

Streaming is special: a "request" is not a short HTTP call - it is an ongoing stream.

Key metrics to limit:

Concurrent streams per user: Limit to N simultaneous streams (Netflix: 1-4 based on plan)
Bandwidth per user: Limit megabits/second per user
Stream requests per minute: Limit how often a user can start new streams

Design:

Concurrent stream tracking:

When stream starts:
  INCR "streams:{user_id}" (atomic)
  If count > tier_limit: deny, DECR (undo)
  Else: allow, set TTL = session_duration + buffer

When stream ends:
  DECR "streams:{user_id}"

TTL handles crashed clients: key expires even if DECR never called

Bandwidth limiting: Implemented at the CDN/edge layer using token bucket per user.
Each chunk download costs tokens proportional to chunk size.

Q41: How would you handle rate limiting when users are behind a NAT?

Answer:
NAT (Network Address Translation) means thousands of corporate employees can share one
public IP. IP-based rate limits then unfairly block everyone when one user hits the limit.

Solutions:

Primary: Authenticated rate limiting. Require login. Rate limit by user ID, not IP.
This completely bypasses the NAT problem.
Unauthenticated endpoints: Higher IP limits. Set the IP limit higher (e.g., 1,000/min)
to accommodate shared IPs. Combine with global system protection.
IPv6 subnet limiting. Instead of per-/128 (individual IPv6), limit per-/64 subnet.
Each organization typically has a /48 or /64.
User-Agent + IP. More granular than IP alone, still not perfect (bots fake UA).
Progressive enhancement. Allow higher limits for authenticated users, lower for anonymous.
Incentivizes login which enables per-user limiting.

Q42: How do you handle clock skew in a distributed rate limiter?

Answer:
Fixed window rate limiting is time-based. If Server A's clock reads 12:00:00.000 and
Server B reads 12:00:00.800, they disagree on which window a request belongs to.

Solutions:

Use Redis SERVER time, not application server time.

-- In Lua script, use redis.call('TIME') to get Redis's server time
-- This ensures all rate limit checks use the same clock
local redis_time = redis.call('TIME')
local now = redis_time[1] + redis_time[2] / 1e6

NTP synchronization. Ensure all servers sync from the same NTP stratum.
Modern cloud environments (AWS, GCP) have sub-millisecond NTP drift.
Pass time as a parameter from the client.
Each application server passes its local time as an argument to the Lua script.
For fixed window, the window ID is floor(client_time / window_size).
Small clock skew affects which window a boundary-case request goes into, but the
impact is at most 1 extra request per window transition per drifting server.

Interview tip: Acknowledge that clock skew exists, explain it is a second-order
problem for most APIs (the impact is at most 1 request misattributed), and explain
the Redis time solution for critical cases.

Q43: How would you design a rate limiter that is also an API monetization engine?

Answer:
Rate limiting + quotas are the technical backbone of API monetization.

Design:

Subscription tiers stored in database: Each user has a tier (free/starter/pro).

Tier config cached in Redis or service:

rl:config:{user_id} -> {tier: "pro", rpm: 1000, rpd: 100000} (TTL: 300s)

Multiple limit enforcement:
- Per-second burst limit (token bucket)
- Per-minute limit (sliding window counter)
- Per-day quota (fixed window)
- Per-month quota (fixed window with monthly reset)
Quota exhaustion response:
Different from rate limiting - quota exhaustion means the user has no more requests
this billing period. Return 402 Payment Required or 429 with a longer Retry-After
(time until next billing period).
Soft quota warnings: At 80%, 90%, 95% usage, email the user and return a warning header.
Overage billing: Instead of hard quota cutoff, allow overage at a per-request price.

Section 4: Coding Questions (Q51-Q58)

Q51: Implement a token bucket rate limiter in Python.

import time
import threading
 
 
class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        """
        capacity: max tokens (max burst size)
        refill_rate: tokens added per second
        """
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last_refill = time.time()
        self.lock = threading.Lock()
 
    def _refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.refill_rate
        )
        self.last_refill = now
 
    def consume(self, tokens: float = 1.0) -> bool:
        with self.lock:
            self._refill()
            if self.tokens >= tokens:
                self.tokens -= tokens
                return True
            return False

Common follow-up: "Make this distributed using Redis."
-> Use the Lua script from Part 3 (token_bucket.lua).

Q52: Implement a sliding window counter.

import time
import threading
from collections import defaultdict
 
 
class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.curr = defaultdict(int)    # current window count
        self.prev = defaultdict(int)    # previous window count
        self.curr_start = defaultdict(float)
        self.lock = threading.Lock()
 
    def is_allowed(self, key: str) -> bool:
        with self.lock:
            now = time.time()
            if key not in self.curr_start:
                self.curr_start[key] = now
 
            elapsed = now - self.curr_start[key]
            if elapsed >= self.window:
                self.prev[key] = self.curr[key]
                self.curr[key] = 0
                self.curr_start[key] = now
                elapsed = 0
 
            weight = (self.window - elapsed) / self.window
            estimate = self.prev[key] * weight + self.curr[key]
 
            if estimate < self.limit:
                self.curr[key] += 1
                return True
            return False

Q53: Implement a fixed window rate limiter using Redis in Python.

import redis
import time
 
 
def is_rate_limited(r: redis.Redis, user_id: str, limit: int, window: int) -> bool:
    now = int(time.time())
    window_id = now // window
    key = f"rl:{user_id}:{window_id}"
 
    count = r.incr(key)
    if count == 1:
        r.expire(key, window * 2)
 
    return count > limit  # True means rate limited
 
 
# Usage
r = redis.Redis(decode_responses=True)
for i in range(105):
    if is_rate_limited(r, "user_123", limit=100, window=60):
        print(f"Request {i+1}: DENIED")
    else:
        print(f"Request {i+1}: ALLOWED")

Q54: Write a Redis Lua script for atomic sliding window rate limiting.

-- Keys: [1] = rate limit key
-- Args: [1] = limit, [2] = window seconds, [3] = current timestamp
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local window_id = math.floor(now / window)
local prev_key = key .. ":" .. (window_id - 1)
local curr_key = key .. ":" .. window_id
local prev = tonumber(redis.call("GET", prev_key) or 0)
local curr = tonumber(redis.call("GET", curr_key) or 0)
local elapsed_frac = (now % window) / window
local estimate = prev * (1 - elapsed_frac) + curr
if estimate < limit then
    redis.call("INCR", curr_key)
    redis.call("EXPIRE", curr_key, window * 2)
    return {1, math.floor(estimate) + 1}
end
return {0, math.floor(estimate)}

Q55: Implement exponential backoff with full jitter.

import time
import random
import requests
 
 
def call_with_retry(
    url: str,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0
) -> requests.Response:
 
    for attempt in range(max_retries + 1):
        response = requests.get(url)
 
        if response.status_code != 429:
            return response  # Success or non-retryable error
 
        if attempt == max_retries:
            response.raise_for_status()
 
        # Check Retry-After header first
        retry_after = response.headers.get("Retry-After")
        if retry_after:
            wait = float(retry_after)
        else:
            # Full jitter: random between 0 and 2^attempt * base_delay
            cap = min(max_delay, (2 ** attempt) * base_delay)
            wait = random.uniform(0, cap)
 
        print(f"Rate limited. Attempt {attempt + 1}. Waiting {wait:.2f}s.")
        time.sleep(wait)
 
    raise Exception("Max retries exceeded")

Q56: Implement a concurrent-safe rate limiter test that validates no race condition.

import threading
import time
import pytest
 
 
def test_no_race_condition():
    limiter = TokenBucket(capacity=10, refill_rate=0)  # No refill
    allowed_count = [0]
    lock = threading.Lock()
 
    def make_request():
        if limiter.consume():
            with lock:
                allowed_count[0] += 1
 
    threads = [threading.Thread(target=make_request) for _ in range(50)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
 
    assert allowed_count[0] == 10, (
        f"Expected exactly 10 allowed, got {allowed_count[0]}"
    )

Q57: Implement a rate limiter that supports multiple time windows simultaneously.

import redis
import time
 
 
class MultiWindowRateLimiter:
    """
    Enforces multiple rate limits at different granularities.
    All limits must pass for the request to be allowed.
    """
 
    SCRIPT = """
for i = 1, #KEYS do
    local limit = tonumber(ARGV[(i-1)*2 + 1])
    local window = tonumber(ARGV[(i-1)*2 + 2])
    local now = tonumber(ARGV[#ARGV])
    local win_id = math.floor(now / window)
    local k = KEYS[i] .. ':' .. win_id
    local c = redis.call('INCR', k)
    if c == 1 then redis.call('EXPIRE', k, window * 2) end
    if c > limit then
        -- Undo the increment for this key and return denied
        redis.call('DECR', k)
        return {0, i, limit, c - 1}
    end
end
return {1, 0, 0, 0}
"""
 
    def __init__(self, r: redis.Redis, windows: list[tuple[int, int]]):
        """windows: list of (limit, window_seconds)"""
        self.r = r
        self.windows = windows
        self._script = r.register_script(self.SCRIPT)
 
    def is_allowed(self, identifier: str) -> bool:
        now = int(time.time())
        keys = [f"rl:{identifier}:{w}" for _, w in self.windows]
        args = []
        for limit, window in self.windows:
            args.extend([limit, window])
        args.append(now)
 
        result = self._script(keys=keys, args=args)
        return bool(int(result[0]))
 
 
# Usage
r = redis.Redis(decode_responses=True)
limiter = MultiWindowRateLimiter(r, windows=[
    (10, 1),       # 10 per second
    (100, 60),     # 100 per minute
    (5000, 3600),  # 5000 per hour
])

Q58: How would you test that a rate limiter handles Redis failure gracefully?

import pytest
from unittest.mock import patch, MagicMock
import redis
 
 
def test_fail_open_on_redis_timeout():
    mock_redis = MagicMock()
    mock_redis.execute_script.side_effect = redis.TimeoutError("Redis timeout")
 
    limiter = ResilientRateLimiter(
        redis_limiter=RedisRateLimiter(mock_redis, limit=10, window=60),
        local_limiter=LocalRateLimiter(limit=10, window=60),
        policy="fail-open"
    )
 
    # Should NOT raise an exception even if Redis is down
    result = limiter.is_allowed("user123")
    assert result["allowed"] is True  # Fail open: allow the request
    assert result["fallback"] is True  # Indicate we're in fallback mode
 
 
def test_fail_closed_on_redis_timeout_for_payments():
    mock_redis = MagicMock()
    mock_redis.execute_script.side_effect = redis.TimeoutError("Redis timeout")
 
    limiter = ResilientRateLimiter(
        redis_limiter=RedisRateLimiter(mock_redis, limit=10, window=60),
        local_limiter=None,
        policy="fail-closed"
    )
 
    result = limiter.is_allowed("user123")
    assert result["allowed"] is False  # Fail closed: deny the request
    assert result["reason"] == "rate_limiter_unavailable"

Section 5: Tricky and Advanced Questions (Q59-Q80)

Q59: A user has a 100 RPM limit. They send 50 requests at 11:59:30 and 80 at 12:00:10. How many are allowed (fixed window)? How many (sliding window counter)?

Answer:

Fixed Window:

11:59:30 (window 11:59:00-12:00:00): 50 requests -> all 50 allowed (50 < 100)
12:00:10 (window 12:00:00-12:01:00): 80 requests -> all 80 allowed (80 < 100)
Total allowed: 130 in about 40 seconds! (2x the intended limit)

Sliding Window Counter:
At 12:00:10 (10 seconds into the current minute):

elapsed_fraction = 10/60 = 0.167
prev_weight = 1 - 0.167 = 0.833
estimate = 50 * 0.833 + curr_count_before_this_batch
estimate = 41.65 + 0 = 41.65
Remaining budget = 100 - 41.65 = 58.35, so only 58 of the 80 are allowed

Result: Sliding window counter allows 50 + 58 = 108 requests.
Fixed window allows 130. Sliding window is significantly more accurate.

Q60: How do you rate limit a GraphQL API where the same query can cost 1 to 10,000 points?

Answer:

Parse the query before execution to calculate complexity
Use cost-based token bucket where each request deducts cost tokens (not 1 token)
If query cost > available tokens: reject with 429 before executing
Include X-RateLimit-Cost: {cost} header in every response

Follow-up: "What if the query complexity can only be known after execution?"
Answer: For truly dynamic queries, use a timeout-based approach. Set a maximum execution
time (e.g., 5 seconds). Track actual execution time per user. Users who consistently run
expensive queries get their time budget reduced. This is called "CPU-second" based limiting.

Q61: How would you prevent a DDoS attack that rotates through millions of unique API keys?

Answer:
API key rotation attacks create keys faster than you can block them. Per-key rate limiting
alone fails because each new key starts fresh.

Multi-layer defense:

IP-level blocking at CDN/edge (Cloudflare/Akamai): Block IPs before they reach your origin
IP-based rate limiting in addition to key-based: Even with new keys, same IP is limited
Velocity detection: Flag accounts that create many API keys rapidly (>5 keys/minute = suspicious)
CAPTCHA for key creation: Require CAPTCHA or email verification for new API keys
Device fingerprinting: Use browser fingerprints for web clients, TLS fingerprints for non-web
Anomaly detection: ML-based detection of unusual patterns across all keys from same IP

Q62: Your rate limiter uses Redis. Redis experiences a split-brain scenario. What happens?

Answer:
In Redis split-brain (during network partition), you have two Redis nodes that both think
they are the primary. Both accept writes. Rate limit counters diverge.

Consequences:

Users may be allowed 2x their limit (each partition sees only its half of requests)
When the partition heals and nodes reconcile, there is no automatic conflict resolution
(Redis uses Last-Write-Wins)

Mitigations:

Redis Cluster with majority quorum: Cluster requires majority of primary nodes to
agree before accepting writes. A partitioned minority refuses writes.
Redis Sentinel: Automatically fails over to a single primary when the existing one fails.
Prevents split-brain if sentinel quorum is properly configured.
Accept the inconsistency: For general APIs, a brief window of 2x traffic during
a Redis partition is acceptable. Log it and alert, but don't crash the service.
Fail-closed during partition detection: If a client cannot reach quorum, fail-closed.

Q63: How does Cloudflare rate limit at 1 million requests per second with sub-millisecond latency?

Answer:
This is an architectural question about scale. Cloudflare's approach:

Data plane rate limiting (not control plane): Rate limit checks happen in the network
data path (Nginx/Rust workers), before any application code runs.
Local approximate counting with periodic sync: Each PoP (Point of Presence) maintains
its own local counter. Counters are synchronized globally every few seconds.
Sliding window counter (O(1) per user): Extremely memory-efficient. No sorted sets,
no logs.
Hardware-accelerated networking: XDP/eBPF for packet-level rate limiting, bypassing the
kernel network stack entirely.
Geographically distributed enforcement: Each region enforces limits locally. A user
trying to DDoS from multiple regions hits regional limits in each region.
Custom Rust implementation: Not off-the-shelf Redis. Custom in-memory data structures
optimized for the specific access patterns.

Key insight: At Cloudflare scale, Redis is too slow. The rate limiter must be in-process
(or at most same-machine) to achieve sub-millisecond decisions.

Q64: A microservice has a rate limiter. How do you ensure the rate limit counts across retries from an upstream service?

Answer:
If Service A retries a call to Service B 3 times, Service B sees 3 requests. Service B's
rate limiter counts all 3 against Service A's quota. This is correct behavior.

To make retries not count: Use idempotency keys:

Service A generates a unique idempotency key for the original operation
Service A sends the same key on each retry
Service B rate limits idempotency keys, not individual requests
Same idempotency key = same logical request = counted only once

Redis implementation:

SETNX "idem:{key}" 1 EX 300   -- SET if Not eXists, expire in 5 min
If SET succeeded: it's a new request, rate limit and process
If SET failed: it's a retry, return the cached result (idempotent)

Q65: How would you implement rate limiting for a service that has both human users and machine clients with different behavior patterns?

Answer:
Human and machine clients have fundamentally different traffic patterns:

Humans: bursty, irregular, low sustained rate (interactions driven by UI)
Machines: regular, high-frequency, predictable

Design:

Classify clients: API key type flag (human-facing vs service-to-service)
Different algorithm per type:
- Humans: Token bucket (allows burst, forgiving of irregular patterns)
- Machines: Leaky bucket or strict sliding window (smooth, predictable)
Different limits per type:
- Human session: 1,000 RPH, burst of 50
- Machine service: 10,000 RPM, no burst allowed (smooth output to downstream)

Q66: What are the rate limiting considerations for gRPC services?

Answer:
gRPC is different from REST:

Streaming calls: A single gRPC call can be a long-lived stream with many messages.
Rate limit on messages, not just connections.
Multiplexed connections: Multiple gRPC calls over a single HTTP/2 connection.
Connection-level limits don't map to request limits.
Service discovery: gRPC clients maintain persistent connections to multiple servers.
Sticky sessions break gRPC's load balancing.

Approach:

Use Envoy sidecar for gRPC rate limiting (native gRPC support)
Rate limit at the gRPC interceptor level (server-side unary and streaming interceptors)
Use metadata headers for rate limit context

// gRPC server interceptor for rate limiting
public class RateLimitInterceptor implements ServerInterceptor {
    @Override
    public <Q, R> ServerCall.Listener<Q> interceptCall(
        ServerCall<Q, R> call,
        Metadata headers,
        ServerCallHandler<Q, R> next
    ) {
        String userId = headers.get(USER_ID_KEY);
        if (!rateLimiter.isAllowed(userId)) {
            call.close(Status.RESOURCE_EXHAUSTED
                .withDescription("Rate limit exceeded"), new Metadata());
            return new ServerCall.Listener<>() {};
        }
        return next.startCall(call, headers);
    }
}

Q67: How would you implement "shadow mode" for a new rate limit policy?

Answer:
Shadow mode (also called "dry run" or "observability mode") runs the new rate limit
policy but does not enforce it. Violations are logged, not acted upon.

class ShadowRateLimiter:
    def __init__(self, primary_limiter, shadow_limiter, shadow_only: bool = True):
        self.primary = primary_limiter
        self.shadow = shadow_limiter
        self.shadow_only = shadow_only  # True = shadow mode, False = enforce
 
    def is_allowed(self, identifier: str) -> bool:
        primary_result = self.primary.is_allowed(identifier)
        shadow_result = self.shadow.is_allowed(identifier)
 
        # Log when shadow and primary disagree
        if primary_result != shadow_result:
            logger.info(
                "shadow_rate_limit_disagreement",
                identifier=identifier,
                current_policy=primary_result,
                new_policy=shadow_result
            )
 
        if self.shadow_only:
            return primary_result  # Enforce current policy only
        else:
            return shadow_result   # Switch to new policy

Workflow:

Deploy new rate limit policy in shadow mode
Monitor logs for disagreements (would-have-been-denied vs allowed)
Analyze impact: how many users would be affected?
Tune the new policy based on data
Flip shadow_only = False to start enforcing new policy

Q68: How does Twitter's API rate limiting differ between v1.1 and v2?

Answer:
Twitter v1.1 (legacy):

Free, unlimited API access (before 2023)
Per-application rate limits per 15-minute windows
No financial tiers

Twitter v2 (current, post-2023 restructuring):

Free: 1,500 Tweets/month write, extremely limited read (essentially read-locked)
Basic ($100/month): 10,000 read requests/month
Pro ($5,000/month): 1,000,000 reads/month
Enterprise: Custom pricing

Key changes:

Moved from count-based to cost-based (each tweet "costs" points)
Separate tiers for read vs write
Significantly reduced free tier (from unlimited to almost nothing)
This change in 2023 broke thousands of apps that depended on free API access

Interview lesson: API rate limit policies are business decisions, not technical ones.
The technical implementation must support dynamic policy changes without re-deployment.

Q69: How do you handle rate limiting for webhooks that your service sends to customer endpoints?

Answer:
Webhooks are OUTBOUND from your service to your customers' endpoints. You are the sender,
not the receiver. But you still need rate limiting:

Per-customer endpoint: Limit webhooks to each customer's endpoint
(e.g., 100/minute). Their server may not handle more.
Retry with backoff: If their endpoint returns 429 or 5xx, retry with exponential backoff.
Dead letter queue: If webhooks keep failing after N retries, move to dead letter queue.
Alert the customer.
Circuit breaker per endpoint: If an endpoint fails >50% of calls for 5 minutes, open
the circuit. Stop sending for 10 minutes. Test with probe. This prevents endlessly hammering
a dead customer endpoint.
Fan-out limiting: If one event triggers webhooks to 10,000 customers, rate limit the
fan-out to avoid creating a self-inflicted DDoS on yourself (outbound HTTP connections).

Q70: Design rate limiting for an AI/LLM API (like OpenAI) where different models have different costs.

Answer:
LLM APIs have unique characteristics:

Request processing time varies from 100ms to 60+ seconds
Cost is proportional to token count (input + output)
Output tokens are unknown until generation completes
Different models have vastly different per-token costs

Design:

Pre-deduct estimated tokens: When request arrives, deduct estimated tokens from bucket.

Estimated input tokens = actual (known upfront)
Estimated output tokens = max_tokens parameter
Pre-deduct: input_tokens + max_output_tokens

Reconcile on completion: After generation, refund unused tokens.

Actual output tokens = 150 (max was 500)
Refund 350 tokens: bucket += 350

Model-specific costs:
```
gpt-4: 1 token = 10 credit units
gpt-3.5-turbo: 1 token = 1 credit unit
```
Users have a "credit" budget. Different models consume at different rates.
Concurrent request limits: Limit parallel requests per user (e.g., 5 concurrent
long-running generation requests) to prevent queue flooding.
Hard token limit per request: Reject requests with max_tokens above the user's
tier limit to prevent single requests consuming entire budgets.

Q71: What is the "retry storm" problem and how do rate limiting design prevents it?

Answer:
When a large number of clients get rate limited (429) simultaneously and all retry after
the same interval, they create a synchronized burst that overwhelms the server all over
again. This repeats cyclically. This is the retry storm (a variant of the thundering herd).

How it happens:

1,000 clients all hit the rate limit at 12:00:00
All get Retry-After: 60
At 12:01:00, all 1,000 retry simultaneously
System is immediately overwhelmed again
Cycle repeats

Prevention in rate limiter design:

Jittered Retry-After: Return a Retry-After with slight randomization.

base_retry = window_reset_time - now
jitter = random.uniform(0, base_retry * 0.1)  # up to 10% random
retry_after = int(base_retry + jitter)

Queue with virtual waiting time: Instead of Retry-After=60 for everyone, give each
client a different Retry-After based on their position in a virtual queue.
Staggered window resets: Use user-ID-based window offsets so not all users reset
at the same second.

Q72: How do you rate limit by "user intent" or "action type" rather than just request count?

Answer:
Traditional rate limiting counts HTTP requests. But a user can accomplish the same goal
with very different request counts depending on their efficiency. Intent-based limiting
focuses on the action, not the transport.

Example: An e-commerce site

User A: adds 10 items to cart (10 API calls, 10 requests counted)
User B: uses the bulk add endpoint (1 API call, 1 request counted)
Both performed the same logical "action" but A counts as 10x

Intent-based approach:

RATE_LIMIT_ACTIONS = {
    "add_to_cart":         (10, 60),   # 10 add-to-cart actions per minute
    "checkout_attempt":    (3, 300),   # 3 checkout attempts per 5 minutes
    "password_reset":      (3, 3600),  # 3 password resets per hour
    "send_message":        (100, 60),  # 100 messages per minute
    "api_data_read":       (1000, 60), # 1000 data reads per minute
}
 
@app.post("/api/cart/items")
async def add_to_cart(request: Request, item: CartItem):
    # Rate limit on the INTENT (add_to_cart) not the HTTP method
    if not action_rate_limiter.is_allowed(user_id, "add_to_cart"):
        return 429, {"error": "Too many cart additions"}
    # process...

Q73: You have 10 Redis nodes in a cluster. A user's rate limit key hashes to node 3. Node 3 is slow (high latency). How do you handle this?

Answer:
This is a "hot shard" problem in Redis Cluster.

Immediate mitigation:

Timeout and fail-open: Set Redis timeout to 10ms. If node 3 takes >10ms, fail-open
(allow the request) and log the slow operation. This maintains service availability.
Read from replica: Configure Redis to allow stale reads from replicas.
Rate limit reads from replicas, writes to primary. Accepts minor inconsistency.

Root cause solutions:

Investigate node 3: Is it CPU-bound? Memory pressure causing evictions? Network issue?
Rebalance cluster: Migrate some hash slots from node 3 to other nodes.
Virtual nodes: Redis Cluster uses 16,384 hash slots. Redistribute more evenly.
Per-request timeout: Each rate limit check has a 5ms deadline. Slow node gets skipped.

Q74: How would you implement "penalty box" rate limiting where abusive users get progressively stricter limits?

Answer:
Penalty box is an adaptive system where repeated violations trigger progressively
stricter limits for repeat offenders.

import redis
import time
 
 
class PenaltyBoxRateLimiter:
    """
    Progressive rate limiting:
    - Normal behavior: standard limits
    - 1st violation: 50% reduction, 1-hour penalty
    - 2nd violation: 75% reduction, 24-hour penalty
    - 3rd violation: 90% reduction, 7-day penalty
    - Persistent violations: account suspension
    """
 
    PENALTY_LEVELS = [
        {"reduction": 0.5, "duration": 3600},      # 1st: 50% for 1 hour
        {"reduction": 0.25, "duration": 86400},    # 2nd: 25% for 1 day
        {"reduction": 0.1, "duration": 604800},    # 3rd: 10% for 1 week
        {"reduction": 0.0, "duration": 2592000},   # 4th: blocked for 30 days
    ]
 
    def __init__(self, r: redis.Redis, base_limit: int, window: int):
        self.r = r
        self.base_limit = base_limit
        self.window = window
 
    def get_effective_limit(self, user_id: str) -> int:
        penalty_level = int(self.r.get(f"penalty:{user_id}") or 0)
        if penalty_level == 0:
            return self.base_limit
        if penalty_level >= len(self.PENALTY_LEVELS):
            return 0  # Suspended
        reduction = self.PENALTY_LEVELS[penalty_level - 1]["reduction"]
        return int(self.base_limit * reduction)
 
    def record_violation(self, user_id: str) -> int:
        penalty_key = f"penalty:{user_id}"
        new_level = int(self.r.incr(penalty_key))
        if new_level <= len(self.PENALTY_LEVELS):
            duration = self.PENALTY_LEVELS[new_level - 1]["duration"]
            self.r.expire(penalty_key, duration)
        return new_level

Q75: How do you coordinate rate limiting across a blue-green deployment during a rollout?

Answer:
During blue-green deployment, you have two versions of the service running simultaneously.
Both versions must enforce the same rate limits against the same user buckets.

Solution:

Both blue and green connect to the SAME Redis rate limit store.
Rate limit keys are user-centric (not server-centric): rl:{user_id}:...
Requests from blue and green both decrement the same user bucket
No coordination needed between blue and green code - Redis is the coordination layer

Potential issue during rollout:
If blue uses a different rate limit algorithm or key format than green, users on blue
and users on green have separate (inconsistent) buckets during the transition.

Fix: Version your rate limit keys:

rl:v2:{user_id}:{window}

When green deploys, it uses v2 keys. Blue still uses v1 keys. After cutover, migrate
counters from v1 to v2 if needed, or simply let v1 expire naturally.

Q76: How do you implement fair queuing when you need to delay (not reject) rate-limited requests?

Answer:
Instead of returning 429 immediately, some systems queue excess requests and process
them when capacity allows. This is throttling (leaky bucket behavior) rather than
pure rate limiting.

import asyncio
import time
 
 
class FairQueueRateLimiter:
    """
    Queue excess requests and process at a steady rate.
    Provides virtual wait times without holding real threads.
    """
 
    def __init__(self, rate_per_second: float, max_queue_size: int = 1000):
        self.rate = rate_per_second
        self.interval = 1.0 / rate_per_second
        self.max_queue = max_queue_size
        self.queued = 0
        self.lock = asyncio.Lock()
        self.last_processed = time.time()
 
    async def acquire(self) -> float:
        """
        Returns the wait time before this request can be processed.
        Raises QueueFullError if queue is at capacity.
        """
        async with self.lock:
            if self.queued >= self.max_queue:
                raise QueueFullError("Rate limit queue is full")
 
            now = time.time()
            # When will this request be processed?
            next_slot = max(now, self.last_processed + self.interval * (self.queued + 1))
            self.last_processed = next_slot
            self.queued += 1
            wait = next_slot - now
            return wait
 
    async def release(self):
        async with self.lock:
            self.queued -= 1
 
 
# Usage
limiter = FairQueueRateLimiter(rate_per_second=100)
 
async def handle_request(request):
    try:
        wait_time = await limiter.acquire()
        if wait_time > 0:
            await asyncio.sleep(wait_time)
        # Process request
        result = await process(request)
        await limiter.release()
        return result
    except QueueFullError:
        return Response(status=429, body={"error": "Queue full"})

Q77: What is the difference between per-second and per-minute rate limits from a user experience perspective?

Answer:
This is a UX question, not a technical one. Interviewers ask it to test holistic thinking.

Per-second limits (e.g., 10 RPS):

Prevents instantaneous flooding
Users with bursty apps hit limits immediately
Frustrating for legitimate users who have a burst of 20 requests at once
Best for: network-level DoS protection, protecting fragile backends

Per-minute limits (e.g., 100 RPM):

More user-friendly for normal usage patterns
A burst of 100 requests at once can happen within the first second of the minute
Best for: general developer APIs, user-facing APIs

Recommendation: Use BOTH together:

Per-second limit: 20 RPS (burst protection)
Per-minute limit: 100 RPM (sustained rate)
Per-hour limit: 2000 RPH (quota protection)

The per-second limit prevents catastrophic floods. The per-minute limit enforces the
average rate. Communicating this to users: "100 requests per minute, maximum 20 per second."

Q78: How do you handle a customer who legitimately needs 10x their rate limit for a one-time data migration?

Answer:
This is a real scenario that operations teams face regularly.

Technical solutions:

Temporary limit override: Store overrides in Redis or database:

# Override key takes precedence over default tier limit
# "rl:override:user_123" -> {limit: 10000, expires: 1735689600}

Time-windowed exception: Apply elevated limit only during the migration window.
Separate API key for migration: Create a special API key with elevated limits,
valid only for the migration period. Revoke afterward.
Batch API endpoint: Build a bulk/batch API endpoint that processes thousands
of items in one API call, counting as one request.

Process:

Customer submits a request explaining their use case and timeline
Operations reviews and approves
Temporary override is applied programmatically (no code deployment)
Customer completes migration
Override expires automatically

Key insight: Rate limit overrides must be auditable and automatically expiring.
"Temporary" overrides that never expire become permanent security holes.

Q79: What happens if a user sends a request exactly at the rate limit (the 100th request when limit is 100)?

Answer:
This question tests understanding of boundary conditions.

With INCR: The 100th INCR returns 100. count <= limit where both are 100 is TRUE.
The 100th request is ALLOWED. The 101st INCR returns 101. 101 <= 100 is FALSE. Rejected.

With sliding window counter: The estimate is a float. At exactly the limit:

if estimated_count < self.limit:  # Strict less-than
    # 100 < 100 is False -> REJECT the 100th request

vs:

if estimated_count <= self.limit:  # Less-than-or-equal
    # 100 <= 100 is True -> ALLOW the 100th request

The boundary behavior depends on implementation. Document whether the limit is inclusive
or exclusive. Standard behavior: limit is inclusive (100 requests/minute means the 100th
is allowed, the 101st is not).

Q80: How would you design a "credit system" rate limiter where unused credits roll over to the next period?

Answer:
Rollover credits change the behavior significantly. Instead of "100/minute, hard reset",
users accumulate unused credits.

class RolloverCreditRateLimiter:
    """
    Credits roll over up to a maximum reserve.
    Example: 100 credits/minute, max reserve = 300
    - User uses 50 credits in minute 1 -> earns 100, uses 50, carries 50
    - Minute 2: starts with 150 credits (50 rollover + 100 new)
    - Minute 3: uses 250 credits in one burst (250 < 300 max)
    """
 
    def __init__(
        self,
        credits_per_window: int,
        window_seconds: int,
        max_reserve: int,
        redis_client
    ):
        self.credits_per_window = credits_per_window
        self.window = window_seconds
        self.max_reserve = max_reserve
        self.r = redis_client
        # Implementation: similar to token bucket but with larger capacity
        # capacity = max_reserve, refill_rate = credits_per_window / window_seconds
        self.bucket = TokenBucketRedis(
            r=redis_client,
            capacity=max_reserve,
            refill_rate=credits_per_window / window_seconds
        )
 
    def consume(self, user_id: str, cost: int = 1) -> bool:
        return self.bucket.is_allowed(user_id, cost=cost)["allowed"]

Key insight: This is equivalent to a token bucket with max_reserve as the capacity
and credits_per_window / window_seconds as the refill rate. The token bucket naturally
accumulates unused credits up to the capacity.

Cheat Sheet: Key Numbers and Formulas

Algorithm Selection Quick Reference

Need burst support?          -> Token Bucket or GCRA
Need smooth output?          -> Leaky Bucket
Need O(1) memory?            -> Fixed Window, Sliding Window Counter, Token Bucket, GCRA
Need exact accuracy?         -> Sliding Window Log
Most common production use?  -> Sliding Window Counter or Token Bucket

Memory Formulas

Fixed Window Counter:   1 int per user per window = ~8 bytes
Sliding Window Log:     limit * 8 bytes per user = up to 8KB (limit=1000)
Sliding Window Counter: 2 ints + 1 float = ~20 bytes per user
Token Bucket:           2 floats = ~16 bytes per user
GCRA:                   1 float = ~8 bytes per user

Sliding Window Counter Formula

prev_weight = (window_size - elapsed) / window_size
estimate = prev_count * prev_weight + curr_count
allowed = estimate < limit

Exponential Backoff with Full Jitter

wait = random(0, min(max_delay, base_delay * 2^attempt))

Redis Key Patterns

Fixed window:           rl:fw:{user_id}:{window_id}
Sliding window counter: rl:sw:{user_id}:{window_id} (curr and prev)
Token bucket:           rl:tb:{user_id}  (hash with tokens + last_refill)
Sliding window log:     rl:swl:{user_id} (sorted set)
Cluster-safe:           rl:fw:{{{user_id}}}:{window_id}  (hash tag)

HTTP Rate Limit Headers Quick Reference

Request headers (from client):     none required
Response headers (from server, on ALL responses):
  X-RateLimit-Limit:     Total allowed
  X-RateLimit-Remaining: Remaining
  X-RateLimit-Reset:     Unix timestamp of reset
  X-RateLimit-Used:      Requests used (GitHub style)
  Retry-After:           Seconds to wait (only on 429)

Real-World Limits Reference

GitHub REST:   60/hour (unauth), 5000/hour (auth)
GitHub GraphQL: 5000 points/hour
Stripe:        100 RPS live, unlimited test
Twitter v2:    Varies by tier (1500 writes/month free)
AWS API GW:    10000 RPS account default

End of Rate Limiting Demystified Series

Return to Series Index for an overview of all parts.

Series: Rate Limiting Demystified

Rate Limiting Demystified - Part 6: Interview Questions

Table of Contents

How to Use This Section

Section 1: Conceptual Questions (Q1-Q20)

Q1: What is rate limiting and why do we need it?

Q2: What HTTP status code should be returned when a rate limit is exceeded?

Q3: Explain the difference between rate limiting, throttling, and circuit breaker.

Q4: What is the token bucket algorithm?

Q5: How is token bucket different from leaky bucket?

Q6: What is the sliding window log algorithm and what is its main drawback?

Q7: Explain the fixed window counter and its boundary problem.

Q8: What is the sliding window counter and how does it address the fixed window problem?

Q9: What are the standard rate limiting headers?

Q10: How do you choose the rate limit key (what to rate limit by)?

Q11: How would you handle rate limiting in a microservices architecture?

Q12: What is the thundering herd problem in rate limiting?

Q13: What is the difference between a hard limit and a soft limit?

Q14: What is an idempotency key and why does it matter for rate limiting?

Q15: How do you implement rate limiting for unauthenticated endpoints?

Q16: What headers should you send in a 200 OK response (not just 429)?

Q17: What is outbound rate limiting and why is it often overlooked?

Q18: Explain fail-open vs fail-closed for rate limiting.

Q19: What is the difference between per-request rate limiting and per-connection rate limiting?

Q20: How do rate limits interact with caching?

Section 2: Algorithm Questions (Q21-Q35)

Q21: What algorithm does Stripe use for rate limiting?

Q22: What is the maximum error rate of the sliding window counter?

Q23: How much memory does each algorithm use per user?

Q24: Why must a Redis token bucket be implemented with a Lua script?

Q25: What is the boundary problem in fixed window rate limiting?

Q26: How does GCRA calculate Theoretical Arrival Time (TAT)?

Q27: When would you use a leaky bucket instead of a token bucket?

Q28: How do sorted sets (ZADD/ZREMRANGEBYSCORE) implement the sliding window log?

Q29: What is the difference between tumbling window and sliding window?

Q30: How does Nginx implement rate limiting internally?

Q31: How do you implement rate limiting for WebSocket connections?

Q32: How would you rate limit GraphQL queries?

Q33: What is the difference between rate limiting and quota?

Q34: How do you handle rate limiting for bulk operations?

Q35: What is GCRA and how does it compare to token bucket?

Section 3: System Design Questions (Q36-Q50)

Q36: Design a rate limiter for a public REST API with 1 million users.

Q37: Design a distributed rate limiter for a system running on 100 servers.

Q38: How would you design rate limiting for a payment API?

Q39: How would you implement rate limiting in a serverless architecture (AWS Lambda)?

Q40: Design rate limiting for a streaming service like Netflix or YouTube.

Q41: How would you handle rate limiting when users are behind a NAT?

Q42: How do you handle clock skew in a distributed rate limiter?

Q43: How would you design a rate limiter that is also an API monetization engine?

Section 4: Coding Questions (Q51-Q58)

Q51: Implement a token bucket rate limiter in Python.

Q52: Implement a sliding window counter.

Q53: Implement a fixed window rate limiter using Redis in Python.

Q54: Write a Redis Lua script for atomic sliding window rate limiting.

Q55: Implement exponential backoff with full jitter.

Q56: Implement a concurrent-safe rate limiter test that validates no race condition.

Q57: Implement a rate limiter that supports multiple time windows simultaneously.

Q58: How would you test that a rate limiter handles Redis failure gracefully?

Section 5: Tricky and Advanced Questions (Q59-Q80)

Q59: A user has a 100 RPM limit. They send 50 requests at 11:59:30 and 80 at 12:00:10. How many are allowed (fixed window)? How many (sliding window counter)?

Q60: How do you rate limit a GraphQL API where the same query can cost 1 to 10,000 points?

Q61: How would you prevent a DDoS attack that rotates through millions of unique API keys?

Q62: Your rate limiter uses Redis. Redis experiences a split-brain scenario. What happens?

Q63: How does Cloudflare rate limit at 1 million requests per second with sub-millisecond latency?

Q64: A microservice has a rate limiter. How do you ensure the rate limit counts across retries from an upstream service?

Q65: How would you implement rate limiting for a service that has both human users and machine clients with different behavior patterns?

Q66: What are the rate limiting considerations for gRPC services?

Q67: How would you implement "shadow mode" for a new rate limit policy?

Q68: How does Twitter's API rate limiting differ between v1.1 and v2?

Q69: How do you handle rate limiting for webhooks that your service sends to customer endpoints?

Q70: Design rate limiting for an AI/LLM API (like OpenAI) where different models have different costs.

Q71: What is the "retry storm" problem and how do rate limiting design prevents it?

Q72: How do you rate limit by "user intent" or "action type" rather than just request count?

Q73: You have 10 Redis nodes in a cluster. A user's rate limit key hashes to node 3. Node 3 is slow (high latency). How do you handle this?

Q74: How would you implement "penalty box" rate limiting where abusive users get progressively stricter limits?

Q75: How do you coordinate rate limiting across a blue-green deployment during a rollout?

Q76: How do you implement fair queuing when you need to delay (not reject) rate-limited requests?

Q77: What is the difference between per-second and per-minute rate limits from a user experience perspective?

Q78: How do you handle a customer who legitimately needs 10x their rate limit for a one-time data migration?