Rate Limiting Demystified - Part 1: Fundamentals
Table of Contents
- What Is Rate Limiting?
- Why Rate Limiting Matters
- Core Terminology
- Types of Rate Limiting
- Rate Limiting vs Related Concepts
- HTTP Standards for Rate Limiting
- Where to Implement Rate Limiting
- Rate Limiting Granularity and Scope
- Soft Limits vs Hard Limits
- Inbound vs Outbound Rate Limiting
1. What Is Rate Limiting?
Rate limiting is the practice of controlling how many requests a client (user, IP address, API key,
or application) can make to a service within a defined time period.
Think of it like a highway toll booth. During rush hour, only a certain number of cars can pass
through per minute. If too many cars arrive at once, some must wait or are turned away. The highway
itself does not break - it simply enforces a flow limit.
The Everyday Analogy
Imagine a coffee shop with one barista. That barista can make 10 coffees per minute. If 100 people
walk in at the same moment and all order at once, the barista cannot keep up. The solution is:
- Put a limit on how many orders you accept per minute (rate limiting)
- Queue extra orders and process them later (throttling)
- Tell people to come back after a specific time (backoff)
- Close the shop temporarily under extreme load (circuit breaker)
Rate limiting is the first line of defense: control the input before the system is overwhelmed.
Formal Definition
Rate Limiting is a technique that restricts the number of requests a sender can make to a
receiver within a given time window. Requests that exceed the limit are rejected, delayed,
or queued depending on the policy.
2. Why Rate Limiting Matters
Rate limiting is not optional for production systems. Here is what it protects against:
2.1 Abuse and DoS/DDoS Prevention
Without rate limiting, a single malicious actor or a buggy client can flood your API with
thousands of requests per second. When the flood is deliberate, it is a Denial of Service (DoS)
attack. Rate limiting caps what a single source can do, which blunts simple single-source DoS
attacks. Distributed attacks (DDoS) spread the load across many sources, so per-source limits
need to be paired with edge or global limits.
Example: A competitor sends 50,000 requests/second to your public pricing API. Without
rate limiting, your database connection pool is exhausted and your service becomes unavailable
for legitimate users.
2.2 Fair Resource Distribution
In a multi-tenant system, one heavy user should not degrade the experience for others. Rate
limiting enforces fairness - every user gets their fair share of system resources.
Real example: A SaaS platform has 10,000 customers sharing a database. One customer runs a
report that generates 1,000 API calls per second. Rate limiting ensures that customer does not
starve the other 9,999.
2.3 Cost Control
Cloud services charge per API call, per compute unit, or per database read. An uncontrolled client
(or your own code with a bug) can run up a massive bill in hours.
Real example: A developer accidentally deploys code with an infinite retry loop that hits
your database 10,000 times per second. Without rate limiting on the calling service, the AWS
bill for that month is catastrophic.
2.4 API Monetization and Tiering
Rate limiting is the technical enforcement mechanism behind paid API tiers. Free users get 100
calls/day, Pro users get 10,000, Enterprise users get unlimited. This business model only works
if rate limits are enforced.
Real example: GitHub's REST API gives unauthenticated users 60 requests/hour. Authenticated
users get 5,000. Enterprise contracts get custom limits.
2.5 Protecting Downstream Dependencies
Your API may call databases, third-party services, or internal microservices. If your own API
receives unlimited requests, it will relay that load to every downstream service. Rate limiting
at your API boundary protects your entire dependency chain.
2.6 SLA and Quality of Service
Rate limits help ensure that response time SLAs are met. If load testing shows your service can
handle 1,000 requests/second while keeping p99 latency under 100ms, capping admitted traffic at
1,000 RPS keeps the service inside the range where that SLA is known to hold.
3. Core Terminology
Understanding these terms precisely is essential. Interviewers probe for exact definitions.
Rate
The number of requests per unit of time. Often expressed as:
- RPS: Requests Per Second
- RPM: Requests Per Minute
- RPH: Requests Per Hour
- RPD: Requests Per Day
Limit
The maximum allowed count within the window. If the limit is 100 and the window is 60 seconds,
the client can make at most 100 requests in any given 60-second period (depending on algorithm).
Window
The time period over which requests are counted. Common window types:
- Fixed (Tumbling) Window: A discrete, non-overlapping time block. 12:00:00 to 12:01:00 is
one window. 12:01:00 to 12:02:00 is the next.
- Sliding Window: A continuous rolling window that moves with each request. "The last 60
seconds" from the current moment.
Burst
A short-term spike in traffic that exceeds the average rate but is still within acceptable
bounds. Burst handling is what separates token bucket from leaky bucket.
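Example: Your limit is 100 RPM. A user normally sends 50 RPM, so unused capacity accumulates
as "credit." At one moment they send 80 requests in 5 seconds. A token bucket allows this burst
because enough unused tokens have accumulated (up to the bucket's capacity). A leaky bucket does
not: it releases requests at a constant rate, so the excess is queued or dropped.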
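The burst behavior can be sketched with a minimal token bucket. This is an illustration, not a production implementation: the class name `TokenBucket`, the capacity of 100 tokens, and the refill rate of 100 tokens per 60 seconds are assumptions chosen to match the 100 RPM example, and the caller passes the current time explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Minimal token bucket: holds up to `capacity` tokens, refilled continuously."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)  # start full: idle clients have "credit"
        self.last = now

    def allow(self, now):
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# 100 requests per minute, with bursts of up to 100 allowed.
bucket = TokenBucket(capacity=100, refill_per_second=100 / 60)
burst = [bucket.allow(now=0.0) for _ in range(80)]      # 80 requests at once
follow_up = [bucket.allow(now=0.0) for _ in range(30)]  # 30 more, no time passed
```

The 80-request burst is accepted in full because the bucket started full. Of the 30 follow-up requests only 20 pass, since no time has elapsed to refill tokens.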
Quota
A longer-term limit, often measured in days or months. Quotas are different from rate limits:
- Rate limit: 100 requests/minute (controls speed)
- Quota: 10,000 requests/day (controls total volume)
A system can have both: "You can send up to 100 RPM, but no more than 10,000 per day total."
Throttle
The action taken when a limit is exceeded. Throttling can mean:
- Rejecting the request immediately (HTTP 429)
- Delaying the request (queuing/slowing it down)
- Degrading the response (returning cached or lower-quality data)
Backpressure
A signal sent from a downstream service to an upstream caller telling it to slow down. Unlike
rate limiting (which enforces on the receiver), backpressure is communicated back to the sender.
Rate limiting is a form of enforced backpressure.
Jitter
Intentional randomization added to retry timers. When all clients hit the rate limit at the
same time and all retry after exactly 60 seconds, they create a synchronized burst called a
"thundering herd." Jitter breaks this synchronization.
Idempotency Key
A unique identifier sent by the client to allow safe retries. If a request is rate limited and
the client retries, the server can use the idempotency key to detect it is the same logical
operation and not double-charge or double-process.
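A minimal sketch of how a server might use an idempotency key to make retries safe. The `PaymentProcessor` class, its in-memory dictionary, and the `charge_id` format are hypothetical; a real service would persist results in a shared store and expire old keys.

```python
class PaymentProcessor:
    """Hypothetical handler that remembers results by idempotency key."""

    def __init__(self):
        self._results = {}     # idempotency_key -> stored result
        self.charges_made = 0  # counts how many charges were really executed

    def charge(self, idempotency_key, amount_cents):
        # A retry of the same logical operation returns the stored result
        # instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": f"ch_{self.charges_made}", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


processor = PaymentProcessor()
first = processor.charge("order-42-attempt", 1999)
retry = processor.charge("order-42-attempt", 1999)  # e.g. retried after a 429
```

Both calls return the same result, and only one charge is executed.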
4. Types of Rate Limiting
4.1 User-Level Rate Limiting
Limits based on authenticated user identity. The most precise and fairest approach.
Key format: "rate_limit:user:{user_id}"
Example: "rate_limit:user:user_123456"
- Requires authentication to be in place
- Survives IP changes (mobile users on cellular networks change IPs constantly)
- Can be tied to subscription tiers
4.2 IP-Based Rate Limiting
Limits based on the client's IP address. The easiest to implement, often used as a first
line of defense even for unauthenticated endpoints.
Key format: "rate_limit:ip:{ip_address}"
Example: "rate_limit:ip:203.0.113.42"
Caution: IPv4 exhaustion has led to widespread use of NAT (Network Address Translation).
Many users behind a corporate firewall or mobile carrier share a single IP. Limiting by IP
can accidentally block many legitimate users. Also, attackers can rotate IPs easily.
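Caution: When behind a load balancer or reverse proxy, the connection IP is your proxy's, so
read the real client IP from the X-Forwarded-For or X-Real-IP header. Trust these headers only
when a proxy you control sets them: clients can send arbitrary values, so a forged header lets
an attacker evade or misdirect IP-based limits.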
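Because forwarding headers can be forged by the client, one common approach is to honor X-Forwarded-For only when the direct peer is a proxy you operate, and to read the header from right to left. The sketch below assumes a hypothetical `TRUSTED_PROXIES` list (here `10.0.0.0/8`); the function names are illustrative.

```python
import ipaddress

# Hypothetical: networks where your own load balancers / proxies live.
TRUSTED_PROXIES = [ipaddress.ip_network("10.0.0.0/8")]


def is_trusted_proxy(ip):
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # not a valid IP, so never trusted
    return any(addr in net for net in TRUSTED_PROXIES)


def client_ip(peer_ip, x_forwarded_for=None):
    """Pick the IP to rate limit on, given the socket peer and the XFF header."""
    # Ignore the header entirely unless the request came through our proxy.
    if not x_forwarded_for or not is_trusted_proxy(peer_ip):
        return peer_ip
    hops = [h.strip() for h in x_forwarded_for.split(",") if h.strip()]
    # Walk right to left: the rightmost entries were appended by our proxies,
    # so the first untrusted hop is the address our infrastructure actually saw.
    for hop in reversed(hops):
        if not is_trusted_proxy(hop):
            return hop
    return hops[0] if hops else peer_ip
```

Reading from the right means a client that sends its own `X-Forwarded-For: 6.6.6.6` cannot choose which address it is limited under.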
4.3 API Key-Based Rate Limiting
Limits tied to API credentials. Commonly used for machine-to-machine APIs and developer
platforms.
Key format: "rate_limit:apikey:{api_key_hash}"
- Hash the API key before using it as a Redis key (never store raw credentials)
- Different API keys can have different limits based on the account tier
- Enables key rotation without impacting the user
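A short sketch of deriving the counter key from a hashed API key, so the raw credential never appears in the rate limit store. The function name `api_key_counter_key` and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib


def api_key_counter_key(api_key):
    """Build the rate limit key from a SHA-256 digest of the API key."""
    digest = hashlib.sha256(api_key.encode("utf-8")).hexdigest()
    return f"rate_limit:apikey:{digest}"


key = api_key_counter_key("sk_test_example_key")
```

The same API key always maps to the same counter, while the stored key reveals nothing usable as a credential.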
4.4 Endpoint-Level Rate Limiting
Different limits for different API endpoints based on their cost and sensitivity.
| Endpoint | Limit | Reason |
|---|---|---|
| GET /api/products | 1000/min | Read-only, cheap |
| POST /api/orders | 10/min | Writes, expensive |
| POST /api/auth/login | 5/min | Security - prevent brute force |
| POST /api/export | 2/hour | Very expensive operation |
| GET /api/search | 100/min | Moderate - search is expensive |
This is best practice: not all endpoints are equal. A search endpoint that queries Elasticsearch
should have a tighter limit than a simple GET of a cached product detail page.
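One way to express per-endpoint limits in application code is a lookup table keyed by method and path. The values below mirror the table above; `DEFAULT_LIMIT`, the function names, and the key layout are hypothetical.

```python
# (method, path) -> (max_requests, window_seconds)
ENDPOINT_LIMITS = {
    ("GET", "/api/products"): (1000, 60),
    ("POST", "/api/orders"): (10, 60),
    ("POST", "/api/auth/login"): (5, 60),
    ("POST", "/api/export"): (2, 3600),
    ("GET", "/api/search"): (100, 60),
}
DEFAULT_LIMIT = (60, 60)  # hypothetical fallback for endpoints not listed


def limit_for(method, path):
    """Return (max_requests, window_seconds) for an endpoint."""
    return ENDPOINT_LIMITS.get((method.upper(), path), DEFAULT_LIMIT)


def counter_key(user_id, method, path):
    """Counter key scoped to one user on one endpoint."""
    return f"rate_limit:endpoint:{method.upper()}:{path}:user:{user_id}"
```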
4.5 Global (System-Wide) Rate Limiting
A ceiling on total requests to the entire system, regardless of source. Used to protect
infrastructure capacity.
Key: "rate_limit:global:system"
Limit: 50,000 RPS (because that's what infrastructure can handle)
This is a hard ceiling. Even if no single user is at their individual limit, if aggregate
traffic hits the global limit, new requests are rejected.
4.6 Geographic Rate Limiting
Limits based on country, region, or data center. Used for:
- Compliance (GDPR region restrictions)
- Cost optimization (traffic from specific regions is more expensive)
- Fraud detection (unusual traffic from a new geography)
4.7 Concurrent Connection Rate Limiting
Instead of counting requests over time, this limits how many simultaneous open requests or
connections a client can have. Useful for:
- WebSocket connections
- Long-polling endpoints
- File download/upload slots
# Nginx example: limit concurrent connections per IP
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;
limit_conn conn_limit 10;
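The same idea at the application layer can be sketched as a per-client counter of in-flight requests. `ConcurrencyLimiter` is a hypothetical single-process example; across multiple instances the counts would need to live in shared storage.

```python
import threading
from collections import defaultdict


class ConcurrencyLimiter:
    """Caps the number of simultaneous in-flight requests per client."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent
        self._active = defaultdict(int)
        self._lock = threading.Lock()

    def try_acquire(self, client_id):
        with self._lock:
            if self._active[client_id] >= self.max_concurrent:
                return False  # too many open requests for this client
            self._active[client_id] += 1
            return True

    def release(self, client_id):
        # Call when the request or connection finishes (e.g. in a finally block).
        with self._lock:
            if self._active[client_id] > 0:
                self._active[client_id] -= 1


limiter = ConcurrencyLimiter(max_concurrent=2)
results = [limiter.try_acquire("203.0.113.42") for _ in range(3)]
limiter.release("203.0.113.42")
after_release = limiter.try_acquire("203.0.113.42")
```

Unlike a time-window counter, a slot is freed only when a request completes, so the limit holds no matter how long each request runs.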
4.8 Compound/Multi-Dimensional Rate Limiting
Real production systems combine multiple limit types simultaneously:
User: 100 requests/minute AND 10,000 requests/day
Endpoint POST /api/export: 2 requests/hour per user
Global: 50,000 requests/second across all users
A request must pass ALL applicable limits to be accepted.
5. Rate Limiting vs Related Concepts
These concepts are frequently confused in interviews. Know the differences precisely.
5.1 Rate Limiting vs Throttling
| Aspect | Rate Limiting | Throttling |
|---|---|---|
| Action on excess | Reject (429) | Delay or slow down |
| Client experience | Gets an error | Gets a slower response |
| Queue | No | Yes (requests are queued) |
| Use case | Hard limits, abuse prevention | Smoothing traffic, prioritization |
| Example | "You can only send 100 RPS. Request 101 is rejected." | "You sent 100 RPS. Request 101 is queued for 100ms." |
In practice, the terms are often used interchangeably in conversation but they have
technically different behaviors.
5.2 Rate Limiting vs Circuit Breaker
| Aspect | Rate Limiting | Circuit Breaker |
|---|---|---|
| Triggered by | Incoming request count | Downstream failure rate |
| Purpose | Protect from too many requests | Protect from cascading failures |
| Direction | Inbound traffic control | Outbound call protection |
| Pattern | Counts requests over time | Monitors error rates / latencies |
| State | Per-key counters or tokens, no mode switching | State machine (CLOSED, OPEN, HALF-OPEN) |
| Example | "Client is sending too many requests" | "Database is failing 50% of calls, stop calling it" |
5.3 Rate Limiting vs Load Shedding
| Aspect | Rate Limiting | Load Shedding |
|---|---|---|
| Trigger | Per-client request count | Overall system overload |
| Granularity | Per user/IP/key | Entire request classes |
| Criteria | "Did this user exceed their quota?" | "Is system CPU > 90%? Drop lowest priority requests." |
| Implementation | Redis counter per client | System metrics + priority queue |
Load shedding is more drastic - it drops requests based on system health, not per-client behavior.
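A minimal sketch of the load shedding criterion, assuming requests carry a priority label and the server can read its current CPU utilization. The priority names and thresholds in `SHED_THRESHOLDS` are hypothetical.

```python
# Hypothetical thresholds: shed a priority class once CPU utilization reaches it.
SHED_THRESHOLDS = {
    "low": 0.80,
    "normal": 0.90,
    "critical": 1.01,  # above 100%, so critical traffic is never shed here
}


def should_shed(priority, cpu_utilization):
    """Decide from system health alone; no per-client counter is consulted."""
    return cpu_utilization >= SHED_THRESHOLDS[priority]
```

At 85% CPU only low-priority requests are dropped, and at 95% normal-priority requests are dropped as well, regardless of which client sent them.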
5.4 Rate Limiting vs Backpressure
| Aspect | Rate Limiting | Backpressure |
|---|---|---|
| Direction | Server tells client to slow down (push) | Downstream signals upstream to slow (pull) |
| Protocol | HTTP 429 response | Flow control signals (reactive streams, TCP window) |
| Where used | HTTP APIs | Message queues, stream processing, reactive systems |
| Example | REST API returns 429 | Reactive Streams subscriber requests only as many items as it can process |
6. HTTP Standards for Rate Limiting
6.1 HTTP Status Codes
429 Too Many Requests (RFC 6585)
This is the correct status code when rate limit is exceeded. The response SHOULD include
a Retry-After header.
    HTTP/1.1 429 Too Many Requests
    Content-Type: application/json
    Retry-After: 60
    X-RateLimit-Limit: 100
    X-RateLimit-Remaining: 0
    X-RateLimit-Reset: 1735689600

    {
      "error": "rate_limit_exceeded",
      "message": "You have exceeded the rate limit of 100 requests per minute.",
      "retry_after_seconds": 60
    }

503 Service Unavailable is sometimes used when global rate limits or circuit breakers trip,
but 429 is semantically more accurate for per-client rate limiting.
503 vs 429:
- Use 429 when a specific client is exceeding their rate limit
- Use 503 when the entire service is overwhelmed and cannot handle any requests
6.2 Standard Rate Limiting Response Headers
These headers inform clients about their current rate limit status:
| Header | Description | Example |
|---|---|---|
| X-RateLimit-Limit | Total requests allowed in the current window | X-RateLimit-Limit: 100 |
| X-RateLimit-Remaining | Requests remaining in the current window | X-RateLimit-Remaining: 73 |
| X-RateLimit-Reset | Unix timestamp when the window resets | X-RateLimit-Reset: 1735689600 |
| Retry-After | Seconds to wait (or an HTTP date) before retrying (RFC 9110, formerly RFC 7231) | Retry-After: 60 |
| X-RateLimit-Used | Number of requests used (GitHub style) | X-RateLimit-Used: 27 |
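A sketch of how a server might assemble a 429 response carrying these headers. The function name `build_429` and its parameters are assumptions, and the body fields follow the example response shown earlier in this section.

```python
import json
import math


def build_429(limit, window_seconds, reset_epoch, now_epoch):
    """Return (status, headers, body) for a rejected request."""
    # Seconds until the window resets, rounded up and never less than 1.
    retry_after = max(1, math.ceil(reset_epoch - now_epoch))
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    body = json.dumps({
        "error": "rate_limit_exceeded",
        "message": f"You have exceeded the rate limit of {limit} requests "
                   f"per {window_seconds} seconds.",
        "retry_after_seconds": retry_after,
    })
    return 429, headers, body


status, headers, body = build_429(100, 60, reset_epoch=1735689600, now_epoch=1735689540)
```

Deriving Retry-After from the same reset time that feeds X-RateLimit-Reset keeps the two headers consistent with each other.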
6.3 IETF RateLimit Header Field Draft
The IETF HTTPAPI working group has a draft for standardized rate limit header fields
(draft-ietf-httpapi-ratelimit-headers). The goal is to replace vendor-specific headers with:

    RateLimit-Limit: 100
    RateLimit-Remaining: 73
    RateLimit-Reset: 60

Unlike X-RateLimit-Reset above, RateLimit-Reset is the number of seconds until the window
resets, not a Unix timestamp.
The RateLimit-Policy header (from the draft) allows exposing policy details:

    RateLimit-Policy: 100;w=60;burst=200;comment="sliding window"

The field names and syntax have changed between draft revisions, so check the current revision
before implementing.
Most major APIs (GitHub, Stripe, Twitter) still use the X-RateLimit-* convention.
The IETF draft is gaining adoption but is not yet universal.
6.4 Client Responsibilities
A well-behaved API client MUST:
- Read X-RateLimit-Remaining on every response
- Proactively slow down when remaining is low (not wait for 429)
- Respect the Retry-After header when 429 is received
- Implement exponential backoff with jitter for retries
- Use idempotency keys for safe retries on non-idempotent operations
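The first three responsibilities can be sketched as a single helper that decides how long to pause before the next request. The function name, the `low_watermark` parameter, and the pacing formula are assumptions, and header names are assumed to arrive in the casing shown above.

```python
def seconds_to_wait(status, headers, now_epoch, low_watermark=5):
    """How long a client should pause before its next request."""
    if status == 429:
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            return float(retry_after)  # server told us exactly how long
        reset = headers.get("X-RateLimit-Reset", "")
        if reset.isdigit():
            return max(0.0, float(reset) - now_epoch)
        return 1.0  # no hint at all: short default pause

    remaining = headers.get("X-RateLimit-Remaining", "")
    reset = headers.get("X-RateLimit-Reset", "")
    if remaining.isdigit() and reset.isdigit() and int(remaining) <= low_watermark:
        # Budget is nearly gone: spread what is left over the rest of the window.
        return max(0.0, float(reset) - now_epoch) / (int(remaining) + 1)
    return 0.0
```

For example, with 1 request remaining and 60 seconds until reset, the helper suggests waiting 30 seconds instead of spending the last request immediately and then hitting a 429.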
6.5 Full Response Example
    HTTP/1.1 200 OK
    Content-Type: application/json
    X-RateLimit-Limit: 1000
    X-RateLimit-Remaining: 847
    X-RateLimit-Reset: 1735689660
    X-RateLimit-Used: 153
    X-RateLimit-Resource: core

    HTTP/1.1 429 Too Many Requests
    Content-Type: application/json
    Retry-After: 3600
    X-RateLimit-Limit: 1000
    X-RateLimit-Remaining: 0
    X-RateLimit-Reset: 1735689660

    {
      "message": "API rate limit exceeded for user 12345.",
      "documentation_url": "https://docs.example.com/rate-limits"
    }

7. Where to Implement Rate Limiting
Rate limiting can be applied at multiple layers. Each layer has distinct trade-offs.
7.1 Layer Overview
    [Client] --> [CDN/Edge] --> [Load Balancer] --> [API Gateway] --> [Application] --> [Database]
       |             |                 |                  |                 |
    Client-side  Geographic     IP-based            API Key-based     Business logic
    retry        limits &       coarse-grain        & user-based      fine-grain
    logic        DDoS           rate limits         rate limits       rate limits
7.2 Client Side
Purpose: Respect upstream limits proactively. Prevent self-inflicted 429 errors.
Implementation:
- Track requests made and respect headers from server
- Implement exponential backoff with jitter
- Use request queuing libraries
Best for: SDK clients, service-to-service calls, batch processing jobs
Limitation: You cannot rely on clients to self-limit. Always enforce on the server side too.
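Exponential backoff with jitter, mentioned in the implementation list above, can be sketched as follows. This uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling; the function name and the `base` and `cap` defaults are assumptions.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Delay in seconds before retry number `attempt` (0-based), with full jitter."""
    ceiling = min(cap, base * (2 ** attempt))  # 1s, 2s, 4s, ... capped at `cap`
    return rng.uniform(0, ceiling)             # randomize so clients do not retry in sync


delays = [backoff_delay(attempt) for attempt in range(8)]
```

Because every client draws a different delay, clients that were rejected at the same moment spread their retries out instead of returning as one synchronized burst.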
7.3 CDN / Edge Layer (Cloudflare, Akamai, Fastly)
Purpose: Block volumetric attacks before they reach your origin. Geographic filtering.
Implementation:
- Cloudflare Rate Limiting rules: define rate, window, response
- Akamai rate control policies
- WAF (Web Application Firewall) rate limiting rules
Best for: DDoS protection, bot mitigation, geographic blocking
Limitation: Limited to IP/geographic granularity. Cannot see authenticated user identity.
7.4 Load Balancer (Nginx, HAProxy, AWS ALB)
Purpose: Coarse-grained IP-based rate limiting before requests hit your application servers.
Nginx example:
    http {
        limit_req_zone $binary_remote_addr zone=per_ip:10m rate=100r/m;

        server {
            location /api/ {
                limit_req zone=per_ip burst=20 nodelay;
                proxy_pass http://backend;
            }
        }
    }

Best for: First line of defense, protecting application servers from IP-based floods.
Limitation: No knowledge of business context (user tiers, subscription limits).
7.5 API Gateway (Kong, AWS API Gateway, Apigee, Envoy)
Purpose: User/API key-based rate limiting with full request context.
Implementation:
- AWS API Gateway: Usage Plans with throttle settings per stage/method
- Kong: rate-limiting plugin per route, service, or consumer
- Apigee: Quota and SpikeArrest policies
Best for: API-as-a-product platforms, monetization, multi-tenant SaaS
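Limitation: The gateway is a shared component with limited business context, so limits that
depend on application state (for example, per-request cost or custom per-customer rules) are
hard to express there.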
7.6 Application Layer (Your Code)
Purpose: Business-logic-aware rate limiting. User tier awareness, endpoint cost awareness.
Implementation:
- Spring Boot Filter or Interceptor
- Python middleware
- Node.js middleware
Best for: Fine-grained control, business rule enforcement, subscription tier enforcement
Limitation: Runs in every application instance, requires distributed state (Redis) to work
across instances.
7.7 Service Mesh (Istio, Envoy, Linkerd)
Purpose: Rate limiting for internal service-to-service communication.
Envoy example:
    rate_limits:
      - actions:
          - request_headers:
              header_name: x-user-id
              descriptor_key: user_id

Best for: Microservices architectures where individual services need protection from
other internal services (not just external clients).
7.8 Choosing Your Layer
| Scenario | Recommended Layer |
|---|---|
| Public API with monetization tiers | API Gateway + Application |
| Preventing DDoS from unknown IPs | CDN/Edge or Load Balancer |
| Per-user business logic limits | Application layer |
| Internal microservices | Service Mesh |
| Service calling a third-party API | Client-side |
| Quick and dirty protection | Load Balancer |
| Fine-grained control | Application layer with Redis |
Best practice: Implement rate limiting at MULTIPLE layers. CDN for DDoS, Load Balancer
for IP-based, Application for user-based. Defense in depth.
8. Rate Limiting Granularity and Scope
8.1 Time Window Granularity
| Granularity | Use Case | Example |
|---|---|---|
| Per second | Real-time APIs, streaming | 10 RPS for video frame API |
| Per minute | Standard REST APIs | 100 RPM for general use |
| Per hour | Batch operations | 500 requests/hour for reports |
| Per day | Quota enforcement | 10,000 requests/day for free tier |
| Per month | Billing cycles | 1,000,000 requests/month for paid tier |
8.2 Composite Limits
Production systems often stack multiple limits:
User "free_user_123":
- Max 10 requests/second (burst control)
- Max 100 requests/minute (normal rate)
- Max 5,000 requests/day (daily quota)
- Max 50,000 requests/month (monthly quota)
A request must pass ALL applicable limits. If any limit is exceeded, the request is rejected.
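A sketch of stacking limits, using simple fixed-window counters for brevity. `CompositeLimiter` and its in-memory counters are hypothetical; a real system would typically keep the counters in a shared store and expire old windows.

```python
from collections import defaultdict


class CompositeLimiter:
    """Accepts a request only if every (window_seconds, max_requests) limit passes."""

    def __init__(self, limits):
        self.limits = limits            # e.g. [(1, 10), (60, 100)]
        self.counts = defaultdict(int)  # (user, window, window_index) -> count

    def allow(self, user_id, now):
        keys = [(user_id, window, int(now // window)) for window, _ in self.limits]
        # Check every limit first, so a rejected request consumes no budget.
        for key, (_, max_requests) in zip(keys, self.limits):
            if self.counts[key] >= max_requests:
                return False
        for key in keys:
            self.counts[key] += 1
        return True


limiter = CompositeLimiter([(1, 10), (60, 100)])  # 10/second AND 100/minute
first_second = [limiter.allow("free_user_123", now=0.0) for _ in range(11)]
for second in range(1, 10):
    for _ in range(10):
        limiter.allow("free_user_123", now=float(second))
blocked_by_minute = limiter.allow("free_user_123", now=10.0)
next_minute = limiter.allow("free_user_123", now=60.0)
```

The 11th request in the first second is rejected by the per-second limit. After 100 accepted requests, the request at t=10 is rejected by the per-minute limit even though its per-second budget is untouched.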
8.3 The "N requests per window" vs "N requests per rolling window" Distinction
This distinction matters for fairness:
Fixed window (N per window):
- Window 1: 12:00:00 - 12:01:00 -> allows 100 requests
- Window 2: 12:01:00 - 12:02:00 -> allows 100 requests
- Problem: 100 requests at 12:00:59 + 100 at 12:01:01 = 200 requests in 2 seconds
Sliding window (N per rolling window):
- "At any given moment, no more than 100 requests in the past 60 seconds"
- More fair but more complex to implement
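The boundary problem can be reproduced with two small limiters fed the same traffic: 100 requests just before a window boundary and 100 just after. Both classes are minimal sketches with illustrative names, and timestamps are plain seconds passed in by the caller.

```python
from collections import deque


class FixedWindow:
    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.current_window, self.count = None, 0

    def allow(self, now):
        index = int(now // self.window)
        if index != self.current_window:  # new window: counter starts over
            self.current_window, self.count = index, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False


class SlidingWindowLog:
    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.log = deque()  # timestamps of accepted requests

    def allow(self, now):
        # Drop timestamps that have fallen out of the last `window` seconds.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False


def accepted(limiter):
    # 100 requests at t=59s and 100 more at t=61s, straddling the 60s boundary.
    times = [59.0] * 100 + [61.0] * 100
    return sum(limiter.allow(t) for t in times)


fixed_total = accepted(FixedWindow(limit=100, window=60))
sliding_total = accepted(SlidingWindowLog(limit=100, window=60))
```

The fixed window accepts all 200 requests within 2 seconds, while the sliding window log accepts only the first 100, at the cost of storing one timestamp per accepted request.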
9. Soft Limits vs Hard Limits
Hard Limits
Requests over the limit are always rejected immediately. No exceptions.
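    # Illustrative helper: returns the HTTP status for a request
    def check_hard_limit(request_count, limit):
        return 429 if request_count > limit else 200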
Used when: Security boundaries, payment APIs, preventing abuse
Soft Limits
Requests over the limit are allowed up to a secondary threshold, or are allowed with
degraded quality of service.
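    import logging

    # Illustrative helper: returns (status, extra response headers)
    def check_limits(request_count, soft_limit, hard_limit):
        if request_count > hard_limit:
            return 429, {}  # hard limit: reject
        if request_count > soft_limit:
            logging.warning("soft limit exceeded: %d requests", request_count)
            return 200, {"X-RateLimit-Warning": "Approaching limit"}
        return 200, {}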
Used when: User experience is critical, you want to warn before cutting off,
or when the cost of blocking a legitimate user is high.
Grace Period
Allow a short-term burst above the limit before enforcing. A user normally at 90 RPM
is allowed to briefly hit 120 RPM, but sustained exceeding triggers enforcement.
10. Inbound vs Outbound Rate Limiting
This is a distinction many developers miss. Rate limiting applies in BOTH directions.
10.1 Inbound Rate Limiting (Server Side)
Protecting YOUR service from too many incoming requests.
[External Client] ---(too many requests?)---> [Your API] ---(protect)--> [Your DB]
This is what most people think of when they hear "rate limiting."
10.2 Outbound Rate Limiting (Client Side)
Controlling how many requests YOU send to a downstream service.
    [Your Service] ---(respect their limits)---> [Stripe API]
                                                 [SendGrid API]
                                                 [OpenAI API]
                                                 [AWS Services]
If you call Stripe with 200 requests/second, they will rate limit you (429). This can
cascade: your API becomes slow/unavailable because your Stripe integration is 429-ing.
Outbound rate limiting implementation:
    // Using Bucket4j (7.x/8.x API) to limit outbound calls to Stripe.
    // StripeAPI, ChargeRequest, ChargeResponse, and RateLimitException are
    // placeholder application types, not part of Bucket4j or Stripe's SDK.
    import io.github.bucket4j.Bandwidth;
    import io.github.bucket4j.Bucket;
    import io.github.bucket4j.Refill;
    import java.time.Duration;

    public class StripeClient {
        private final Bucket bucket;
        private final StripeAPI stripeApi;

        public StripeClient() {
            // Assumed budget of 100 requests/second per key; confirm against Stripe's current docs
            Bandwidth limit = Bandwidth.classic(100,
                    Refill.greedy(100, Duration.ofSeconds(1)));
            this.bucket = Bucket.builder().addLimit(limit).build();
            this.stripeApi = new StripeAPI();
        }

        public ChargeResponse createCharge(ChargeRequest request) {
            try {
                // Block until a token is available (max wait: 500ms)
                if (bucket.asBlocking().tryConsume(1, Duration.ofMillis(500))) {
                    return stripeApi.createCharge(request);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt status
            }
            throw new RateLimitException("Outbound rate limit to Stripe exceeded");
        }
    }

Key insight for interviews: Ask whether the rate limiting is inbound or outbound.
Many candidates only think about inbound (protecting their own service) and miss the
equally important outbound case (respecting third-party API limits).
Summary
| Concept | Key Takeaway |
|---|---|
| What it is | Control how many requests a client can make in a time window |
| Why it matters | Abuse prevention, fair usage, cost control, SLA enforcement |
| Core unit | Limit + Window + Action (reject/delay/queue) |
| Types | User, IP, API key, endpoint, global, geographic, concurrent |
| vs Throttling | Rate limiting rejects; throttling delays |
| vs Circuit Breaker | Rate limiting is inbound request count; CB is outbound failure rate |
| HTTP code | 429 Too Many Requests + Retry-After header |
| Where to implement | Multiple layers: Edge, LB, Gateway, Application, Service Mesh |
| Granularity | Per second to per month, often stacked as composite limits |
| Hard vs Soft | Hard = always reject; Soft = warn before cutting off |
| Inbound vs Outbound | Protect yourself AND respect others' limits |
Next: Part 2 - Rate Limiting Algorithms Deep Dive
Learn exactly how Fixed Window, Sliding Window, Token Bucket, Leaky Bucket, and GCRA work -
with code, visuals, and the precise trade-offs between them.