6/6/2026 | Admin Post

Rate Limiting Demystified - Part 1: Fundamentals

Series Navigation: Index | Part 2 - Algorithms | Part 3 - Implementation | Part 4 - Distributed |
Part 5 - Advanced | Part 6 - Interview Questions


Table of Contents

  1. What Is Rate Limiting?
  2. Why Rate Limiting Matters
  3. Core Terminology
  4. Types of Rate Limiting
  5. Rate Limiting vs Related Concepts
  6. HTTP Standards for Rate Limiting
  7. Where to Implement Rate Limiting
  8. Rate Limiting Granularity and Scope
  9. Soft Limits vs Hard Limits
  10. Inbound vs Outbound Rate Limiting

1. What Is Rate Limiting?

Rate limiting is the practice of controlling how many requests a client (user, IP address, API key,
or application) can make to a service within a defined time period.

Think of it like a highway toll booth. During rush hour, only a certain number of cars can pass
through per minute. If too many cars arrive at once, some must wait or are turned away. The highway
itself does not break - it simply enforces a flow limit.

The Everyday Analogy

Imagine a coffee shop with one barista. That barista can make 10 coffees per minute. If 100 people
walk in at the same moment and all order at once, the barista cannot keep up. The shop has
several options:

  • Put a limit on how many orders you accept per minute (rate limiting)
  • Queue extra orders and process them later (throttling)
  • Tell people to come back after a specific time (backoff)
  • Close the shop temporarily under extreme load (circuit breaker)

Rate limiting is the first line of defense: control the input before the system is overwhelmed.

Formal Definition

Rate Limiting is a technique that restricts the number of requests a sender can make to a
receiver within a given time window. Requests that exceed the limit are rejected, delayed,
or queued depending on the policy.


2. Why Rate Limiting Matters

Rate limiting is not optional for production systems. Here is what it protects against:

2.1 Abuse and DoS/DDoS Prevention

Without rate limiting, a single malicious actor or a buggy client can flood your API with
thousands of requests per second. Whether deliberate or accidental, the effect is a Denial of
Service (DoS). Rate limiting caps what a single source can do, which blunts simple single-source
attacks. Distributed attacks (DDoS) come from many sources and also need defenses at the edge.

Real example: A competitor sends 50,000 requests/second to your public pricing API. Without
rate limiting, your database connection pool is exhausted and your service becomes unavailable
for legitimate users.

2.2 Fair Resource Distribution

In a multi-tenant system, one heavy user should not degrade the experience for others. Rate
limiting enforces fairness - every user gets their fair share of system resources.

Real example: A SaaS platform has 10,000 customers sharing a database. One customer runs a
report that generates 1,000 API calls per second. Rate limiting ensures that customer does not
starve the other 9,999.

2.3 Cost Control

Cloud services charge per API call, per compute unit, or per database read. An uncontrolled client
(or your own code with a bug) can run up a massive bill in hours.

Real example: A developer accidentally deploys code with an infinite retry loop that hits
your database 10,000 times per second. Without rate limiting on the calling service, the AWS
bill for that month is catastrophic.

2.4 API Monetization and Tiering

Rate limiting is the technical enforcement mechanism behind paid API tiers. Free users get 100
calls/day, Pro users get 10,000, Enterprise users get unlimited. This business model only works
if rate limits are enforced.

Real example: GitHub's REST API gives unauthenticated users 60 requests/hour. Authenticated
users get 5,000 requests/hour, and some GitHub Enterprise Cloud integrations get higher limits.

2.5 Protecting Downstream Dependencies

Your API may call databases, third-party services, or internal microservices. If your own API
receives unlimited requests, it will relay that load to every downstream service. Rate limiting
at your API boundary protects your entire dependency chain.

2.6 SLA and Quality of Service

Rate limits help ensure that response time SLAs are met. If load testing shows your service can
handle 1,000 requests/second while keeping p99 latency under 100ms, capping total traffic at
1,000 RPS keeps the service inside the range where that SLA holds.


3. Core Terminology

Understanding these terms precisely is essential. Interviewers probe for exact definitions.

Rate

The number of requests per unit of time. Often expressed as:

  • RPS: Requests Per Second
  • RPM: Requests Per Minute
  • RPH: Requests Per Hour
  • RPD: Requests Per Day

Limit

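The maximum allowed count within the window. If the limit is 100 and the window is 60 seconds,
the client can make at most 100 requests per 60-second window. Whether that also holds for every
arbitrary 60-second span depends on the algorithm: a fixed window can admit more across a window
boundary, a sliding window cannot.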

Window

The time period over which requests are counted. Common window types:

  • Fixed (Tumbling) Window: A discrete, non-overlapping time block. 12:00:00 to 12:01:00 is
    one window. 12:01:00 to 12:02:00 is the next.
  • Sliding Window: A continuous rolling window that moves with each request. "The last 60
    seconds" from the current moment.
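To make the fixed-window definition concrete, here is a minimal in-memory sketch in Python. The class name, the injected `now` timestamp, and the dictionary storage are illustrative assumptions rather than any library's API, and old windows are never evicted here.

```python
from collections import defaultdict


class FixedWindowCounter:
    """Counts requests per client in discrete, non-overlapping windows."""

    def __init__(self, limit: int, window_seconds: int) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        # (client_id, window_index) -> requests seen in that window
        self.counts = defaultdict(int)

    def allow(self, client_id: str, now: float) -> bool:
        # Every timestamp inside the same time block maps to the same index.
        window_index = int(now // self.window_seconds)
        key = (client_id, window_index)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True
```

With `window_seconds=60`, timestamps 0 through 59.9 share window 0, and the counter starts again from zero at timestamp 60.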

Burst

A short-term spike in traffic that exceeds the average rate but is still within acceptable
bounds. Burst handling is what separates token bucket from leaky bucket.

Example: Your limit is 100 RPM. A user normally sends 50 RPM, so unused capacity accumulates as
"credit". At one moment they send 80 requests in 5 seconds. Token bucket allows this burst as long
as at least 80 tokens have accumulated (which requires a bucket capacity of at least 80). A leaky
bucket that drains at a constant rate does not: it spaces the requests out or drops the overflow.
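The burst scenario can be sketched with a small token bucket. This is a simplified illustration, not a production implementation: the class, its parameter names, and the choice of a capacity equal to the per-minute limit are assumptions made for the example.

```python
class TokenBucket:
    """Token bucket: tokens refill at a steady rate up to a fixed capacity."""

    def __init__(self, capacity: float, refill_per_second: float, now: float) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start with a full bucket
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Credit the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# 100 RPM limit: capacity 100, refilled at 100/60 tokens per second.
bucket = TokenBucket(capacity=100, refill_per_second=100 / 60, now=0.0)
# 80 requests spread over 5 seconds all pass because unused tokens were banked.
burst = [bucket.allow(now=i * 5 / 80) for i in range(80)]
print(all(burst))  # True
```

The capacity is the knob that decides how large a burst is tolerated; the refill rate decides the sustained average.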

Quota

A longer-term limit, often measured in days or months. Quotas are different from rate limits:

  • Rate limit: 100 requests/minute (controls speed)
  • Quota: 10,000 requests/day (controls total volume)

A system can have both: "You can send up to 100 RPM, but no more than 10,000 per day total."

Throttle

The action taken when a limit is exceeded. Throttling can mean:

  1. Rejecting the request immediately (HTTP 429)
  2. Delaying the request (queuing/slowing it down)
  3. Degrading the response (returning cached or lower-quality data)

Backpressure

A signal sent from a downstream service to an upstream caller telling it to slow down. Unlike
rate limiting (which enforces on the receiver), backpressure is communicated back to the sender.
Rate limiting is a form of enforced backpressure.

Jitter

Intentional randomization added to retry timers. When all clients hit the rate limit at the
same time and all retry after exactly 60 seconds, they create a synchronized burst called a
"thundering herd." Jitter breaks this synchronization.

Idempotency Key

A unique identifier sent by the client to allow safe retries. If a request is rate limited and
the client retries, the server can use the idempotency key to detect it is the same logical
operation and not double-charge or double-process.
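A toy sketch of the server side of this idea follows. `PaymentService`, its in-memory dictionary, and the response shape are hypothetical; a real service would persist the keys and expire them.

```python
class PaymentService:
    """Toy service that remembers results by idempotency key (in memory)."""

    def __init__(self) -> None:
        self.results = {}      # idempotency_key -> stored response
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry with a known key returns the stored result instead of charging again.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.charges_made += 1
        response = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self.results[idempotency_key] = response
        return response


svc = PaymentService()
first = svc.charge("key-1", 500)
retry = svc.charge("key-1", 500)   # same logical operation, retried after a 429
print(first == retry, svc.charges_made)  # True 1
```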


4. Types of Rate Limiting

4.1 User-Level Rate Limiting

Limits based on authenticated user identity. The most precise and fairest approach.

Key format: "rate_limit:user:{user_id}"
Example:    "rate_limit:user:user_123456"

  • Requires authentication to be in place
  • Survives IP changes (mobile users on cellular networks change IPs constantly)
  • Can be tied to subscription tiers

4.2 IP-Based Rate Limiting

Limits based on the client's IP address. The easiest to implement, often used as a first
line of defense even for unauthenticated endpoints.

Key format: "rate_limit:ip:{ip_address}"
Example:    "rate_limit:ip:203.0.113.42"

Caution: IPv4 exhaustion has led to widespread use of NAT (Network Address Translation).
Many users behind a corporate firewall or mobile carrier share a single IP. Limiting by IP
can accidentally block many legitimate users. Also, attackers can rotate IPs easily.

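Caution: When behind a load balancer or reverse proxy, the connection IP is your proxy, so read
the real client IP from the X-Forwarded-For or X-Real-IP header. Only trust these headers when
they are set by your own proxy: clients can send arbitrary values, so a spoofed header lets an
attacker evade the limit or exhaust someone else's.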
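One way to apply that caution is to accept X-Forwarded-For only when the direct peer is a known proxy, and to walk the header from right to left. The sketch below assumes you can list your proxies as CIDR ranges; the function name and parameters are illustrative.

```python
import ipaddress
from typing import Iterable


def client_ip(remote_addr: str, x_forwarded_for: str,
              trusted_proxies: Iterable[str]) -> str:
    """Pick the client IP, trusting X-Forwarded-For only from known proxies."""
    trusted = [ipaddress.ip_network(net) for net in trusted_proxies]

    def is_trusted(ip: str) -> bool:
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            return False
        return any(addr in net for net in trusted)

    # If the direct peer is not one of our proxies, ignore the header entirely.
    if not is_trusted(remote_addr):
        return remote_addr

    # Walk the chain right to left; the first untrusted hop is the client.
    hops = [h.strip() for h in x_forwarded_for.split(",") if h.strip()]
    for hop in reversed(hops):
        if not is_trusted(hop):
            return hop
    return remote_addr


# The spoofed left-most entry (1.1.1.1) is ignored.
print(client_ip("10.0.0.5", "1.1.1.1, 203.0.113.42, 10.0.0.7", ["10.0.0.0/8"]))
```

Reading from the right matters because each proxy appends to the header, so only the right-most entries were written by infrastructure you control.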

4.3 API Key-Based Rate Limiting

Limits tied to API credentials. Commonly used for machine-to-machine APIs and developer
platforms.

Key format: "rate_limit:apikey:{api_key_hash}"

  • Hash the API key before using it as a Redis key (never store raw credentials)
  • Different API keys can have different limits based on the account tier
  • Attach the limit configuration to the owning account rather than to the key itself, so keys
    can be rotated without impacting the user
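The key formats from sections 4.1 to 4.3 can be built with a few helpers. This is a sketch: the function names are made up for illustration, and SHA-256 is one reasonable choice of hash, not something the key format prescribes.

```python
import hashlib


def user_key(user_id: str) -> str:
    return f"rate_limit:user:{user_id}"


def ip_key(ip_address: str) -> str:
    return f"rate_limit:ip:{ip_address}"


def api_key_key(raw_api_key: str) -> str:
    # Store only a digest, so the counter store never sees the raw credential.
    digest = hashlib.sha256(raw_api_key.encode("utf-8")).hexdigest()
    return f"rate_limit:apikey:{digest}"


print(user_key("user_123456"))  # rate_limit:user:user_123456
print(ip_key("203.0.113.42"))   # rate_limit:ip:203.0.113.42
```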

4.4 Endpoint-Level Rate Limiting

Different limits for different API endpoints based on their cost and sensitivity.

Endpoint             | Limit    | Reason
GET /api/products    | 1000/min | Read-only, cheap
POST /api/orders     | 10/min   | Writes, expensive
POST /api/auth/login | 5/min    | Security - prevent brute force
POST /api/export     | 2/hour   | Very expensive operation
GET /api/search      | 100/min  | Moderate - search is expensive

This is best practice: not all endpoints are equal. A search endpoint that queries Elasticsearch
should have a tighter limit than a simple GET of a cached product detail page.

4.5 Global (System-Wide) Rate Limiting

A ceiling on total requests to the entire system, regardless of source. Used to protect
infrastructure capacity.

Key: "rate_limit:global:system"
Limit: 50,000 RPS (because that's what infrastructure can handle)

This is a hard ceiling. Even if no single user is at their individual limit, if aggregate
traffic hits the global limit, new requests are rejected.

4.6 Geographic Rate Limiting

Limits based on country, region, or data center. Used for:

  • Compliance (GDPR region restrictions)
  • Cost optimization (traffic from specific regions is more expensive)
  • Fraud detection (unusual traffic from a new geography)

4.7 Concurrent Connection Rate Limiting

Instead of counting requests over time, this limits how many simultaneous open requests or
connections a client can have. Useful for:

  • WebSocket connections
  • Long-polling endpoints
  • File download/upload slots

# Nginx example: limit concurrent connections per IP
# (limit_conn_zone belongs in the http context; limit_conn in http, server, or location)
limit_conn_zone $binary_remote_addr zone=conn_limit:10m;
limit_conn conn_limit 10;
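At the application level, the same idea amounts to counting in-flight requests instead of requests per window. The following is a minimal single-process sketch; `ConcurrencyLimiter` and its methods are hypothetical names, and a multi-instance deployment would need shared state.

```python
import threading
from collections import defaultdict
from contextlib import contextmanager


class ConcurrencyLimiter:
    """Caps the number of in-flight requests per client."""

    def __init__(self, max_concurrent: int) -> None:
        self.max_concurrent = max_concurrent
        self.in_flight = defaultdict(int)
        self.lock = threading.Lock()

    def try_acquire(self, client_id: str) -> bool:
        with self.lock:
            if self.in_flight[client_id] >= self.max_concurrent:
                return False
            self.in_flight[client_id] += 1
            return True

    def release(self, client_id: str) -> None:
        with self.lock:
            if self.in_flight[client_id] > 0:
                self.in_flight[client_id] -= 1

    @contextmanager
    def slot(self, client_id: str):
        """Yield True if a slot was acquired; release it when the block exits."""
        acquired = self.try_acquire(client_id)
        try:
            yield acquired
        finally:
            if acquired:
                self.release(client_id)
```

A handler would wrap its work in `with limiter.slot(client_id) as ok:` and return 429 when `ok` is false. Unlike a windowed counter, the budget is returned as soon as the request finishes.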

4.8 Compound/Multi-Dimensional Rate Limiting

Real production systems combine multiple limit types simultaneously:

User: 100 requests/minute AND 10,000 requests/day
Endpoint POST /api/export: 2 requests/hour per user
Global: 50,000 requests/second across all users

A request must pass ALL applicable limits to be accepted.
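A sketch of the "pass ALL limits" rule, using simple fixed-window counters held in memory. The class, the rule tuples, and the key names are assumptions for illustration; the numbers are the ones listed above.

```python
from collections import defaultdict


class CompoundLimiter:
    """Accepts a request only if every applicable (key, limit, window) rule passes."""

    def __init__(self) -> None:
        self.counts = defaultdict(int)  # (key, window_seconds, window_index) -> count

    def allow(self, rules, now: float) -> bool:
        # rules: list of (key, limit, window_seconds)
        slots = [(key, window, int(now // window)) for key, _, window in rules]
        # Check every rule first, so a rejected request consumes no budget anywhere.
        for slot, (_, limit, _) in zip(slots, rules):
            if self.counts[slot] >= limit:
                return False
        for slot in slots:
            self.counts[slot] += 1
        return True


def rules_for(user_id: str, endpoint: str):
    """Build the rule list from the limits listed above."""
    rules = [
        (f"user:{user_id}:minute", 100, 60),
        (f"user:{user_id}:day", 10_000, 86_400),
        ("global", 50_000, 1),
    ]
    if endpoint == "POST /api/export":
        rules.append((f"user:{user_id}:export", 2, 3_600))
    return rules
```

Checking before incrementing is a deliberate choice: if the counters were bumped one by one, a request rejected by the export rule would still eat into the user's per-minute and daily budgets.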


5. Rate Limiting vs Related Concepts

These concepts are frequently confused in interviews. Know the differences precisely.

5.1 Rate Limiting vs Throttling

Aspect            | Rate Limiting                 | Throttling
Action on excess  | Reject (429)                  | Delay or slow down
Client experience | Gets an error                 | Gets a slower response
Queue             | No                            | Yes (requests are queued)
Use case          | Hard limits, abuse prevention | Smoothing traffic, prioritization
Example           | "You can only send 100 RPS. Request 101 is rejected." | "You sent 100 RPS. Request 101 is queued for 100ms."

In practice, the terms are often used interchangeably in conversation but they have
technically different behaviors.

5.2 Rate Limiting vs Circuit Breaker

Aspect       | Rate Limiting                  | Circuit Breaker
Triggered by | Incoming request count         | Downstream failure rate
Purpose      | Protect from too many requests | Protect from cascading failures
Direction    | Inbound traffic control        | Outbound call protection
Pattern      | Counts requests over time      | Monitors error rates / latencies
State        | Counters per client and window | State machine (CLOSED, OPEN, HALF-OPEN)
Example      | "Client is sending too many requests" | "Database is failing 50% of calls, stop calling it"
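For contrast with a request counter, here is a minimal sketch of the circuit breaker's three states. It is a simplification: it trips on consecutive failures rather than a failure rate, and the class and method names are illustrative.

```python
class CircuitBreaker:
    """Minimal CLOSED -> OPEN -> HALF-OPEN state machine for outbound calls."""

    def __init__(self, failure_threshold: int, reset_timeout: float) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def allow_call(self, now: float) -> bool:
        if self.state == "OPEN":
            if now - self.opened_at >= self.reset_timeout:
                self.state = "HALF-OPEN"  # timeout elapsed: start probing again
                return True
            return False
        return True

    def record_success(self) -> None:
        self.state = "CLOSED"
        self.failures = 0

    def record_failure(self, now: float) -> None:
        self.failures += 1
        # A failed probe, or too many failures in a row, opens the circuit.
        if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = now
```

Note that nothing here counts how many calls were made; the only input is whether the downstream calls succeeded.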

5.3 Rate Limiting vs Load Shedding

Aspect         | Rate Limiting                        | Load Shedding
Trigger        | Per-client request count             | Overall system overload
Granularity    | Per user/IP/key                      | Entire request classes
Criteria       | "Did this user exceed their quota?"  | "Is system CPU > 90%? Drop lowest priority requests."
Implementation | Redis counter per client             | System metrics + priority queue

Load shedding is more drastic - it drops requests based on system health, not per-client behavior.

5.4 Rate Limiting vs Backpressure

Aspect     | Rate Limiting | Backpressure
Direction  | Receiver enforces a limit and rejects the excess | Receiver signals the sender how much it can accept
Protocol   | HTTP 429 response | Flow control signals (reactive streams, TCP window)
Where used | HTTP APIs | Message queues, stream processing, reactive systems
Example    | REST API returns 429 | TCP receiver advertises a smaller window, so the sender slows down

6. HTTP Standards for Rate Limiting

6.1 HTTP Status Codes

429 Too Many Requests (RFC 6585)
This is the correct status code when a rate limit is exceeded. The response MAY include a
Retry-After header (RFC 6585 makes it optional); including one is strongly recommended so
clients know when to retry.

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 60
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1735689600
 
{
    "error": "rate_limit_exceeded",
    "message": "You have exceeded the rate limit of 100 requests per minute.",
    "retry_after_seconds": 60
}

503 Service Unavailable is sometimes used when global rate limits or circuit breakers trip,
but 429 is semantically more accurate for per-client rate limiting.

503 vs 429:

  • Use 429 when a specific client is exceeding their rate limit
  • Use 503 when the entire service is overwhelmed and cannot handle any requests

6.2 Standard Rate Limiting Response Headers

These headers inform clients about their current rate limit status. The X-RateLimit-* names are
a widely used convention rather than a formal standard; only Retry-After is defined by an RFC.

Header                | Description                                  | Example
X-RateLimit-Limit     | Total requests allowed in the current window | X-RateLimit-Limit: 100
X-RateLimit-Remaining | Requests remaining in the current window     | X-RateLimit-Remaining: 73
X-RateLimit-Reset     | Unix timestamp when the window resets        | X-RateLimit-Reset: 1735689600
Retry-After           | Seconds to wait before retrying, or an HTTP date (RFC 9110, previously RFC 7231) | Retry-After: 60
X-RateLimit-Used      | Number of requests used (GitHub style)       | X-RateLimit-Used: 27
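On the server side these headers can be derived from the limiter's counters. A minimal sketch, assuming the limiter exposes the limit, the number of requests used, and the reset time as a Unix timestamp (the function and parameter names are illustrative):

```python
import math


def rate_limit_headers(limit: int, used: int, reset_epoch: int, now: float) -> dict:
    """Build X-RateLimit-* headers; add Retry-After once the limit is exhausted."""
    remaining = max(0, limit - used)
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(remaining),
        "X-RateLimit-Reset": str(reset_epoch),
        "X-RateLimit-Used": str(used),
    }
    if remaining == 0:
        # Seconds until the window resets, rounded up and never negative.
        headers["Retry-After"] = str(max(0, math.ceil(reset_epoch - now)))
    return headers


print(rate_limit_headers(limit=100, used=27, reset_epoch=1735689600, now=1735689570.0))
```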

6.3 IETF RateLimit Header Field Draft

The IETF HTTPAPI working group maintains a draft for standardized rate limit headers
(draft-ietf-httpapi-ratelimit-headers). The goal is to replace vendor-specific headers with:

RateLimit-Limit: 100
RateLimit-Remaining: 73
RateLimit-Reset: 60

The RateLimit-Policy header (from the draft) allows exposing policy details:

RateLimit-Policy: 100;w=60;burst=200;comment="sliding window"

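Many major APIs still use vendor-specific conventions such as X-RateLimit-* (GitHub) or
x-rate-limit-* (Twitter). The IETF draft is gaining adoption but is not yet universal, and its
field names and syntax have changed between revisions, so check the current revision before
implementing it.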

6.4 Client Responsibilities

A well-behaved API client should:

  1. Read X-RateLimit-Remaining on every response
  2. Proactively slow down when remaining is low (not wait for 429)
  3. Respect Retry-After header when 429 is received
  4. Implement exponential backoff with jitter for retries
  5. Use idempotency keys for safe retries on non-idempotent operations
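Respecting Retry-After means handling both of its forms: a number of seconds or an HTTP date. The sketch below also falls back to X-RateLimit-Reset as a Unix timestamp. It assumes header names arrive with exactly this capitalization and that `now` is a timezone-aware datetime; the function name is illustrative.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_delay_seconds(headers: Mapping[str, str], now: datetime) -> Optional[float]:
    """Return how long to wait based on response headers, or None if unknown."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        value = retry_after.strip()
        if value.isdigit():
            return float(value)                  # delay-seconds form
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            return max(0.0, (when - now).total_seconds())
    reset = headers.get("X-RateLimit-Reset")
    if reset is not None and reset.strip().isdigit():
        # Fall back to the reset time, interpreted as Unix seconds.
        return max(0.0, int(reset) - now.timestamp())
    return None


now = datetime(2025, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
print(retry_delay_seconds({"Retry-After": "60"}, now))  # 60.0
```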

6.5 Full Response Example

HTTP/1.1 200 OK
Content-Type: application/json
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1735689660
X-RateLimit-Used: 153
X-RateLimit-Resource: core

HTTP/1.1 429 Too Many Requests
Content-Type: application/json
Retry-After: 3600
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1735689660
 
{
    "message": "API rate limit exceeded for user 12345.",
    "documentation_url": "https://docs.example.com/rate-limits"
}

7. Where to Implement Rate Limiting

Rate limiting can be applied at multiple layers. Each layer has distinct trade-offs.

7.1 Layer Overview

[Client] --> [CDN/Edge] --> [Load Balancer] --> [API Gateway] --> [Application] --> [Database]
   |               |               |                  |                |
 Client-side    Geographic     IP-based          API Key-based    Business logic
  retry         limits &       coarse-grain      & user-based     fine-grain
  logic         DDoS           rate limits       rate limits      rate limits

7.2 Client Side

Purpose: Respect upstream limits proactively. Prevent self-inflicted 429 errors.

Implementation:

  • Track requests made and respect headers from server
  • Implement exponential backoff with jitter
  • Use request queuing libraries
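Exponential backoff with jitter can be sketched in a few lines. This uses the "full jitter" variant, where the delay is drawn uniformly between zero and the capped exponential value; the function name, defaults, and injectable `rng` are assumptions for the example.

```python
import random
from typing import Optional


def backoff_with_jitter(attempt: int, base: float = 1.0, cap: float = 60.0,
                        rng: Optional[random.Random] = None) -> float:
    """Return a randomized delay in seconds for the given retry attempt (0-based)."""
    source = rng if rng is not None else random
    # Exponential growth (base, 2*base, 4*base, ...) capped at `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: any value between 0 and the ceiling is equally likely.
    return source.uniform(0.0, ceiling)


for attempt in range(5):
    print(attempt, round(backoff_with_jitter(attempt), 2))
```

Because each client draws a different delay, clients that were rejected at the same moment do not all retry at the same moment.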

Best for: SDK clients, service-to-service calls, batch processing jobs

Limitation: You cannot rely on clients to self-limit. Always enforce on the server side too.

7.3 CDN / Edge Layer (Cloudflare, Akamai, Fastly)

Purpose: Block volumetric attacks before they reach your origin. Geographic filtering.

Implementation:

  • Cloudflare Rate Limiting rules: define rate, window, response
  • Akamai rate controls (rate policies in its application security products)
  • WAF (Web Application Firewall) rate limiting rules

Best for: DDoS protection, bot mitigation, geographic blocking

Limitation: Usually limited to IP, geography, and raw request attributes. The edge typically
cannot verify authenticated user identity or subscription tier.

7.4 Load Balancer (Nginx, HAProxy, AWS ALB)

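Purpose: Coarse-grained IP-based rate limiting before requests hit your application servers.
Nginx and HAProxy support this natively; AWS ALB has no built-in request rate limiting and is
typically paired with AWS WAF rate-based rules.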

Nginx example:

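http {
    # One shared-memory zone keyed by client IP: 10 MB of state, 100 requests/minute
    limit_req_zone $binary_remote_addr zone=per_ip:10m rate=100r/m;
    # Nginx rejects with 503 by default; return 429 instead
    limit_req_status 429;
 
    server {
        location /api/ {
            # Allow bursts of up to 20 extra requests without delaying them
            limit_req zone=per_ip burst=20 nodelay;
            proxy_pass http://backend;
        }
    }
}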

Best for: First line of defense, protecting application servers from IP-based floods.

Limitation: No knowledge of business context (user tiers, subscription limits).

7.5 API Gateway (Kong, AWS API Gateway, Apigee, Envoy)

Purpose: User/API key-based rate limiting with full request context.

Implementation:

  • AWS API Gateway: Usage Plans with throttle settings per stage/method
  • Kong: rate-limiting plugin per route, service, or consumer
  • Apigee: Quota and SpikeArrest policies

Best for: API-as-a-product platforms, monetization, multi-tenant SaaS

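Limitation: Gateway is a shared component, so limits that depend on business logic (for example,
per-request cost or rules computed from account state) are harder to express than in
application code.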

7.6 Application Layer (Your Code)

Purpose: Business-logic-aware rate limiting. User tier awareness, endpoint cost awareness.

Implementation:

  • Spring Boot Filter or Interceptor
  • Python middleware
  • Node.js middleware

Best for: Fine-grained control, business rule enforcement, subscription tier enforcement

Limitation: Runs in every application instance, requires distributed state (Redis) to work
across instances.
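As an illustration of the middleware approach, here is a small WSGI middleware with a per-user fixed window. It is a single-process sketch: counters live in memory and are not thread-safe, the `X-User-Id` header is assumed to be set by an earlier authentication layer, and the class name is made up for the example.

```python
import math
import time
from collections import defaultdict


class RateLimitMiddleware:
    """WSGI middleware applying a per-user fixed-window limit (in-memory)."""

    def __init__(self, app, limit: int, window_seconds: int, clock=time.time) -> None:
        self.app = app
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.counts = defaultdict(int)  # (user_id, window_index) -> count

    def __call__(self, environ, start_response):
        # Assumption: an earlier auth layer put the user id in the X-User-Id header.
        user_id = environ.get("HTTP_X_USER_ID", "anonymous")
        now = self.clock()
        window_index = int(now // self.window_seconds)
        key = (user_id, window_index)
        if self.counts[key] >= self.limit:
            # Seconds until the next window starts, at least 1.
            window_end = (window_index + 1) * self.window_seconds
            retry_after = max(1, math.ceil(window_end - now))
            start_response("429 Too Many Requests", [
                ("Content-Type", "application/json"),
                ("Retry-After", str(retry_after)),
            ])
            return [b'{"error": "rate_limit_exceeded"}']
        self.counts[key] += 1
        return self.app(environ, start_response)
```

Wrapping an existing WSGI app looks like `app = RateLimitMiddleware(app, limit=100, window_seconds=60)`. To work across instances, the `counts` dictionary would have to be replaced by shared storage such as Redis.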

7.7 Service Mesh (Istio, Envoy, Linkerd)

Purpose: Rate limiting for internal service-to-service communication.

Envoy example:

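# Fragment of an Envoy route configuration. It only builds the descriptor; the limit applied
# to each user_id value is typically defined in the rate limit service configuration.
rate_limits:
  - actions:
      - request_headers:
          header_name: x-user-id
          descriptor_key: user_id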

Best for: Microservices architectures where individual services need protection from
other internal services (not just external clients).

7.8 Choosing Your Layer

Scenario                           | Recommended Layer
Public API with monetization tiers | API Gateway + Application
Preventing DDoS from unknown IPs   | CDN/Edge or Load Balancer
Per-user business logic limits     | Application layer
Internal microservices             | Service Mesh
Service calling a third-party API  | Client-side
Quick and dirty protection         | Load Balancer
Fine-grained control               | Application layer with Redis

Best practice: Implement rate limiting at MULTIPLE layers. CDN for DDoS, Load Balancer
for IP-based, Application for user-based. Defense in depth.


8. Rate Limiting Granularity and Scope

8.1 Time Window Granularity

Granularity | Use Case                  | Example
Per second  | Real-time APIs, streaming | 10 RPS for video frame API
Per minute  | Standard REST APIs        | 100 RPM for general use
Per hour    | Batch operations          | 500 requests/hour for reports
Per day     | Quota enforcement         | 10,000 requests/day for free tier
Per month   | Billing cycles            | 1,000,000 requests/month for paid tier

8.2 Composite Limits

Production systems often stack multiple limits:

User "free_user_123":
  - Max 10 requests/second (burst control)
  - Max 100 requests/minute (normal rate)
  - Max 5,000 requests/day (daily quota)
  - Max 50,000 requests/month (monthly quota)

A request must pass ALL applicable limits. If any limit is exceeded, the request is rejected.

8.3 The "N requests per window" vs "N requests per rolling window" Distinction

This distinction matters for fairness:

Fixed window (N per window):

  • Window 1: 12:00:00 - 12:01:00 -> allows 100 requests
  • Window 2: 12:01:00 - 12:02:00 -> allows 100 requests
  • Problem: 100 requests at 12:00:59 + 100 at 12:01:01 = 200 requests in 2 seconds

Sliding window (N per rolling window):

  • "At any given moment, no more than 100 requests in the past 60 seconds"
  • More fair but more complex to implement
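The boundary problem is easy to reproduce. The sketch below counts how many requests each approach accepts for the burst described above; timestamps are seconds since 12:00:00, and both functions are simplified in-memory illustrations with made-up names.

```python
from collections import deque


def accepted_fixed(timestamps, limit, window):
    """Count requests accepted by a fixed-window counter."""
    counts = {}
    accepted = 0
    for t in timestamps:
        index = int(t // window)
        if counts.get(index, 0) < limit:
            counts[index] = counts.get(index, 0) + 1
            accepted += 1
    return accepted


def accepted_sliding(timestamps, limit, window):
    """Count requests accepted by a sliding-window log (timestamps ascending)."""
    recent = deque()  # timestamps of accepted requests still inside the window
    accepted = 0
    for t in timestamps:
        # Drop entries that are a full window old or older.
        while recent and recent[0] <= t - window:
            recent.popleft()
        if len(recent) < limit:
            recent.append(t)
            accepted += 1
    return accepted


# 100 requests at 12:00:59 and 100 more at 12:01:01.
burst = [59.0] * 100 + [61.0] * 100
print(accepted_fixed(burst, limit=100, window=60))    # 200
print(accepted_sliding(burst, limit=100, window=60))  # 100
```

The sliding log pays for its fairness with memory: it keeps one timestamp per accepted request, while the fixed window keeps a single counter.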

9. Soft Limits vs Hard Limits

Hard Limits

Requests over the limit are always rejected immediately. No exceptions.

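def check_hard_limit(request_count, limit):
    """Return the HTTP status code under a hard limit."""
    return 429 if request_count > limit else 200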

Used when: Security boundaries, payment APIs, preventing abuse

Soft Limits

Requests over the limit are allowed up to a secondary threshold, or are allowed with
degraded quality of service.

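def check_limits(request_count, soft_limit, hard_limit):
    """Return (status, extra_headers) for a soft/hard limit pair."""
    if request_count > hard_limit:
        return 429, {}
    if request_count > soft_limit:
        # Over the soft limit: still serve the request, but warn the client (and log it here)
        return 200, {"X-RateLimit-Warning": "Approaching limit"}
    return 200, {}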

Used when: User experience is critical, you want to warn before cutting off,
or when the cost of blocking a legitimate user is high.

Grace Period

Allow a short-term burst above the limit before enforcing. A user normally at 90 RPM
is allowed to briefly hit 120 RPM, but sustained exceeding triggers enforcement.


10. Inbound vs Outbound Rate Limiting

This is a distinction many developers miss. Rate limiting applies in BOTH directions.

10.1 Inbound Rate Limiting (Server Side)

Protecting YOUR service from too many incoming requests.

[External Client] ---(too many requests?)---> [Your API] ---(protect)--> [Your DB]

This is what most people think of when they hear "rate limiting."

10.2 Outbound Rate Limiting (Client Side)

Controlling how many requests YOU send to a downstream service.

[Your Service] ---(respect their limits)---> [Stripe API]
                                            [SendGrid API]
                                            [OpenAI API]
                                            [AWS Services]

If you send Stripe more requests per second than its limits allow (say 200 requests/second
against a 100 requests/second limit), it will respond with 429. This can cascade: your API
becomes slow or unavailable because your Stripe integration is being rate limited.

Outbound rate limiting implementation:

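// Using Bucket4j to limit outbound calls to Stripe.
// Written against the Bucket4j 7.x/8.x API. StripeAPI, ChargeRequest, ChargeResponse, and
// RateLimitException (assumed unchecked) are placeholder application types.
import io.github.bucket4j.Bandwidth;
import io.github.bucket4j.Bucket;
import io.github.bucket4j.Refill;

import java.time.Duration;

public class StripeClient {

    private final Bucket bucket;
    private final StripeAPI stripeApi;

    public StripeClient() {
        // Assumed budget of 100 requests/second; check Stripe's current documented limits
        Bandwidth limit = Bandwidth.classic(100,
                            Refill.greedy(100, Duration.ofSeconds(1)));
        this.bucket = Bucket.builder().addLimit(limit).build();
        this.stripeApi = new StripeAPI();
    }

    public ChargeResponse createCharge(ChargeRequest request) {
        try {
            // Block until a token is available (max wait: 500ms)
            if (bucket.asBlocking().tryConsume(1, Duration.ofMillis(500))) {
                return stripeApi.createCharge(request);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        throw new RateLimitException("Outbound rate limit to Stripe exceeded");
    }
}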

Key insight for interviews: Ask whether the rate limiting is inbound or outbound.
Many candidates only think about inbound (protecting their own service) and miss the
equally important outbound case (respecting third-party API limits).


Summary

Concept             | Key Takeaway
What it is          | Control how many requests a client can make in a time window
Why it matters      | Abuse prevention, fair usage, cost control, SLA enforcement
Core unit           | Limit + Window + Action (reject/delay/queue)
Types               | User, IP, API key, endpoint, global, geographic, concurrent
vs Throttling       | Rate limiting rejects; throttling delays
vs Circuit Breaker  | Rate limiting is inbound request count; CB is outbound failure rate
HTTP code           | 429 Too Many Requests + Retry-After header
Where to implement  | Multiple layers: Edge, LB, Gateway, Application, Service Mesh
Granularity         | Per second to per month, often stacked as composite limits
Hard vs Soft        | Hard = always reject; Soft = warn before cutting off
Inbound vs Outbound | Protect yourself AND respect others' limits

Next: Part 2 - Rate Limiting Algorithms Deep Dive

Learn exactly how Fixed Window, Sliding Window, Token Bucket, Leaky Bucket, and GCRA work -
with code, visuals, and the precise trade-offs between them.