Rate limiting (token, leaky, rolling)
Algorithms, distributed coordination, hot-path latency, and the operational pitfalls that bite in production.
Rate limiting looks like a counter. It is not. It is a distributed coordination problem with a single-digit millisecond budget. Get the algorithm wrong and you either let clients burst your origin into the ground or you frustrate legitimate users with false 429s. Get the placement wrong and your protection is bypassed. Get the math wrong and you cannot explain to a customer why their burst was rejected.
The four algorithms in detail
Fixed window
Count requests in the current calendar minute. If count > limit, reject. Reset to zero at the top of the next minute.
Pros: trivial to implement, one INCR and one EXPIRE in Redis.
Cons: a client can send limit requests at 12:00:59 and another limit at 12:01:00, effectively doubling the limit at the window boundary. For a 100 req/min limit, that is 200 requests in 2 seconds. Real attackers know this.
Sliding window log
Store a timestamp for every request in the last window. Count how many are within now - window. Reject if over limit.
Pros: exact. No boundary problem.
Cons: O(N) memory per key. For a customer doing 1000 req/min you store 1000 timestamps. At scale this kills Redis.
Sliding window counter
The compromise. Store the current window count and the previous window count. Estimate as current + previous * (1 - elapsed_fraction_of_current). Cloudflare uses this. Accurate within a few percent, O(1) memory.
Token bucket
Bucket of capacity B, refill rate R tokens per second. A request takes 1 token. If bucket empty, reject. The state is two numbers: tokens remaining and last-refill timestamp. Refill lazily when a request arrives.
tokens = min(B, tokens + (now - last_refill) * R)
last_refill = now
if tokens >= 1:
tokens -= 1
allow
else:
reject
This is the algorithm AWS, Stripe, GitHub, and most production APIs ship. It is intuitive: "you get 100 tokens, you refill at 10/s, do what you want."
Leaky bucket (GCRA variant)
Generic Cell Rate Algorithm. Mathematically equivalent to leaky bucket but stores only one timestamp instead of a queue. The "theoretical arrival time" of the next allowed request. Brandur's writeup at brandur.org is the canonical explainer.
Distributed coordination is the real problem
You have 50 API pods. A client hits whichever one the load balancer picks. Where does the counter live?
Option A: per-pod counter. No coordination. But limit is multiplied by pod count. A 100 req/s limit becomes 5000 req/s if you have 50 pods. Useless for fairness.
Option B: Redis with atomic Lua. One source of truth. Every request adds 1 Redis round-trip, typically 0.3-1ms in the same AZ. Works up to roughly 100k limiter ops/sec per Redis node.
Option C: probabilistic local + periodic sync. Each pod has its own bucket. Periodically syncs with central store. Faster but limit is approximate. Used at very high scale (Cloudflare, Discord).
Option D: consistent hashing. Shard limiter keys to specific pods. Pod owns its keys, no coordination. Failure of a pod loses some keys briefly. This is what high-scale systems use.
For your first production system, do option B. Move to D only when Redis becomes the bottleneck.
Where to put the limiter
Multiple layers, defense in depth:
- CDN / edge (Cloudflare, Fastly). Per-IP. Catches volumetric attacks before they hit your infra. Often free with the CDN.
- API gateway (Kong, Envoy, AWS APIG). Per-API-key, per-route. Coarse-grained protection.
- Service code. Per-tenant, per-endpoint, per-resource. Fine-grained, business logic aware. "User cannot create more than 10 invoices per minute."
- Database connection pool. Not really a rate limiter, but a hard cap on concurrency that backstops everything else.
A request that gets through all four layers is one your system definitely wants.
Hot-path latency budget
This is what I missed in the Spur interview. The rate limiter runs on every single request. If your p99 latency budget is 200ms and your limiter adds 5ms, you spent 2.5% of your budget on rate limiting. If it adds 50ms (a slow Postgres query), you spent 25%.
The math forces specific choices:
- No Postgres. Row locks are too slow and lock contention is brutal.
- Yes Redis with Lua. Single round trip, atomic, <1ms in-AZ.
- Yes in-process for ultra-hot paths. Bloom filters, local token buckets for non-critical fairness.
HTTP semantics
429 Too Many Requests is the only correct status code. From RFC 6585.
Headers to return:
Retry-After: 30- seconds the client should wait. Required.X-RateLimit-Limit: 100- the limit ceiling.X-RateLimit-Remaining: 0- how many they have left.X-RateLimit-Reset: 1718956800- unix timestamp when the bucket refills.
Good clients (Stripe SDK, AWS SDK) read these and back off. Bad clients ignore them and retry immediately, so your limiter needs to also protect against retry storms by being cheap to evaluate.
Failure modes and gotchas
Thundering herd on reset
Fixed window resets at the top of the minute. Every blocked client retries at exactly 12:01:00. You just created a coordinated attack. Fix: sliding window, or add jitter to Retry-After.
Limit per IP behind NAT
Mobile carriers NAT thousands of users behind one IP. Per-IP limits will block legit traffic. Limit per API key or per user ID instead, with per-IP as a wide safety net (say, 10x the per-user limit).
Cold start of new keys
First request for a new key creates a Redis entry. If your TTL is the window length, you pay this cost once per inactive period. Use a longer TTL or warm cache for known keys.
Limiter is the SPOF
If Redis dies, your whole API dies. Options:
- Fail open. Allow requests when limiter is unreachable. Risk: a real attack during a Redis outage takes down your origin.
- Fail closed. Reject when limiter is unreachable. Risk: a small Redis blip becomes a global outage.
- Local fallback. Each pod has a generous local limit, used only when Redis is unreachable. Best of both.
Multi-region
Global limits across regions require cross-region Redis or a global tier. Often not worth it. Most systems set per-region limits at total_limit / n_regions.
What I would build today
For a B2B API:
- CloudFront / Cloudflare in front. Per-IP, 1000 req/min. Free DDoS protection.
- API gateway. Per-API-key, token bucket. 100 sustained, 200 burst.
- Service-level. Per-tenant per-endpoint. Redis + Lua. Fail open with local backup.
- Postgres pgbouncer. Connection cap. Backstop.
- Observability. Histogram of
X-RateLimit-Remainingper key. Alert when any key sits at 0 for >5 minutes (either they need more quota, or they have a bug).
That stack handles everything from credential stuffing to a noisy paid customer, and it costs almost nothing to run.
Learn more
- ArticleCloudflare: How we built rate limitingCloudflare blog
- ArticleStripe: Scaling your API with rate limitersStripe blog
- ArticleGCRA - A rate limiting algorithmbrandur.org
- ArticleDesigning Data-Intensive ApplicationsMartin Kleppmann
- DocsRedis INCR patternRedis docs