Rate Limiting — False 429 Errors from Partial Node Updates
Hot-reload script updated only 3 of 10 nodes — 30% of Enterprise requests got false 429s.
- Rate limiting controls how many requests a client can make in a given time window
- Four components: Rule Store (policy), State Store (counts), Decision Engine (algorithm), Response Handler (headers)
- Token Bucket is the best default for user-facing APIs: bursts up to capacity, steady average via refill
- Redis is the standard shared state store for distributed rate limiters: INCR is atomic and fast (~1ms per decision), and running INCR+EXPIRE inside a Lua script or MULTI/EXEC keeps the pair atomic
- Production insight: a misconfigured Redis failover can silently double your limit; always test degrade scenarios
- Biggest mistake: rate limiting by IP alone — NAT groups thousands of users behind one IP, blocking them all
Imagine a popular theme park ride. The ride can only handle 10 people every 5 minutes — so the staff put up a barrier and only let 10 people through at a time. Everyone else waits in line. Rate limiting is exactly that barrier for your API or service: it controls how many requests are allowed through in a given window of time, so the 'ride' (your server) never gets overwhelmed and everyone gets a fair turn.
Every production system you'll ever work on will eventually face the same villain: a surge of traffic that nobody planned for. It might be a viral moment, a misconfigured client hammering your API in a loop, or a bad actor trying to scrape your data. Without a gate, that surge hits your database, your CPU, and your users — all at once. Rate limiting is that gate, and understanding its internals is the difference between an API that survives launch day and one that doesn't.
The core problem rate limiting solves is resource fairness under pressure. Your server has finite CPU, memory, and I/O bandwidth. If one client can fire 10,000 requests per second, every other client suffers. Rate limiting enforces a contract: you get X requests in Y time, and anything beyond that gets slowed down or rejected. It protects downstream services, enforces pricing tiers (free vs. paid plans), and prevents abuse — all without you needing to scale hardware every time someone writes a bad for-loop.
I’ve implemented rate limiters at three different scale levels — from a 200 RPS internal tool to a public API handling 80k+ RPS during peak hours. By the end of this article you'll know the four main rate limiting components and how they fit together, understand the trade-offs between Token Bucket, Leaky Bucket, Fixed Window, and Sliding Window algorithms, be able to sketch a distributed rate limiter on a whiteboard, and recognise the two mistakes that still burn engineers in production. Let's build it piece by piece.
The Four Core Components Every Rate Limiter Needs
A rate limiter isn't a single thing — it's a pipeline of four cooperating components. Knowing each one's job tells you exactly where to look when something goes wrong in production.
1. The Rule Store holds the policy: who gets how many requests, over what time window, and for which endpoint. Rules might say 'free-tier users get 100 requests/minute on /search, paid users get 1000.' In my last company we kept this in a DynamoDB table with hot-reload support — changing a tier’s limit took effect in under 3 seconds without restarting any service.
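A minimal sketch of what a Rule Store lookup might look like, using the free/paid example above. The `Rule` type, the `RULES` table, the `lookup_rule` helper, and the default fallback are all hypothetical names chosen for illustration; the DynamoDB-backed, hot-reloadable store described here would look different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    limit: int           # maximum requests allowed per window
    window_seconds: int  # length of the window in seconds


# Hypothetical policy table keyed by (tier, endpoint).
RULES = {
    ("free", "/search"): Rule(limit=100, window_seconds=60),
    ("paid", "/search"): Rule(limit=1000, window_seconds=60),
}

# Assumed fallback for unknown tier/endpoint pairs (not from the article).
DEFAULT_RULE = Rule(limit=60, window_seconds=60)


def lookup_rule(tier: str, endpoint: str) -> Rule:
    """Return the policy for a tier and endpoint, or the default rule."""
    return RULES.get((tier, endpoint), DEFAULT_RULE)
```

Keeping rules as plain immutable data makes them easy to version and compare across nodes.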
2. The Counter/State Store is where the actual counting happens. For single-node systems this can be in-memory. For distributed systems it's almost always Redis — because Redis is fast, atomic, and shared across every server in your cluster. This is the most critical component: if it's wrong, your limits are wrong. I once saw a team lose an entire weekend because their Redis cluster was partitioned and two nodes were counting independently.
3. The Decision Engine is the algorithm. It reads the state from the counter store, applies the rule from the rule store, and returns one of three verdicts: ALLOW, THROTTLE (slow down), or REJECT. The algorithm you pick here defines the user experience — smooth and forgiving vs. hard cutoffs.
4. The Response Handler communicates the decision back to the caller. A well-behaved rate limiter doesn't just drop requests silently — it returns HTTP 429 with headers like Retry-After, X-RateLimit-Limit, and X-RateLimit-Remaining so clients can back off gracefully. In one production incident, adding these headers reduced our retry storm by 68% overnight.
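As a rough illustration, the headers named above could be assembled like this. The function name and parameters are assumptions made for the sketch; how the headers get attached to a response depends on your web framework.

```python
import math


def build_429_headers(limit: int, remaining: int, reset_in_seconds: float) -> dict[str, str]:
    """Build the headers that accompany an HTTP 429 response."""
    return {
        # Round up so clients never retry slightly too early.
        "Retry-After": str(max(0, math.ceil(reset_in_seconds))),
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
    }
```

For example, `build_429_headers(100, 0, 12.3)` tells the client its limit is 100, nothing remains, and it should wait 13 seconds.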
The Five Algorithms: Token Bucket vs Sliding Window vs the Rest
The Decision Engine can use several different algorithms, and the one you pick has a profound effect on user experience and system complexity. Here's how each one thinks.
Fixed Window splits time into hard buckets, e.g. the counter resets to zero every minute. It's simple, but has a nasty edge case: with a limit of 10 per minute, a client can fire 10 requests at 12:00:59 and another 10 at 12:01:01, effectively getting 20 requests in 2 seconds. This is called the boundary burst problem.
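The boundary burst is easy to reproduce with a small in-memory sketch. The class below is illustrative only (single process, no eviction of old windows, time passed in explicitly so the behaviour is deterministic); its names are hypothetical.

```python
class FixedWindowLimiter:
    """In-memory fixed window counter; a sketch, not production code."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        # Maps (client_id, window index) to a count. Old windows are never
        # evicted here; a real limiter would expire them.
        self.counts: dict[tuple[str, int], int] = {}

    def allow(self, client_id: str, now: float) -> bool:
        window = int(now // self.window_seconds)
        key = (client_id, window)
        count = self.counts.get(key, 0)
        if count >= self.limit:
            return False
        self.counts[key] = count + 1
        return True


limiter = FixedWindowLimiter(limit=10, window_seconds=60)
# 10 requests at t=59s (end of window 0), then 10 more at t=61s (start of window 1).
burst = [limiter.allow("user_42", 59.0) for _ in range(10)]
burst += [limiter.allow("user_42", 61.0) for _ in range(10)]
```

All 20 entries in `burst` are `True`: the limit is 10 per minute, yet 20 requests pass within two seconds.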
Sliding Window Log fixes this by storing a timestamp for every request and counting only those within the last N seconds. Accurate, but expensive in memory — you're storing one record per request.
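A sketch of the log approach under the same assumptions (in-memory, single process, explicit timestamps, hypothetical names). Each client keeps a deque of accepted request times, and entries older than the window are dropped before counting.

```python
from collections import defaultdict, deque


class SlidingWindowLog:
    """Keeps one timestamp per accepted request; exact but memory-hungry."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs: dict[str, deque] = defaultdict(deque)

    def allow(self, client_id: str, now: float) -> bool:
        log = self.logs[client_id]
        cutoff = now - self.window_seconds
        # Evict timestamps that have fallen out of the window.
        while log and log[0] <= cutoff:
            log.popleft()
        if len(log) >= self.limit:
            return False
        log.append(now)
        return True
```

With a limit of 10 per 60 seconds, ten requests at t=59 followed by one at t=61 results in a rejection, because all ten earlier timestamps are still inside the window.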
Sliding Window Counter is the pragmatic middle ground: it blends the previous window's count with the current one using a weighted average. Much lighter on memory, still smooth.
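One common way to write the weighted average is `estimate = previous * (1 - elapsed_fraction) + current`, where `elapsed_fraction` is how far we are into the current window. The sketch below tracks a single client for brevity and assumes requests in the previous window were evenly spread; the class and attribute names are hypothetical.

```python
class SlidingWindowCounter:
    """Two-counter approximation of a sliding window (single client)."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_index = 0
        self.current = 0
        self.previous = 0

    def allow(self, now: float) -> bool:
        index = int(now // self.window_seconds)
        if index != self.window_index:
            # The old count only carries over if the windows are adjacent.
            self.previous = self.current if index == self.window_index + 1 else 0
            self.current = 0
            self.window_index = index
        elapsed_fraction = (now % self.window_seconds) / self.window_seconds
        estimated = self.previous * (1.0 - elapsed_fraction) + self.current
        if estimated >= self.limit:
            return False
        self.current += 1
        return True
```

Replaying the boundary burst (10 requests at t=59, then more at t=61) lets only one extra request through instead of ten, since the previous window still contributes about 9.83 to the estimate.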
Token Bucket is what most major APIs (Stripe, GitHub, AWS) actually use. Think of it as a bucket that refills at a steady rate — say, 2 tokens per second up to a max of 10. Each request costs one token. This naturally allows short bursts (drain the bucket) while enforcing a long-term average rate (the refill rate). It's the most user-friendly algorithm because it doesn't punish bursty-but-reasonable traffic.
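A minimal Token Bucket sketch using the numbers above (capacity 10, refill 2 tokens per second). It is in-memory, single client, and takes the current time as an argument; the names are hypothetical, and this is not a claim about how any of the APIs mentioned implement it.

```python
class TokenBucket:
    """Lazy-refill token bucket: tokens are topped up on each call."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start with a full bucket
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        if now > self.last_refill:
            elapsed = now - self.last_refill
            # Add tokens for the elapsed time, never exceeding capacity.
            self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
            self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A full bucket absorbs a burst of 10 at once; after that, one second of waiting buys exactly 2 more requests.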
Leaky Bucket is Token Bucket's stricter cousin. Requests go into a queue (the bucket) and are processed at a fixed rate. Excess requests overflow and are dropped. Use it when you need perfectly smooth output, like protecting a downstream service that can't handle any spikes.
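The queue-based variant described above might be sketched as follows. This is a simulation with explicit timestamps: `offer` enqueues or drops, and `drain` returns the requests that have become due at the fixed rate. A real implementation would run the drain from a worker or timer; all names here are hypothetical.

```python
from collections import deque


class LeakyBucket:
    """Bounded queue drained at a fixed rate; overflow is dropped."""

    def __init__(self, capacity: int, leak_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.leak_per_second = leak_per_second
        self.queue: deque = deque()
        self.last_leak = now

    def offer(self, request_id, now: float) -> bool:
        if not self.queue:
            # Restart the leak schedule so idle time cannot be "saved up".
            self.last_leak = now
        if len(self.queue) >= self.capacity:
            return False  # bucket is full: drop the request
        self.queue.append(request_id)
        return True

    def drain(self, now: float) -> list:
        # Whole number of requests that became due since the last leak.
        due = int((now - self.last_leak) * self.leak_per_second)
        if due <= 0:
            return []
        released = [self.queue.popleft() for _ in range(min(due, len(self.queue)))]
        self.last_leak += due / self.leak_per_second
        return released
```

With capacity 5 and a leak rate of 2 per second, seven simultaneous requests see five queued and two dropped, and the queue then empties two requests per second regardless of how bursty the input was.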
All Five Algorithms Compared: Pros, Cons, and Best Fit
Choosing the right algorithm depends on your traffic patterns, accuracy needs, and operational cost. Here's a breakdown of each algorithm with explicit pros and cons.
### Fixed Window
- Pros: Extremely simple to implement, low memory (one counter per client per window), no need for timestamps beyond window start.
- Cons: Suffers from the boundary burst problem: clients can double their rate at window edges. Not suitable for any use case requiring smooth or fair distribution.
- Best fit: Internal tools, low-stakes APIs where occasional bursts are acceptable.

### Sliding Window Log
- Pros: Perfectly accurate: counts exactly how many requests occurred in the last N seconds. No boundary burst. Ideal for audit-trail requirements.
- Cons: Memory-inefficient: stores a timestamp per request. For high-traffic clients, this can become expensive. O(n) memory per client.
- Best fit: Security rate limiting, financial systems, or any scenario where you need an exact log.

### Sliding Window Counter
- Pros: Excellent trade-off between accuracy and memory. Uses two counters (current and previous window) and a weighted average. Approximation error is small (typically <1%).
- Cons: Slightly more complex than Fixed Window. Not perfectly exact.
- Best fit: Most production APIs. Delivers smooth limiting without the memory cost of sliding window log.

### Token Bucket
- Pros: Allows natural bursts up to capacity, then enforces a steady-state average via refill rate. Intuitive parameters map to product limits. Widely used by major APIs (Stripe, GitHub, AWS).
- Cons: Refill rate and capacity must be carefully tuned: too aggressive starves legitimate bursts, too lenient defeats the limit. Not exact, since a burst can temporarily exceed the average.
- Best fit: User-facing APIs, SDKs, any traffic pattern with natural bursts (app startup, pagination).

### Leaky Bucket
- Pros: Produces perfectly smooth output: requests are processed at a fixed rate. Protects fragile downstream services from any spike.
- Cons: No burst capability: all requests are queued, which can add latency. If the queue fills, requests are dropped without buffering.
- Best fit: Protecting databases, payment gateways, or any downstream service that cannot tolerate sudden load increases.
Visualizing Token Bucket and Leaky Bucket
Understanding how these two algorithms work is easier with a visual. The diagram below shows the core flow of Token Bucket (allowing bursts up to capacity) and Leaky Bucket (smoothing output to a fixed rate).
Algorithm Selection Guide: When to Use Each Rate Limiting Algorithm
Choosing among the five algorithms comes down to three questions: Do you need exact counts? Does your traffic come in bursts? Is your downstream service fragile?
Use Fixed Window when:
- You need a quick, low-overhead limiter for internal tools or non-critical paths.
- You can tolerate occasional double-bursts at window boundaries.
- Memory is tight and you have many clients.

Use Sliding Window Log when:
- Every single request must be audited with perfect accuracy (e.g., billing, security).
- You have the memory budget to store a timestamp per request.
- Boundary bursts are unacceptable.

Use Sliding Window Counter when:
- You want the best accuracy-to-cost ratio.
- You can accept a tiny approximation error (~1%).
- You are building the default rate limiter for most API endpoints.

Use Token Bucket when:
- You are building a user-facing API where bursts are natural (app startup, refresh, pagination).
- You want simple parameters that map to product limits (e.g., "burst up to 10 requests, then 1 per second").
- You follow the pattern of Stripe, GitHub, and AWS.

Use Leaky Bucket when:
- You need to protect a fragile downstream service that cannot tolerate any spikes (e.g., a legacy database, a payment gateway).
- Perfectly smooth output is more important than allowing bursts.
- You accept that a full queue causes immediate request drops.
Distributed Rate Limiting: Why Redis Is the State Store of Choice
Everything above works perfectly on a single server. The moment you have two servers behind a load balancer, you have a problem: each server has its own in-memory counter. If user_42 hits server A for 8 requests and server B for 8 requests, both servers think the limit hasn't been hit — but the user actually sent 16 requests. Your rate limiter is broken.
The fix is a shared, atomic state store, and Redis is the industry-standard answer. Redis executes commands on a single thread, so each individual command is atomic. The INCR command increments a key and returns the new value in one atomic operation. Pair that with EXPIRE to auto-delete the key when the window ends, and you have a distributed-safe counter in two Redis commands. One caveat: INCR and EXPIRE are atomic individually, not as a pair, so run them together in a Lua script or a MULTI/EXEC transaction. Otherwise a crash between the two calls leaves a counter that never expires.
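One way to keep INCR and EXPIRE together is a short Lua script, which Redis runs as a single atomic unit. The sketch below assumes `client` is a redis-py `redis.Redis` instance (its `eval(script, numkeys, *keys_and_args)` method is used); the key prefix and function name are hypothetical.

```python
# Lua runs atomically inside Redis, so a crash cannot separate INCR from EXPIRE.
FIXED_WINDOW_LUA = """
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
"""


def allow_request(client, client_id: str, limit: int, window_seconds: int) -> bool:
    """Fixed window check backed by Redis; `client` is assumed to be redis.Redis."""
    key = f"ratelimit:{client_id}"  # hypothetical key naming scheme
    count = client.eval(FIXED_WINDOW_LUA, 1, key, window_seconds)
    return int(count) <= limit
```

The TTL is set only when the counter is created, so the window starts at the first request and the key cleans itself up afterwards.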
For a sliding window in Redis, the pattern is slightly more sophisticated, and it is the log variant rather than the counter: use a sorted set (ZADD) where each member is a request timestamp, then use ZCOUNT to count members in the last N seconds and ZREMRANGEBYSCORE to evict old entries. This gives you exact counts without the boundary burst problem of fixed windows, at the cost of one sorted-set entry per request.
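A sketch of the sorted-set pattern, again assuming `client` is a redis-py `redis.Redis` instance. It evicts first and then uses ZCARD, which is equivalent to ZCOUNT over the live window once old entries are gone. The key prefix and function name are hypothetical, and note one trade-off of this ordering: rejected requests are still recorded, so a client that keeps hammering stays limited.

```python
import uuid


def allow_request_log(client, client_id: str, limit: int,
                      window_seconds: int, now: float) -> bool:
    """Sliding window log on a Redis sorted set; `client` is assumed to be redis.Redis."""
    key = f"ratelimit:log:{client_id}"  # hypothetical key naming scheme
    # Unique member so two requests with the same timestamp do not collide.
    member = f"{now}:{uuid.uuid4().hex}"
    pipe = client.pipeline()  # redis-py wraps this in MULTI/EXEC by default
    pipe.zremrangebyscore(key, 0, now - window_seconds)  # evict old entries
    pipe.zadd(key, {member: now})                        # record this request
    pipe.zcard(key)                                      # count what remains
    pipe.expire(key, window_seconds)                     # clean up idle clients
    _, _, count, _ = pipe.execute()
    return count <= limit
```

Passing `now` in explicitly keeps the example deterministic; in practice you would use a single time source so that application servers with skewed clocks do not disagree about the window.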
The critical trade-off: Redis adds a network round-trip (typically 0.1–2ms) to every single request decision. For most APIs that's fine. For ultra-low-latency scenarios (sub-5ms response time targets), consider a two-layer approach: a local in-memory limiter for coarse-grained fast rejection, with Redis as the authoritative source for precise enforcement.
When Your Rate Limiter Breaks: Debugging Common Production Failures
Rate limiters fail in subtle ways. The most common? Silent degradation where the limiter lets too many requests through because Redis was partitioned and each node counted independently. Another classic: a client gets 429s despite being within limits because the clock on the rate limiter server is skewed a few minutes ahead — the window appears full when it shouldn't be.
False positives happen when the limiting key is too coarse (e.g., IP-based limits inside a corporate NAT). You'll see a flood of support tickets from one company whose employees can't access your API. False negatives happen when Redis goes down and the fallback allow logic kicks in without any alert, so your service gets hammered.
Debugging these requires visibility. Log every rate limit decision (ALLOW, REJECT, reason) with client ID, endpoint, and time. Use structured logs. If you can't see which decision your limiter made, you're flying blind.
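A structured decision log can be as simple as one JSON object per decision. The field names below are assumptions for illustration, not a standard schema; adapt them to whatever your log pipeline expects.

```python
import json
import time
from typing import Optional


def decision_log_line(client_id: str, endpoint: str, verdict: str,
                      reason: str, now: Optional[float] = None) -> str:
    """Serialize one rate limit decision as a single JSON log line."""
    record = {
        "ts": time.time() if now is None else now,
        "event": "rate_limit_decision",
        "client_id": client_id,
        "endpoint": endpoint,
        "verdict": verdict,  # ALLOW, THROTTLE, or REJECT
        "reason": reason,
    }
    return json.dumps(record, sort_keys=True)
```

Because every line carries the client ID and the reason, a query such as "all REJECT decisions for one client in the last hour" becomes a simple filter rather than a grep through free-form text.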
Another trap: burst smoothing. A naive Token Bucket might allow a 5x burst in one second, which downstream databases hate. Always add a secondary per-second limiter before the database layer.
Where to Place Rate Limiters: API Gateway, Application Layer, or Load Balancer?
Rate limiters can live at different layers of your infrastructure. Each placement has trade-offs in terms of flexibility, latency, and management overhead.
### API Gateway (e.g., Kong, AWS API Gateway, Apigee)
- Pros: Centralized configuration: change limits without touching application code. Often includes built-in support for token bucket or fixed window. Easiest to maintain across many services.
- Cons: Limited algorithm choices, since most gateways only offer fixed window or basic token bucket. Can become a bottleneck if not scaled. Adds unavoidable latency (typically 1-5ms).
- Best for: Teams with multiple microservices that need uniform rate limit policies. Also good for external-facing APIs where a separate gateway already exists.

### Application Layer (within your service code)
- Pros: Full control: you can use any algorithm, tie limits to business logic (e.g., different limits per user role), and implement per-endpoint cost-based limits. Lowest latency (in-memory counters).
- Cons: Requires implementing and maintaining the rate limiter in each service. Harder to enforce cross-service limits (e.g., total requests across all endpoints). Code duplication risk.
- Best for: Services with custom or complex rate limit needs. Also for latency-sensitive endpoints where an extra network hop is unacceptable.

### Load Balancer (e.g., Nginx, HAProxy, AWS ALB)
- Pros: Very low overhead, because load balancers are optimized for fast packet processing. Can rate-limit before requests even reach your application, providing a hard outer defense. Some support connection limiting and request rate limiting.
- Cons: Limited to simple algorithms (for example, Nginx's limit_req uses a leaky bucket; others offer fixed window or connection-based limits). Cannot inspect request bodies or custom headers deeply. Difficult to implement per-user or per-tier limits (load balancers typically operate at IP level).
- Best for: First line of defense against DDoS or brute-force attacks. Use for coarse-grained IP-based limits before applying finer-grained limits in the application.

### Recommendation: Layered Approach
Start with a load balancer for IP-based rate limiting (protect against DDoS). Add an API Gateway for tenant-based limits if you have multiple services. Then implement application-layer rate limiting with Token Bucket for user-specific logic. This gives you defense in depth.
Black Friday Rate Limit Misconfiguration Caused 30-Minute API Outage
GET /health/rate-limiter endpoint that exposed current rules per tier. The monitoring team added a Prometheus alert on per-node rejection rates >1%.
- A hot-reloadable rule store can diverge across nodes, so always version your rules and validate they match across all instances.
- Rolling updates of rate limit rules should be gradual, with automated rollback on anomaly detection.
- Add a health endpoint that returns current limits; it saves hours during post-mortem analysis.
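One possible way to act on these lessons is to fingerprint each node's loaded rules and compare the fingerprints, for instance from the values a health endpoint reports. This is a sketch under assumptions: the rule dictionaries, node names, and helper functions are hypothetical, and treating the majority fingerprint as "expected" is a simplification (a partial rollout like 3 of 10 nodes would flag the 3 updated nodes).

```python
import hashlib
import json
from collections import Counter


def rules_fingerprint(rules: dict) -> str:
    """Stable short hash of a rule set; key order does not affect it."""
    canonical = json.dumps(rules, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def find_drifted_nodes(fingerprints_by_node: dict[str, str]) -> list[str]:
    """Return nodes whose fingerprint differs from the most common one."""
    if not fingerprints_by_node:
        return []
    expected, _ = Counter(fingerprints_by_node.values()).most_common(1)[0]
    return sorted(node for node, fp in fingerprints_by_node.items() if fp != expected)
```

An empty result means every node reports the same rules; anything else is a signal to halt the rollout or roll back.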
X-RateLimit-Limit and X-RateLimit-Remaining headers on the rejected request. Verify the client ID being used for limiting (might be IP instead of user ID). Compare across multiple nodes to detect configuration drift.
ALLOW, you’re unthrottled. Inspect Redis key TTLs: if EXPIRE didn't set correctly due to pipeline error, counters persist forever. Also check clock skew (NTP), because a skewed clock shifts windows.