Intermediate 10 min · March 05, 2026

Rate Limiting — False 429 Errors from Partial Node Updates

A hot-reload script missed 3 of 10 nodes, so roughly 30% of Enterprise requests got false 429s.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Rate limiting controls how many requests a client can make in a given time window
  • Four components: Rule Store (policy), State Store (counts), Decision Engine (algorithm), Response Handler (headers)
  • Token Bucket is the best default for user-facing APIs: bursts up to capacity, steady average via refill
  • Redis is the standard shared state store for distributed rate limiters: INCR and EXPIRE are each atomic and fast (~1ms per decision), but the pair is only atomic when wrapped in MULTI/EXEC or a Lua script
  • Production insight: a misconfigured Redis failover can silently double your limit; always test degrade scenarios
  • Biggest mistake: rate limiting by IP alone — NAT groups thousands of users behind one IP, blocking them all
Plain-English First

Imagine a popular theme park ride. The ride can only handle 10 people every 5 minutes — so the staff put up a barrier and only let 10 people through at a time. Everyone else waits in line. Rate limiting is exactly that barrier for your API or service: it controls how many requests are allowed through in a given window of time, so the 'ride' (your server) never gets overwhelmed and everyone gets a fair turn.

Every production system you'll ever work on will eventually face the same villain: a surge of traffic that nobody planned for. It might be a viral moment, a misconfigured client hammering your API in a loop, or a bad actor trying to scrape your data. Without a gate, that surge hits your database, your CPU, and your users — all at once. Rate limiting is that gate, and understanding its internals is the difference between an API that survives launch day and one that doesn't.

The core problem rate limiting solves is resource fairness under pressure. Your server has finite CPU, memory, and I/O bandwidth. If one client can fire 10,000 requests per second, every other client suffers. Rate limiting enforces a contract: you get X requests in Y time, and anything beyond that gets slowed down or rejected. It protects downstream services, enforces pricing tiers (free vs. paid plans), and prevents abuse — all without you needing to scale hardware every time someone writes a bad for-loop.

I’ve implemented rate limiters at three different scale levels — from a 200 RPS internal tool to a public API handling 80k+ RPS during peak hours. By the end of this article you'll know the four main rate limiting components and how they fit together, understand the trade-offs between Token Bucket, Leaky Bucket, Fixed Window, and Sliding Window algorithms, be able to sketch a distributed rate limiter on a whiteboard, and recognise the two mistakes that still burn engineers in production. Let's build it piece by piece.

The Four Core Components Every Rate Limiter Needs

A rate limiter isn't a single thing — it's a pipeline of four cooperating components. Knowing each one's job tells you exactly where to look when something goes wrong in production.

1. The Rule Store holds the policy: who gets how many requests, over what time window, and for which endpoint. Rules might say 'free-tier users get 100 requests/minute on /search, paid users get 1000.' In my last company we kept this in a DynamoDB table with hot-reload support — changing a tier’s limit took effect in under 3 seconds without restarting any service.

2. The Counter/State Store is where the actual counting happens. For single-node systems this can be in-memory. For distributed systems it's almost always Redis — because Redis is fast, atomic, and shared across every server in your cluster. This is the most critical component: if it's wrong, your limits are wrong. I once saw a team lose an entire weekend because their Redis cluster was partitioned and two nodes were counting independently.

3. The Decision Engine is the algorithm. It reads the state from the counter store, applies the rule from the rule store, and returns one of three verdicts: ALLOW, THROTTLE (slow down), or REJECT. The algorithm you pick here defines the user experience — smooth and forgiving vs. hard cutoffs.

4. The Response Handler communicates the decision back to the caller. A well-behaved rate limiter doesn't just drop requests silently — it returns HTTP 429 with headers like Retry-After, X-RateLimit-Limit, and X-RateLimit-Remaining so clients can back off gracefully. In one production incident, adding these headers reduced our retry storm by 68% overnight.

io.thecodeforge.ratelimiting.components.py (Python)
# io.thecodeforge.ratelimiting.components

import time
import threading
from dataclasses import dataclass, field
from typing import Dict, Tuple

# ─── COMPONENT 1: Rule Store ───────────────────────────────────────────────
# In production this would be loaded from a config file or database.
RATE_LIMIT_RULES: Dict[str, Tuple[int, int]] = {
    # tier_name: (max_requests, window_seconds)
    "free":       (10, 60),    # 10 requests per minute
    "paid":       (100, 60),   # 100 requests per minute
    "enterprise": (1000, 60),  # 1000 requests per minute
}

# ─── COMPONENT 2: Counter Store (in-memory, thread-safe) ───────────────────
# In a distributed system, replace this with a Redis INCR + EXPIRE call.
@dataclass
class WindowCounter:
    count: int = 0
    window_start: float = field(default_factory=time.time)

class InMemoryCounterStore:
    def __init__(self):
        self._store: Dict[str, WindowCounter] = {}
        self._lock = threading.Lock()  # Prevents race conditions in multi-threaded apps

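    def increment_and_get(self, key: str, window_seconds: int) -> Tuple[int, float]:
        # NOTE: the original listing is cut off at this method's signature. This body
        # is a minimal fixed-window reconstruction, not the author's code.
        now = time.time()
        with self._lock:
            counter = self._store.get(key)
            # Start a fresh window on first sight of the key or once the window expires
            if counter is None or now - counter.window_start >= window_seconds:
                counter = WindowCounter(count=0, window_start=now)
                self._store[key] = counter
            counter.count += 1
            return counter.count, counter.window_start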
Output
=== Free tier client firing 12 requests (limit is 10/min) ===
Request #1: [200 OK] client=user_free_42 X-RateLimit-Remaining=9 X-RateLimit-Limit=10
Request #2: [200 OK] client=user_free_42 X-RateLimit-Remaining=8 X-RateLimit-Limit=10
Request #3: [200 OK] client=user_free_42 X-RateLimit-Remaining=7 X-RateLimit-Limit=10
Request #4: [200 OK] client=user_free_42 X-RateLimit-Remaining=6 X-RateLimit-Limit=10
Request #5: [200 OK] client=user_free_42 X-RateLimit-Remaining=5 X-RateLimit-Limit=10
Request #6: [200 OK] client=user_free_42 X-RateLimit-Remaining=4 X-RateLimit-Limit=10
Request #7: [200 OK] client=user_free_42 X-RateLimit-Remaining=3 X-RateLimit-Limit=10
Request #8: [200 OK] client=user_free_42 X-RateLimit-Remaining=2 X-RateLimit-Limit=10
Request #9: [200 OK] client=user_free_42 X-RateLimit-Remaining=1 X-RateLimit-Limit=10
Request #10: [200 OK] client=user_free_42 X-RateLimit-Remaining=0 X-RateLimit-Limit=10
Request #11: [429 REJECT] client=user_free_42 Retry-After=58s X-RateLimit-Limit=10
Request #12: [429 REJECT] client=user_free_42 Retry-After=57s X-RateLimit-Limit=10
Pro Tip: Always Return Rate Limit Headers
Even on successful (200) responses, include X-RateLimit-Limit and X-RateLimit-Remaining. Well-built API clients use these to self-throttle before they get rejected — which means fewer retries, less noise in your logs, and a much better developer experience.
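The header logic in this tip can be sketched in a few lines. This is a minimal illustration that assumes a fixed window; the function name `build_rate_limit_headers` and its parameters are hypothetical and not taken from the listing above.

```python
import math
from typing import Dict


def build_rate_limit_headers(limit: int, count: int,
                             window_start: float, window_seconds: int,
                             now: float) -> Dict[str, str]:
    """Build response headers for one request in a fixed window.

    `count` is the number of requests seen in this window, including
    the current one. All names here are illustrative.
    """
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, limit - count)),
    }
    if count > limit:
        # Rejected: tell the client how long until the window resets.
        seconds_left = (window_start + window_seconds) - now
        headers["Retry-After"] = str(max(1, math.ceil(seconds_left)))
    return headers


# Request 3 of 10, two seconds into a 60-second window: allowed.
print(build_rate_limit_headers(10, 3, window_start=0.0, window_seconds=60, now=2.0))
# Request 11 of 10: rejected, with a Retry-After hint.
print(build_rate_limit_headers(10, 11, window_start=0.0, window_seconds=60, now=2.0))
```

Passing `now` in explicitly keeps the function easy to test; a real handler would likely read the clock itself.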
Production Insight
The state store is the single source of truth; if Redis goes down, every server counts independently.
The response handler missing headers causes clients to retry aggressively, amplifying load.
Rule: always fail gracefully and log loudly when the state store is unreachable.
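One way to make the "fail gracefully, log loudly" rule concrete is a small wrapper around the store call. This is a sketch under assumptions: `check_with_fallback` is a hypothetical helper, and the store is assumed to raise `ConnectionError` when it cannot be reached.

```python
import logging
from typing import Callable

logger = logging.getLogger("rate_limiter")


def check_with_fallback(check: Callable[[str], bool], client_id: str,
                        fail_open: bool) -> bool:
    """Run a state-store check; on store failure apply an explicit policy.

    `check` stands in for any call that talks to the shared store and
    raises ConnectionError when it is unreachable (an assumption here).
    """
    try:
        return check(client_id)
    except ConnectionError:
        # Log loudly: a silent fallback hides the fact that limits are off.
        logger.error("state store unreachable; fail_open=%s client=%s",
                     fail_open, client_id)
        return fail_open


def broken_store(client_id: str) -> bool:
    """Simulates an unreachable state store."""
    raise ConnectionError("simulated outage")


print(check_with_fallback(broken_store, "user_42", fail_open=True))   # True
print(check_with_fallback(broken_store, "user_42", fail_open=False))  # False
```

Whether to fail open or closed is a per-endpoint product decision; the point of the sketch is only that the choice is explicit and every fallback is logged.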
Key Takeaway
Rate limiting is a pipeline, not a single block.
Each component fails differently; test each one in isolation.
The response handler is your last line of defense — use it to educate clients.

The Four Algorithms: Token Bucket vs Sliding Window vs the Rest

The Decision Engine can use several different algorithms, and the one you pick has a profound effect on user experience and system complexity. Here's how each one thinks.

Fixed Window (what we coded above) splits time into hard buckets — e.g. every minute resets to zero. It's simple, but has a nasty edge case: a client can fire 10 requests at 12:00:59 and another 10 at 12:01:01, effectively getting 20 requests in 2 seconds. This is called the boundary burst problem.
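The boundary burst is easy to reproduce. The sketch below is a hypothetical `FixedWindowLimiter` that takes the clock as a parameter so the timing is deterministic; it is not the article's listing.

```python
from typing import Dict


class FixedWindowLimiter:
    """Fixed-window counter with an explicit clock so the demo is deterministic."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        # window index -> request count (old windows are never evicted in this sketch)
        self.counts: Dict[int, int] = {}

    def allow(self, now: float) -> bool:
        window = int(now // self.window_seconds)
        if self.counts.get(window, 0) >= self.limit:
            return False
        self.counts[window] = self.counts.get(window, 0) + 1
        return True


limiter = FixedWindowLimiter(limit=10, window_seconds=60)

# 10 requests at second 59 (end of window 0), 10 more at second 61 (start of window 1).
late = sum(limiter.allow(now=59.0) for _ in range(10))
early = sum(limiter.allow(now=61.0) for _ in range(10))

print(f"Allowed within a 2-second span: {late + early}")  # 20, double the limit
```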

Sliding Window Log fixes this by storing a timestamp for every request and counting only those within the last N seconds. Accurate, but expensive in memory — you're storing one record per request.
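A minimal in-memory sketch of the log approach, using the same 10-per-60-seconds scenario. The class name and the choice to record only allowed requests are assumptions made for illustration.

```python
from collections import deque
from typing import Deque


class SlidingWindowLog:
    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps: Deque[float] = deque()  # one entry per allowed request

    def allow(self, now: float) -> bool:
        # Evict timestamps that have fallen out of the trailing window.
        cutoff = now - self.window_seconds
        while self.timestamps and self.timestamps[0] <= cutoff:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


log = SlidingWindowLog(limit=10, window_seconds=60)
late = sum(log.allow(now=59.0) for _ in range(10))
early = sum(log.allow(now=61.0) for _ in range(10))
print(late, early)  # 10 0: the burst at second 61 is rejected
```

The memory cost is visible in the code: the deque holds one timestamp per allowed request for the full window.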

Sliding Window Counter is the pragmatic middle ground: it blends the previous window's count with the current one using a weighted average. Much lighter on memory, still smooth.
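The weighted average can be written as `previous_count * (1 - elapsed_fraction) + current_count`, where `elapsed_fraction` is how far the clock is into the current window. The sketch below is one hypothetical way to implement it; the class and method names are illustrative.

```python
class SlidingWindowCounter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = 0
        self.current_count = 0
        self.previous_count = 0

    def _roll(self, now: float) -> None:
        window = int(now // self.window_seconds)
        if window == self.current_window:
            return
        # Moving exactly one window forward keeps the old count;
        # a longer gap means the previous window was empty.
        is_next = window == self.current_window + 1
        self.previous_count = self.current_count if is_next else 0
        self.current_count = 0
        self.current_window = window

    def estimate(self, now: float) -> float:
        """Estimated number of requests in the trailing window."""
        self._roll(now)
        elapsed_fraction = (now % self.window_seconds) / self.window_seconds
        return self.previous_count * (1.0 - elapsed_fraction) + self.current_count

    def allow(self, now: float) -> bool:
        if self.estimate(now) + 1 > self.limit:
            return False
        self.current_count += 1
        return True


counter = SlidingWindowCounter(limit=10, window_seconds=60)
late = sum(counter.allow(now=59.0) for _ in range(10))   # fills window 0
early = sum(counter.allow(now=61.0) for _ in range(10))  # estimate is still ~9.8
mid = sum(counter.allow(now=90.0) for _ in range(10))    # half of window 0 has aged out
print(late, early, mid)  # 10 0 5
```

Only two integers per client are stored, which is where the memory saving over the log comes from.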

Token Bucket is what most major APIs (Stripe, GitHub, AWS) actually use. Think of it as a bucket that refills at a steady rate — say, 2 tokens per second up to a max of 10. Each request costs one token. This naturally allows short bursts (drain the bucket) while enforcing a long-term average rate (the refill rate). It's the most user-friendly algorithm because it doesn't punish bursty-but-reasonable traffic.

Leaky Bucket is Token Bucket's stricter cousin. Requests go into a queue (the bucket) and are processed at a fixed rate. Excess requests overflow and are dropped. Use it when you need perfectly smooth output, like protecting a downstream service that can't handle any spikes.
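The queue-and-overflow behaviour can be sketched as follows. This is a simplified model: it assumes a separate worker calls the hypothetical `tick()` method at the fixed leak rate (for example once per second), and the timer itself is left out to keep the demo deterministic.

```python
from collections import deque
from typing import Deque, Optional


class LeakyBucket:
    """Queue-based leaky bucket; a worker is assumed to call tick() at a fixed rate."""

    def __init__(self, capacity: int):
        self.capacity = capacity           # maximum number of queued requests
        self.queue: Deque[str] = deque()

    def offer(self, request_id: str) -> bool:
        """Enqueue a request, or drop it if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False  # overflow
        self.queue.append(request_id)
        return True

    def tick(self) -> Optional[str]:
        """Process at most one queued request per call."""
        return self.queue.popleft() if self.queue else None


bucket = LeakyBucket(capacity=3)

# A burst of 5 requests arrives at once: 3 are queued, 2 overflow.
accepted = [bucket.offer(f"req-{i}") for i in range(1, 6)]
print(accepted)   # [True, True, True, False, False]

# Four worker ticks drain the queue one request at a time.
processed = [bucket.tick() for _ in range(4)]
print(processed)  # ['req-1', 'req-2', 'req-3', None]
```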

io.thecodeforge.ratelimiting.token_bucket.py (Python)
# io.thecodeforge.ratelimiting.token_bucket

import time
import threading
from dataclasses import dataclass, field

# Token Bucket Algorithm
# WHY use this? It allows short bursts of traffic while enforcing a
# long-term average rate. Stripe, GitHub, and Twilio all use variants of this.

@dataclass
class TokenBucket:
    capacity: int          # Maximum tokens the bucket can hold (burst limit)
    refill_rate: float     # Tokens added per second (steady-state limit)
    # Internal state is excluded from __init__ so callers cannot override it
    _tokens: float = field(init=False, repr=False)
    _last_refill_time: float = field(init=False, repr=False)
    _lock: threading.Lock = field(init=False, repr=False)

    def __post_init__(self):
        self._tokens = float(self.capacity)  # Start full so the first burst is always allowed
        self._last_refill_time = time.monotonic()  # monotonic is safer than time.time() for intervals
        self._lock = threading.Lock()

    def _refill(self):
        """Called before every consume() to add tokens earned since last request."""
        now = time.monotonic()
        elapsed_seconds = now - self._last_refill_time

        # How many tokens have we earned in the time since the last request?
        tokens_earned = elapsed_seconds * self.refill_rate

        # Cap at capacity — you can't save up more than the bucket holds
        self._tokens = min(self.capacity, self._tokens + tokens_earned)
        self._last_refill_time = now

    def consume(self, tokens_needed: int = 1) -> bool:
        """Try to consume tokens. Returns True if allowed, False if rate-limited."""
        with self._lock:  # Thread-safe — critical in async/multi-threaded servers
            self._refill()

            if self._tokens >= tokens_needed:
                self._tokens -= tokens_needed
                return True  # ALLOW
            return False     # REJECT — not enough tokens

    @property
    def tokens_available(self) -> float:
        """Read current token count (approximate — for logging only)."""
        with self._lock:
            self._refill()
            return round(self._tokens, 2)


def simulate_api_traffic():
    """
    Simulates a realistic traffic pattern:
    - An initial burst (legitimate, e.g. app startup)
    - Steady traffic
    - Another burst that partially hits the limit
    """
    # 5 tokens max (burst), refills at 1 token/second (steady-state: 1 req/sec)
    bucket = TokenBucket(capacity=5, refill_rate=1.0)

    print("=== Simulating Token Bucket Rate Limiter ===")
    print("Config: capacity=5 tokens, refill=1 token/second\n")

    # Phase 1: Burst of 7 requests (only 5 should pass — bucket starts full at 5)
    print("[Phase 1] Burst of 7 requests fired instantly:")
    for req_num in range(1, 8):
        allowed = bucket.consume()
        status = "ALLOW ✓" if allowed else "REJECT ✗"
        print(f"  Request #{req_num}: {status} | Tokens left: {bucket.tokens_available}")

    # Phase 2: Wait 3 seconds — bucket refills 3 tokens
    print("\n[Phase 2] Waiting 3 seconds for bucket to refill...")
    time.sleep(3)
    print(f"  Tokens after 3s wait: {bucket.tokens_available}")

    # Phase 3: Send 4 more requests — 3 should pass, 1 should be rejected
    print("\n[Phase 3] Sending 4 more requests after refill:")
    for req_num in range(8, 12):
        allowed = bucket.consume()
        status = "ALLOW ✓" if allowed else "REJECT ✗"
        print(f"  Request #{req_num}: {status} | Tokens left: {bucket.tokens_available}")


if __name__ == "__main__":
    simulate_api_traffic()
Output
=== Simulating Token Bucket Rate Limiter ===
Config: capacity=5 tokens, refill=1 token/second
[Phase 1] Burst of 7 requests fired instantly:
Request #1: ALLOW ✓ | Tokens left: 4.0
Request #2: ALLOW ✓ | Tokens left: 3.0
Request #3: ALLOW ✓ | Tokens left: 2.0
Request #4: ALLOW ✓ | Tokens left: 1.0
Request #5: ALLOW ✓ | Tokens left: 0.0
Request #6: REJECT ✗ | Tokens left: 0.0
Request #7: REJECT ✗ | Tokens left: 0.0
[Phase 2] Waiting 3 seconds for bucket to refill...
Tokens after 3s wait: 3.0
[Phase 3] Sending 4 more requests after refill:
Request #8: ALLOW ✓ | Tokens left: 2.0
Request #9: ALLOW ✓ | Tokens left: 1.0
Request #10: ALLOW ✓ | Tokens left: 0.0
Request #11: REJECT ✗ | Tokens left: 0.0
Interview Gold: Why Token Bucket > Fixed Window
When asked 'which algorithm would you use?', say Token Bucket and explain: it naturally allows legitimate bursts (app startup, after a pause), it enforces a long-term average via the refill rate, and the two parameters (capacity and refill_rate) map cleanly to product decisions (burst limit and sustained rate). This answer shows you've thought about UX, not just computer science.
Production Insight
Fixed Window's boundary burst lets clients double-spend across window edges; seen in production with billing overcharges.
Token Bucket's refill rate must be tuned: too slow starves legitimate bursts, too fast defeats the limit.
Rule: always test with burst traffic patterns, not just steady-state.
Key Takeaway
Token Bucket for user-facing APIs — bursts feel natural.
Sliding Window for audit-critical systems — exact counts.
Leaky Bucket for downstream service protection — smooth output.
Choosing an Algorithm
Need a simple, low-overhead rate limiter? → Use Fixed Window
Require exact, audit-grade counts? → Use Sliding Window Log
Want the best cost/accuracy trade-off? → Use Sliding Window Counter
User-facing API with bursty traffic? → Use Token Bucket
Need perfectly smooth output downstream? → Use Leaky Bucket

All Five Algorithms Compared: Pros, Cons, and Best Fit

Choosing the right algorithm depends on your traffic patterns, accuracy needs, and operational cost. Here's a breakdown of each algorithm with explicit pros and cons.

### Fixed Window
- Pros: Extremely simple to implement, low memory (one counter per client per window), no need for timestamps beyond window start.
- Cons: Suffers from the boundary burst problem: clients can double their rate at window edges. Not suitable for any use case requiring smooth or fair distribution.
- Best fit: Internal tools, low-stakes APIs where occasional bursts are acceptable.

### Sliding Window Log
- Pros: Perfectly accurate: counts exactly how many requests occurred in the last N seconds. No boundary burst. Ideal for audit-trail requirements.
- Cons: Memory-inefficient: stores a timestamp per request, so memory is O(n) per client and can become expensive for high-traffic clients.
- Best fit: Security rate limiting, financial systems, or any scenario where you need an exact log.

### Sliding Window Counter
- Pros: Excellent trade-off between accuracy and memory. Uses two counters (current and previous window) and a weighted average. Approximation error is small (typically <1%).
- Cons: Slightly more complex than Fixed Window. Not perfectly exact.
- Best fit: Most production APIs. Delivers smooth limiting without the memory cost of Sliding Window Log.

### Token Bucket
- Pros: Allows natural bursts up to capacity, then enforces a steady-state average via refill rate. Intuitive parameters map to product limits. Widely used by major APIs (Stripe, GitHub, AWS).
- Cons: Refill rate and capacity must be carefully tuned: too aggressive starves legitimate bursts, too lenient defeats the limit. Not exact: a burst can temporarily exceed the average.
- Best fit: User-facing APIs, SDKs, any traffic pattern with natural bursts (app startup, pagination).

### Leaky Bucket
- Pros: Produces perfectly smooth output: requests are processed at a fixed rate. Protects fragile downstream services from any spike.
- Cons: No burst capability: all requests are queued, which can add latency. If the queue fills, new requests are dropped.
- Best fit: Protecting databases, payment gateways, or any downstream service that cannot tolerate sudden load increases.

Quick Decision: Token Bucket vs Leaky Bucket?
Ask yourself: Is perfect smoothness required? If yes → Leaky Bucket. If you want to allow bursts while maintaining average rate → Token Bucket. For everything else, consider Sliding Window Counter as the pragmatic default.
Production Insight
The boundary burst of Fixed Window is not just a theoretical problem — we've seen it cause billing overcharges in a SaaS product where clients exploited window edges to double their usage. Sliding Window Counter eliminates that edge case with minimal extra complexity.
Key Takeaway
No single algorithm fits all. Match the algorithm to your traffic pattern and accuracy requirements. For most APIs, Token Bucket or Sliding Window Counter is the best starting point.

Visualizing Token Bucket and Leaky Bucket

Understanding how these two algorithms work is easier with a visual. The diagram below shows the core flow of Token Bucket (allowing bursts up to capacity) and Leaky Bucket (smoothing output to a fixed rate).

Reading the Diagram
In Token Bucket, tokens are consumed per request and refilled over time. The bucket's capacity allows short bursts. In Leaky Bucket, requests are queued and processed at a constant rate — if the queue is full, the request is immediately dropped.
Production Insight
Diagrams like these are invaluable when explaining algorithm choices to product managers or junior engineers. A picture of the queue overflow in Leaky Bucket often clarifies why it's not a good fit for bursty user traffic.
Key Takeaway
Token Bucket allows bursts; Leaky Bucket enforces smooth output. Visualizing the flow helps choose the right one for your system's constraints.

Algorithm Selection Guide: When to Use Each Rate Limiting Algorithm

Choosing among the five algorithms comes down to three questions: Do you need exact counts? Does your traffic come in bursts? Is your downstream service fragile?

Use Fixed Window when:
- You need a quick, low-overhead limiter for internal tools or non-critical paths.
- You can tolerate occasional double-bursts at window boundaries.
- Memory is tight and you have many clients.

Use Sliding Window Log when:
- Every single request must be audited with perfect accuracy (e.g., billing, security).
- You have the memory budget to store a timestamp per request.
- Boundary bursts are unacceptable.

Use Sliding Window Counter when:
- You want the best accuracy-to-cost ratio.
- You can accept a tiny approximation error (~1%).
- You are building the default rate limiter for most API endpoints.

Use Token Bucket when:
- You are building a user-facing API where bursts are natural (app startup, refresh, pagination).
- You want simple parameters that map to product limits (e.g., "burst up to 10 requests, then 1 per second").
- You follow the pattern of Stripe, GitHub, and AWS.

Use Leaky Bucket when:
- You need to protect a fragile downstream service that cannot tolerate any spikes (e.g., a legacy database, a payment gateway).
- Perfectly smooth output is more important than allowing bursts.
- You accept that a full queue causes immediate request drops.

Interview-Ready: The One-Liner Selection Criteria
If the requirement says 'burst tolerant' → Token Bucket. If 'exact counting' → Sliding Window Log. If 'smooth output' → Leaky Bucket. If 'simple and cheap' → Fixed Window. If 'balanced' → Sliding Window Counter.
Production Insight
I've seen teams adopt Leaky Bucket for user-facing APIs because 'it sounds safer', only to get complaints about latency from queueing. Always match the algorithm to the traffic pattern, not the name.
Key Takeaway
Selection is a three-axis trade-off: accuracy, burst tolerance, and smoothness. Match your dominant axis to the algorithm that excels at it.

Distributed Rate Limiting: Why Redis Is the State Store of Choice

Everything above works perfectly on a single server. The moment you have two servers behind a load balancer, you have a problem: each server has its own in-memory counter. If user_42 hits server A for 8 requests and server B for 8 requests, both servers think the limit hasn't been hit — but the user actually sent 16 requests. Your rate limiter is broken.

The fix is a shared, atomic state store, and Redis is the industry-standard answer. Redis executes commands one at a time on a single thread, so each individual command is atomic. The INCR command increments a key and returns the new value in one atomic operation. Pair that with EXPIRE to auto-delete the key when the window ends, and you have a distributed-safe counter in two Redis commands. One caveat: INCR and EXPIRE are atomic individually, not as a pair, so send them in a MULTI/EXEC transaction or a Lua script. Otherwise a crash between the two leaves a counter with no TTL.

For a sliding window log in Redis, the pattern is slightly more sophisticated: use a sorted set where each member is a unique request ID scored by its timestamp (ZADD), evict old entries with ZREMRANGEBYSCORE, and count the members that fall within the last N seconds with ZCOUNT. This gives you exact counts without the boundary burst problem of fixed windows, at the cost of one sorted-set entry per request.

The critical trade-off: Redis adds a network round-trip (typically 0.1–2ms) to every single request decision. For most APIs that's fine. For ultra-low-latency scenarios (sub-5ms response time targets), consider a two-layer approach: a local in-memory limiter for coarse-grained fast rejection, with Redis as the authoritative source for precise enforcement.
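The two-layer idea can be sketched as a cheap local counter in front of an authoritative check. Everything here is hypothetical: `TwoLayerLimiter`, the `remote_check` callable (standing in for a Redis-backed decision), and the manual `reset_window()` are illustrative names, and the local ceiling is assumed to be set more generously than the real limit.

```python
from typing import Callable, Dict


class TwoLayerLimiter:
    def __init__(self, local_limit: int, remote_check: Callable[[str], bool]):
        self.local_limit = local_limit      # generous per-process ceiling
        self.remote_check = remote_check    # authoritative decision (e.g. Redis-backed)
        self.local_counts: Dict[str, int] = {}
        self.remote_calls = 0               # tracked only to show the saving

    def allow(self, client_id: str) -> bool:
        seen = self.local_counts.get(client_id, 0) + 1
        self.local_counts[client_id] = seen
        if seen > self.local_limit:
            return False                    # fast rejection, no network round-trip
        self.remote_calls += 1
        return self.remote_check(client_id)

    def reset_window(self) -> None:
        """Called when the local window rolls over (timer omitted in this sketch)."""
        self.local_counts.clear()


# The remote check always allows here, so only the local layer rejects.
limiter = TwoLayerLimiter(local_limit=3, remote_check=lambda client_id: True)
results = [limiter.allow("user_42") for _ in range(5)]
print(results)               # [True, True, True, False, False]
print(limiter.remote_calls)  # 3: the two rejected requests never reached the store
```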

io.thecodeforge.ratelimiting.distributed.py (Python)
# io.thecodeforge.ratelimiting.distributed
# Requires: pip install redis
# Assumes Redis running on localhost:6379
# Run: docker run -p 6379:6379 redis

import redis
import time
from typing import Tuple

# ─── Redis connection ────────────────────────────────────────────────────────
# In production, use connection pooling and a Redis Sentinel or Cluster setup
redis_client = redis.Redis(
    host="localhost",
    port=6379,
    db=0,
    decode_responses=True  # Return strings, not bytes
)

class DistributedFixedWindowLimiter:
    """
    Uses Redis INCR + EXPIRE for atomic, distributed-safe counting.
    WHY INCR? It's a single atomic command — no race condition between
    'read current count' and 'write new count' that you'd get in application code.
    """

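    def __init__(self, client: redis.Redis, max_requests: int, window_seconds: int):
        # NOTE: the original listing is cut off at this signature. Everything from
        # here down is a minimal reconstruction, not the author's code.
        self.client = client
        self.max_requests = max_requests
        self.window_seconds = window_seconds

    def is_allowed(self, client_id: str) -> Tuple[bool, int]:
        """Return (allowed, remaining) for the current fixed window."""
        window_id = int(time.time() // self.window_seconds)
        key = f"rate_limit:{client_id}:{window_id}"  # key scheme is an assumption
        # MULTI/EXEC pipeline so INCR and EXPIRE are applied together.
        # The key embeds the window id, so refreshing the TTL only delays cleanup.
        pipe = self.client.pipeline(transaction=True)
        pipe.incr(key)
        pipe.expire(key, self.window_seconds)
        count, _ = pipe.execute()
        return count <= self.max_requests, max(0, self.max_requests - count)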

When Your Rate Limiter Breaks: Debugging Common Production Failures

Rate limiters fail in subtle ways. The most common? Silent degradation where the limiter lets too many requests through because Redis was partitioned and each node counted independently. Another classic: a client gets 429s despite being within limits because the clock on the rate limiter server is skewed a few minutes ahead — the window appears full when it shouldn't be.

False positives happen when the limit key is too coarse (e.g., IP-based limits applied to a corporate NAT). You'll see a flood of support tickets from one company whose employees can't access your API. False negatives happen when Redis goes down and the limiter silently fails open: every request is allowed, nothing is logged, and your service gets hammered.

Debugging these requires visibility. Log every rate limit decision (ALLOW, REJECT, reason) with client ID, endpoint, and time. Use structured logs. If you can't see which decision your limiter made, you're flying blind.

Another trap: burst smoothing. A naive Token Bucket might allow a 5x burst in one second, which downstream databases hate. Always add a secondary per-second limiter before the database layer.

Always Log Every Decision
If you're not logging every rate limit decision with client ID, endpoint, timestamp, and verdict, you can't debug a production issue. Use structured logging — JSON lines — and index them in your logging system. When a customer complains about a 429, you should be able to query their exact requests and see why the limiter decided to reject.
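A structured decision log can be as small as one JSON object per line. The sketch below shows one possible shape; the function name `format_decision` and the field names are assumptions, not a schema from the article.

```python
import json
import time
from typing import Optional


def format_decision(client_id: str, endpoint: str, verdict: str,
                    reason: str, limit: int, count: int,
                    timestamp: Optional[float] = None) -> str:
    """Serialize one rate limit decision as a single JSON line."""
    record = {
        "ts": time.time() if timestamp is None else timestamp,
        "client_id": client_id,
        "endpoint": endpoint,
        "verdict": verdict,   # "ALLOW" or "REJECT"
        "reason": reason,
        "limit": limit,
        "count": count,
    }
    return json.dumps(record, sort_keys=True)


line = format_decision("user_free_42", "/search", "REJECT",
                       "window_limit_exceeded", limit=10, count=11,
                       timestamp=1700000000.0)
print(line)
```

With records like this indexed, answering "why did this customer get a 429?" becomes a query on `client_id` and `verdict` rather than guesswork.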
Production Insight
A clock skew of 10 seconds causes sliding windows to shift, creating phantom overages.
Redis network partitions cause split-brain counting — two servers each think they're the only counter.
Rule: use NTP monitoring on all nodes and add a /health/rate-limiter endpoint that shows current state.
Key Takeaway
False positives ruin user trust; false negatives ruin your system.
Log every rate limit decision with a structured schema.
Add a health endpoint that exposes current counters — essential for debugging.

Where to Place Rate Limiters: API Gateway, Application Layer, or Load Balancer?

Rate limiters can live at different layers of your infrastructure. Each placement has trade-offs in terms of flexibility, latency, and management overhead.

### API Gateway (e.g., Kong, AWS API Gateway, Apigee)
- Pros: Centralized configuration, so you can change limits without touching application code. Often includes built-in support for token bucket or fixed window. Easiest to maintain across many services.
- Cons: Limited algorithm choices, since most gateways only offer fixed window or basic token bucket. Can become a bottleneck if not scaled. Adds unavoidable latency (typically 1-5ms).
- Best for: Teams with multiple microservices that need uniform rate limit policies. Also good for external-facing APIs where a separate gateway already exists.

### Application Layer (within your service code)
- Pros: Full control: you can use any algorithm, tie limits to business logic (e.g., different limits per user role), and implement per-endpoint cost-based limits. Lowest latency (in-memory counters).
- Cons: Requires implementing and maintaining the rate limiter in each service. Harder to enforce cross-service limits (e.g., total requests across all endpoints). Code duplication risk.
- Best for: Services with custom or complex rate limit needs. Also for latency-sensitive endpoints where an extra network hop is unacceptable.

### Load Balancer (e.g., Nginx, HAProxy, AWS ALB)
- Pros: Very low overhead, because load balancers are optimized for fast packet processing. Can rate-limit before requests even reach your application, providing a hard outer defense. Some support connection limiting and request rate limiting.
- Cons: Limited to simple algorithms (mostly fixed window or connection-based). Cannot inspect request bodies or custom headers deeply. Difficult to implement per-user or per-tier limits (load balancers typically operate at IP level).
- Best for: First line of defense against DDoS or brute-force attacks. Use for coarse-grained IP-based limits before applying finer-grained limits in the application.

### Recommendation: Layered Approach
Start with a load balancer for IP-based rate limiting (protect against DDoS). Add an API Gateway for tenant-based limits if you have multiple services. Then implement application-layer rate limiting with Token Bucket for user-specific logic. This gives you defense in depth.

Production Pattern: Defense in Depth
Use all three layers: Load Balancer for IP-based DDoS protection, API Gateway for per-tier global limits, and Application Layer for per-user granularity. Each layer adds a small latency overhead but provides redundancy if one layer fails.
Production Insight
One team I worked with relied solely on an API Gateway for rate limiting. When the gateway went down during a traffic spike, there was no backup — the entire API was exposed. Adding a simple in-app fallback limiter prevented the outage on the next incident.
Key Takeaway
Placement is not either/or — use a layered strategy. Start with load balancer coarse limits, add gateway for multi-service policies, and finish with application-level per-user logic.
● Production incident · Post-mortem · Severity: high

Black Friday Rate Limit Misconfiguration Caused 30-Minute API Outage

Symptom
Customers on the 'Enterprise' tier started receiving 429 errors during peak hours, even though they were within their 10,000 req/min limit. Support tickets flooded in within 2 minutes.
Assumption
The team assumed every node was reading the updated 'Enterprise' rule (10,000 req/min) from the config store, but a hot-reload bug had left the old rule (1,000 req/min) active on some nodes.
Root cause
A configuration change to increase Enterprise limits from 1,000 to 10,000 req/min was deployed via a hot-reload script that updated the in-memory rule store. However, the script updated only 7 of the 10 server nodes; the other 3 still served the old limit. Round-robin load balancing sent about 30% of requests to those stale nodes, where they were checked against the old 1,000 req/min limit and rejected.
Fix
Rolled back the configuration change, restarted all nodes, and deployed a new rule store with a versioned schema and atomic swap. Added a GET /health/rate-limiter endpoint that exposed current rules per tier. The monitoring team added a Prometheus alert on per-node rejection rates >1%.
Key lesson
  • A hot-reloadable rule store can diverge across nodes — always version your rules and validate they match across all instances.
  • Rolling updates of rate limit rules should be gradual, with automated rollback on anomaly detection.
  • Add a health endpoint that returns current limits; it saves hours during post-mortem analysis.
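The "version your rules and validate they match" lesson can be sketched with a fingerprint comparison. This is an illustration under assumptions: the helper names, the node names, and the idea of collecting each node's active rules (for example from a health endpoint) are hypothetical, not the team's actual tooling.

```python
import hashlib
import json
from typing import Dict, List, Tuple

Rules = Dict[str, Tuple[int, int]]  # tier -> (max_requests, window_seconds)


def rules_fingerprint(rules: Rules) -> str:
    """Stable short hash of a rule set; equal rules give equal fingerprints."""
    canonical = json.dumps(rules, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def find_stale_nodes(node_rules: Dict[str, Rules], expected: Rules) -> List[str]:
    """Return the nodes whose active rules differ from the expected set."""
    want = rules_fingerprint(expected)
    return sorted(name for name, rules in node_rules.items()
                  if rules_fingerprint(rules) != want)


new_rules: Rules = {"enterprise": (10000, 60)}
old_rules: Rules = {"enterprise": (1000, 60)}

# Ten nodes; the reload missed node-08, node-09 and node-10 (illustrative names).
nodes = {f"node-{i:02d}": (old_rules if i > 7 else new_rules) for i in range(1, 11)}

stale = find_stale_nodes(nodes, new_rules)
print(stale)  # ['node-08', 'node-09', 'node-10']
print(f"Share of round-robin traffic on stale nodes: {len(stale) / len(nodes):.0%}")  # 30%
```

Run after every rule deployment, a check like this would flag the drift before customers do.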
Production debug guide: triage the most common rate limiter failures in production
Symptom · 01
Client gets 429 even though they are within documented limits
Fix
Check the X-RateLimit-Limit and X-RateLimit-Remaining headers on the rejected request. Verify the client ID being used for limiting (might be IP instead of user ID). Compare across multiple nodes to detect configuration drift.
Symptom · 02
Rate limiter allows too many requests (false negatives)
Fix
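Check Redis connectivity: if Redis is down and the code falls back to ALLOW, you're unthrottled. Inspect Redis key TTLs as well: if EXPIRE was never applied because of a pipeline error, counters persist forever (this usually surfaces as permanent 429s rather than over-allowing, but it points at the same broken code path). Also check clock skew (NTP): a skewed clock shifts window boundaries.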
Symptom · 03
Intermittent 429s for the same client across different endpoints
Fix
Verify you're not double-counting: a global limiter and a per-endpoint limiter both counting the same request? Use a single counter per client per endpoint, or apply a multiplicative penalty.
Symptom · 04
Support tickets from a single large company about access issues
Fix
Check if limit key is IP address — a corporate NAT routes thousands of employees through one IP. Switch to authentication-based user ID. Also check if the company is hitting a per-IP rate limit designed for unauthenticated traffic.
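The switch from IP-based to identity-based keys comes down to how the limit key is built. A minimal sketch, assuming a hypothetical `rate_limit_key` helper and key scheme:

```python
from typing import Optional


def rate_limit_key(user_id: Optional[str], client_ip: str, endpoint: str) -> str:
    """Prefer the authenticated identity; use the IP only for anonymous traffic."""
    if user_id:
        return f"rate_limit:user:{user_id}:{endpoint}"
    return f"rate_limit:ip:{client_ip}:{endpoint}"


# Two employees behind the same corporate NAT address get separate counters.
print(rate_limit_key("alice", "203.0.113.7", "/search"))
print(rate_limit_key("bob", "203.0.113.7", "/search"))
# An unauthenticated request from that address falls back to the shared IP key.
print(rate_limit_key(None, "203.0.113.7", "/search"))
```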
★ Quick Debug Cheat Sheet: Rate Limiter
Three commands to diagnose rate limiter health in under 30 seconds
429s across all clients
Immediate action
Check rate limiter health endpoint (if any) — verify rules and counters.
Commands
curl http://localhost:8080/health/rate-limiter
redis-cli --scan --pattern 'rate_limit:*' | head -20
Fix now
Restart the rate limiter service; if Redis is down, restart Redis or fall back to a local limiter.
No rate limiting happening (burst allowed)
Immediate action
Check whether Redis is reachable and has counters, and whether the code falls back to ALLOW without logging.
Commands
redis-cli ping
redis-cli get 'rate_limit:test:123'
Fix now
If Redis is down, restart it; if the fallback is the bug, push a config that fails closed for critical endpoints.
Clock skew suspected
Immediate action
Check NTP sync on all rate limiter nodes.
Commands
timedatectl status | grep 'synchronized'
ntpq -p
Fix now
Restart NTP service or configure chrony; a skew >5 seconds causes phantom overages.