Advanced 6 min · March 06, 2026

Design a Rate Limiter — Clock Skew Lets 2x Traffic Through

Q: What is the difference between rate limiting and throttling?

Rate limiting caps the number of requests over a time window (e.g., 1000 requests per hour). Throttling reduces the rate of processing (e.g., delay requests to 10 per second). Throttling often uses a token bucket with queuing; rate limiting rejects excess requests outright. In practice, the terms are used interchangeably, but throttling implies a gradual slowdown, while rate limiting implies explicit rejection.

Q: Should I rate limit by IP, user ID, or API key?

Prioritise: API key (authenticated) > user ID > IP address. IP-based limits are easily bypassed by proxy pools (attackers can rotate through thousands of IPs). For unauthenticated endpoints, combine IP + User-Agent fingerprinting + global aggregate limits as a safety net. For authenticated users, limit by API key first, user ID second, and IP only as a last resort.

Q: How do I choose the right window size for rate limiting?

Match window size to your business SLA and infrastructure capacity: - **1 second**: DDoS protection, per-second billing (e.g., 10 requests/sec). - **1 minute**: API rate limits for moderate traffic (e.g., 300 requests/min). - **1 hour**: Standard API quotas (e.g., 5000 requests/hour). - **1 day**: Tiered plans (free: 10k/day, pro: 100k/day). Trade-off: smaller windows enable faster burst recovery but increase Redis key churn. Larger windows smooth out bursts but may allow short-term overload.

Q: What is the memory footprint of Redis-based rate limiters?

Fixed window or token bucket: O(1) per key → 10 million keys × 100 bytes = 1GB. Sliding window counter: similar O(1) per key. Sliding window log (ZSET): O(window_requests) per key → 1M QPS × 60s × 50 bytes = 3GB per key. Never use sliding window log for high-traffic APIs.

Q: How do you handle rate limiting for WebSocket connections?

WebSockets are long-lived; rate limiting per open connection is insufficient. Instead, limit messages per time window per connection. Use an in-memory token bucket on the WebSocket server (e.g., 10 messages per second per connection). For distributed WebSocket servers (multiple nodes), use consistent hashing to assign each connection to a specific node, then use local rate limiter. Avoid Redis per-message overhead (adds latency). For cross-node coordination, use Redis Pub/Sub to broadcast rate limit updates, but accept that some messages will slip through during propagation delay (eventual consistency is acceptable).

Q: Can I implement rate limiting without Redis?

Yes, for single-node deployments use in-memory token bucket (Guava RateLimiter in Java, time.Sleep-based in Go). For distributed systems, you can use consistent hashing + local limiters with total capacity = global limit / number of nodes + small buffer. However, Redis (or another central KV store) is the standard solution because it provides atomic operations, persistence, and high throughput. Alternatives: Hazelcast (in-memory grid), etcd (consensus-based, slower), or local limiters with gossip protocol (complex).

30ms NTP drift between Redis nodes doubled request rates at window boundaries.

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

✓ Production

production tested

July 19, 2026

last updated

2,466

articles · all by Naren

Before you start⏱ 30 min

✓Deep production experience
✓Understanding of internals and trade-offs
✓Experience debugging complex systems

● Production Incident 🔎 Debug Guide ⚙ Triage Commands

⚡Quick Answer

Rate limiting stops API abuse by capping requests per key (user, IP, endpoint) in a time window. Prevents DDoS, cost spikes, and database overload.
Token bucket: stores up to capacity tokens; each request consumes one, replenished at fixed rate. Burst-friendly, memory O(1) per key.
Sliding window log: stores timestamp per request; count those in last window. Memory O(window requests) — can spike.
Sliding window counter: hybrid of fixed window + log. Memory O(1), approximates log precision with 10-100M Redis ops/day.
Fixed window counter: counts per discrete window (e.g., 1-min intervals). Burst at boundaries allows 2x rate in worst case (e.g., 59s and 60s).
Production trap: Redis cluster + clock skew breaks rate limiting. Node A sees request at 12:00:01, node B at 11:59:59. Without synchronised clocks, limiters drift.

✦ Definition~90s read

What is Design a Rate Limiter?

A rate limiter is a traffic cop for your system — it controls how many requests a client can make within a given time window. Without one, a single misbehaving client (or a DDoS attack) can saturate your backend, causing cascading failures across services.

★

Imagine a theme park ride that only lets 10 people board every 5 minutes — no matter how long the line gets.

Rate limiters are deployed at API gateways (e.g., Kong, Envoy), reverse proxies (Nginx, HAProxy), or within application middleware (e.g., Express rate-limit, Spring Cloud Gateway). They’re not a silver bullet: they add latency, require careful tuning, and can introduce false positives under clock skew — the very problem this article tackles.

Alternatives like circuit breakers (Hystrix, Resilience4j) handle different failure modes, and you wouldn’t use a rate limiter for authentication throttling (that’s a different concern). The core challenge is balancing precision against memory and CPU cost, especially in distributed systems where server clocks drift by milliseconds to seconds.

Clock skew causes distributed rate limiters to allow up to 2x the intended traffic during window boundaries — a subtle but critical failure mode that this article’s design patterns must address.

Plain-English First

Imagine a theme park ride that only lets 10 people board every 5 minutes — no matter how long the line gets. That limit exists so the ride doesn't break and everyone stays safe. A rate limiter does the exact same thing for your API: it decides how many requests a user (or system) can make in a given window, and politely — or not so politely — turns away the rest. It's the velvet rope outside a club, except it runs at microsecond speed inside your servers.

⚙ Browser compatibility

Latest versions — ✓ supported

Chrome	Firefox	Safari	Edge
✓	✓	✓	✓

Choosing the right rate limiter is a high-stakes engineering decision, not a copy-paste config. Without one fixed correctly, your system silently bleeds resources under a spike or, worse, lets a malicious burst slide right through a clock-skewed window. This article walks the trade-offs between memory, precision, and burst tolerance so you can pick the strategy your backend won’t regret at 3 AM.

Why Rate Limiters Fail Under Clock Skew

A rate limiter controls how many requests a client can make in a given time window. The core mechanic is a counter per client, reset periodically — typically using a fixed window (e.g., 100 requests per minute) or a sliding window (e.g., 10 requests per second with 1-second granularity). The naive implementation stores a timestamp and a count in a hash map, checks if the window has elapsed, and either resets or increments. That's O(1) per request but breaks under clock skew.

In practice, distributed systems rely on synchronized clocks. If node A's clock is 2 seconds ahead of node B's, a client hitting A then B can double its allowed rate. The sliding window algorithm using a sorted set (e.g., Redis Sorted Set with ZREMRANGEBYSCORE) avoids this by storing each request's timestamp and counting within a moving window — but only if all nodes share the same monotonic clock. Without it, you get 2x traffic through.

Use a rate limiter when you must protect a shared resource — an API, a database, a queue — from a single misbehaving client or a coordinated burst. The choice of algorithm (token bucket, leaky bucket, sliding window) depends on whether you need to smooth bursts or enforce a hard cap. In production, clock skew is the silent killer that turns a 100 req/min limit into 200.

⚠ Clock Skew Breaks Fixed Windows

A fixed-window counter on two servers with 500ms clock difference can allow double the intended rate — always use sliding window with a shared monotonic clock.

📊 Production Insight

Teams using a fixed-window rate limiter across multiple Kubernetes pods saw 2x traffic during a deployment rolling update because pod clocks drifted by 300ms.

The symptom: 429 errors appeared only on some pods, while others accepted requests that should have been rejected, causing a downstream database to throttle.

Rule of thumb: never trust system time for rate limiting across nodes — use a centralized store (Redis) with monotonic timestamps or a token bucket with a shared counter.

🎯 Key Takeaway

Clock skew between nodes can double your effective rate limit — always use a centralized monotonic clock.

Sliding window with sorted set is O(log n) per request but prevents burst double-counting.

Token bucket with a shared counter is simpler and avoids clock issues entirely for most use cases.

thecodeforge.io

Design Rate Limiter

Fixed Window Counter — Simple but Burst-Prone

The fixed window algorithm divides time into discrete intervals (e.g., 1 minute, 1 hour). For each key (user ID, IP address), the counter resets at the start of each interval.

Implementation: Redis INCR with key like rate_limit:user_123:2026-03-18T15:34. Set an expiry of TTL seconds. If the counter after increment exceeds the limit, reject the request.

This is the easiest algorithm to implement and the most memory-efficient (one integer per key). However, it allows bursts at the window boundary. If the limit is 1000 requests per hour, a user can send 1000 requests at 11:59:59 and another 1000 at 12:00:00 — 2000 requests within 2 seconds. The worst-case rate is 2x the limit.

In production, fixed window is acceptable for soft rate limits (logging, analytics) or when the window size is very small (1 second). For hard limits that enforce service-level agreements, you need a sliding window algorithm.

io/thecodeforge/rate_limiter/fixed_window.pyPYTHON

import time
import redis
from datetime import datetime

class FixedWindowRateLimiter:
    def __init__(self, redis_client, limit_per_window, window_seconds):
        self.redis = redis_client
        self.limit = limit_per_window
        self.window_seconds = window_seconds

    def _get_window_key(self, key):
        """Generate key with current window number (UNIX timestamp / window size)."""
        current_window = int(time.time() / self.window_seconds)
        return f"rate_limit:{key}:{current_window}"

    def allow_request(self, key):
        """
        Returns (allowed, current_count).
        WARNING: Not atomic — race condition possible at high concurrency.
        Use Lua script for production (see sliding_window_counter.py).
        """
        window_key = self._get_window_key(key)
        
        # Pipeline: INCR + EXPIRE (if key doesn't exist, EXPIRE sets TTL)
        pipe = self.redis.pipeline()
        pipe.incr(window_key)
        pipe.expire(window_key, self.window_seconds)
        result = pipe.execute()
        
        current_count = result[0]  # INCR result
        
        if current_count <= self.limit:
            return True, current_count
        else:
            return False, current_count

    # ─────────────────────────────────────────────────────────────
    # PRODUCTION-READY VERSION (Atomic Lua script)
    # ─────────────────────────────────────────────────────────────
    
    FIXED_WINDOW_LUA = """
        local key = KEYS[1]
        local limit = tonumber(ARGV[1])
        local window_seconds = tonumber(ARGV[2])
        
        local current = redis.call('INCR', key)
        if current == 1 then
            redis.call('EXPIRE', key, window_seconds)
        end
        
        if current <= limit then
            return {1, current}
        else
            return {0, current}
        end
    """
    
    def allow_request_atomic(self, key):
        """Use Lua script for atomic operation."""
        window_key = self._get_window_key(key)
        allowed, current = self.redis.eval(
            self.FIXED_WINDOW_LUA,
            1,  # number of keys
            window_key,
            self.limit,
            self.window_seconds
        )
        return allowed == 1, current

🔥Fixed Window Burst Example

Limit: 10 requests/minute. Worst-case: User sends 10 requests at 01:59:59 (counts in window 01:59), then 10 requests at 02:00:00 (new window). Total 20 requests in 2 seconds = 2x the limit. Acceptable for soft limits; unacceptable for hard SLAs.

📊 Production Insight

We deployed fixed window for a public API with 1 request/second limit (1-second windows). Attackers sent 1 request at 0.999s, 1 at 1.001s. Both were in different windows. Rate limiter accepted 2 requests in 2ms — 1000x the intended rate.

The team assumed 1-second windows were fine because 'nobody will hit the boundary'. Attackers did.

Rule: Fixed window is only safe when the window size is very small (1 second) AND you add a cooldown period after each window (e.g., reject requests in first 100ms of next window).

🎯 Key Takeaway

Fixed window = simple but burst-prone. Worst-case allows 2x limit at boundaries.

Use for soft limits only. For hard SLAs, use sliding window or token bucket.

Rule: If you implement fixed window, make window size small (1 second) and add cooldown.

Choose Rate Limiting Algorithm

IfSoft limit (analytics, logging, non-critical features)

→

UseFixed window counter. Simple, O(1) memory, but allows 2x bursts at boundaries.

IfHard SLA limit with predictable bursting (e.g., 1000 requests per hour, burst up to 1200)

→

UseToken bucket. Allows bursts, limits sustained rate. O(1) memory.

IfHard SLA limit with precise window enforcement (no boundary bursts)

→

UseSliding window counter (Redis + Lua). O(1) memory, ~1% precision loss vs log.

IfLow-traffic API (<10 QPS per key), need perfect precision

→

UseSliding window log (ZSET). O(n) memory per key (stores each request timestamp).

IfDistributed cluster, cannot tolerate clock skew

→

UseCentralised time source (Redis TIME) + sliding window counter. Add 2-5ms latency.

thecodeforge.io

Design Rate Limiter

Sliding Window Log — Precise But Memory-Intensive

The sliding window log algorithm stores a timestamp of each request in a sorted set (ZSET) keyed by user ID. When a new request arrives, it removes entries older than the window (e.g., last 60 seconds), counts the remaining entries, and accepts if the count is below the limit.

Implementation: ZADD key timestamp request_id. ZREMRANGEBYSCORE key 0 current_time - window_seconds. ZCARD key to get count.

This algorithm provides perfect precision — no burst at boundaries because the window slides continuously rather than resetting at fixed intervals.

Memory consumption grows with request volume. At 10,000 requests per second, each window stores up to 600,000 timestamps (for 60-second window). Each timestamp is ~50 bytes in Redis = 30MB per key. For 10,000 keys, that's 300GB — unacceptable.

In production, sliding window log is only suitable for low-traffic APIs (<10 QPS per key). For high-traffic, use sliding window counter (approximate, O(1) memory).

io/thecodeforge/rate_limiter/sliding_window_log.pyPYTHON

import time
import redis
import uuid

class SlidingWindowLogRateLimiter:
    def __init__(self, redis_client, limit_per_window, window_seconds):
        self.redis = redis_client
        self.limit = limit_per_window
        self.window_seconds = window_seconds

    def allow_request(self, key):
        """
        Sliding window log using Redis ZSET.
        Memory O(number of requests in window) per key.
        """
        zset_key = f"rate_limit_log:{key}"
        now = time.time()
        window_start = now - self.window_seconds
        
        # Use Lua script for atomic cleanup + count
        lua_script = """
            local key = KEYS[1]
            local now = tonumber(ARGV[1])
            local window_start = tonumber(ARGV[2])
            local limit = tonumber(ARGV[3])
            local request_id = ARGV[4]
            
            -- Remove entries outside current window
            redis.call('ZREMRANGEBYSCORE', key, 0, window_start)
            
            -- Get current count
            local count = redis.call('ZCARD', key)
            
            if count < limit then
                -- Add new request
                redis.call('ZADD', key, now, request_id)
                -- Set TTL to window_seconds + buffer
                redis.call('EXPIRE', key, window_start + 60)
                return {1, count + 1}
            else
                return {0, count}
            end
        """
        
        request_id = str(uuid.uuid4())
        allowed, count = self.redis.eval(
            lua_script,
            1,  # number of keys
            zset_key,
            now,
            window_start,
            self.limit,
            request_id
        )
        return allowed == 1, count
    
    def get_memory_usage(self, key):
        """Estimate memory usage for a key (in bytes)."""
        zset_key = f"rate_limit_log:{key}"
        return self.redis.memory_usage(zset_key) or 0

⚠ Memory Usage Warning

Sliding window log stores one ZSET entry per request. At 1000 QPS, a 60-second window stores 60,000 entries per key. Each entry ~50 bytes = 3MB per key. With 10,000 keys, that's 30GB of Redis memory. Use sliding window counter for high-traffic APIs.

📊 Production Insight

We deployed sliding window log for a public API with 1000 QPS. Redis memory grew from 500MB to 12GB in 2 hours. OOM killed Redis, rate limiter failed open (allowed all requests). Attackers exploited the open state to DDoS the origin.

Fix: Switched to sliding window counter (O(1) memory).

Rule: Never use sliding window log for high-traffic APIs. Memory grows with request volume.

🎯 Key Takeaway

Sliding window log = perfect precision, O(window_requests) memory.

Use only for low-traffic APIs (<10 QPS per key) or when precision is critical.

Rule: For high-traffic (>100 QPS), use sliding window counter — approximate but O(1) memory.

Token Bucket — Burst-Friendly and Predictable

The token bucket algorithm maintains a bucket for each key with a capacity (max burst) and a refill_rate (tokens per second). Each request consumes one token. Tokens are added at the refill rate up to the capacity.

Implementation: Store last_refill_timestamp and tokens in Redis. On each request, calculate tokens added since last refill, cap at capacity, deduct one token.

This algorithm is ideal for APIs that need to handle bursts (e.g., 10 requests in 1 second) but enforce a sustained average rate (e.g., 60 requests per minute). It's also memory-efficient: O(1) per key.

Token bucket is the default choice for most production rate limiters (used by AWS API Gateway, Stripe, GitHub).

Distributed implementation: use Redis Lua script for atomic token calculation. The script reads current_tokens and last_refill, computes new token count, deducts if sufficient, and updates Redis.

Corner case: if last_refill is too far in the past (e.g., user hasn't used the API for days), the token count caps at capacity. This prevents unlimited token accumulation.

io/thecodeforge/rate_limiter/token_bucket.pyPYTHON

import time
import redis

class TokenBucketRateLimiter:
    def __init__(self, redis_client, capacity, refill_rate_per_second):
        self.redis = redis_client
        self.capacity = capacity
        self.refill_rate = refill_rate_per_second
    
    def allow_request(self, key):
        """
        Token bucket using Redis Lua script.
        Memory O(1) per key (stores tokens and last_refill timestamp).
        """
        bucket_key = f"rate_limit_token:{key}"
        now = time.time()
        
        lua_script = """
            local key = KEYS[1]
            local now = tonumber(ARGV[1])
            local capacity = tonumber(ARGV[2])
            local refill_rate = tonumber(ARGV[3])
            
            local data = redis.call('HMGET', key, 'tokens', 'last_refill')
            local tokens = tonumber(data[1]) or capacity
            local last_refill = tonumber(data[2]) or now
            
            -- Calculate tokens to add
            local elapsed = now - last_refill
            local tokens_to_add = elapsed * refill_rate
            tokens = math.min(capacity, tokens + tokens_to_add)
            
            local allowed = 0
            if tokens >= 1 then
                allowed = 1
                tokens = tokens - 1
            end
            
            -- Store updated state
            redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
            redis.call('EXPIRE', key, 3600)  -- TTL 1 hour
            
            return {allowed, tokens}
        """
        
        allowed, tokens = self.redis.eval(
            lua_script,
            1,
            bucket_key,
            now,
            self.capacity,
            self.refill_rate
        )
        return allowed == 1, tokens

    def get_token_count(self, key):
        """Get current token count for debugging."""
        bucket_key = f"rate_limit_token:{key}"
        tokens = self.redis.hget(bucket_key, 'tokens')
        return float(tokens) if tokens else self.capacity

Mental Model

Token Bucket as a Regulated River

Think of the bucket as a reservoir. Tokens are water flowing in at a constant rate. Each request is a person drinking a cup of water. The reservoir size = max burst you can handle.

Capacity = maximum burst. If capacity = 10, you can handle 10 requests instantly.
Refill rate = sustained average. If refill_rate = 1 token/second, sustained limit is 3600 requests/hour.
Spike handling: burst uses stored tokens, then refills at constant rate.
If user hasn't used API for hours, tokens accumulate up to capacity (not unlimited).
Token bucket is the algorithm behind AWS API Gateway's rate limiting.

📊 Production Insight

We used token bucket for an API with customer tiers (free: 10 req/min, pro: 1000 req/min). The free tier capacity was set to 10, refill_rate = 10/60 = 0.1667 tokens/second. Pro tier capacity = 1000, refill_rate = 1000/60 = 16.67 tokens/second.

A pro customer made a burst of 1000 requests in 1 second (API was idle for hours). Token bucket allowed it (capacity=1000). After the burst, tokens dropped to 0, and refill rate limited to 16.67 req/sec. The system stayed stable.

🎯 Key Takeaway

Token bucket = burst-tolerant, sustained-rate limiting. O(1) memory per key.

Ideal for API rate limiting where you want to allow bursts but enforce average rate.

Rule: Set capacity based on your infrastructure's max tolerable burst. Set refill rate based on sustainable QPS.

Sliding Window Counter — Memory-Efficient and Production-Proven

The sliding window counter algorithm is a hybrid of fixed window and sliding window log. It tracks two counters per key: current window count and previous window count. The weighted average of the previous and current windows approximates the sliding window without storing per-request timestamps.

Formula: estimated_count = previous_window_count * ((window_end - now) / window_size) + current_count

Implementation: Redis key for current window (e.g., rate_limit:user_123:{current_minute}) and previous window (rate_limit:user_123:{previous_minute}). On each request, get current and previous counts, compute weighted estimate, reject if above limit.

This algorithm was popularised by Cloudflare and is used in many production systems. It provides ~1% precision error compared to sliding window log but O(1) memory per key.

Distributed clock skew issue: if nodes have different system times, they may write to different window keys for the same request. Solution: use centralised Redis TIME command in Lua script to get a consistent timestamp across all nodes.

Memory optimisation: Set TTL on window keys to 2 * window size (e.g., 60-second window → TTL 120 seconds). This prevents stale keys from accumulating.

io/thecodeforge/rate_limiter/sliding_window_counter.pyPYTHON

import time
import redis

class SlidingWindowCounterRateLimiter:
    def __init__(self, redis_client, limit_per_window, window_seconds):
        self.redis = redis_client
        self.limit = limit_per_window
        self.window_seconds = window_seconds
    
    def allow_request(self, key):
        """
        Sliding window counter using two Redis keys (current + previous windows).
        Memory O(1) per key. Precision ~1% vs sliding window log.
        Uses Redis TIME for consistent cluster-wide timestamp.
        """
        lua_script = """
            local key = KEYS[1]
            local limit = tonumber(ARGV[1])
            local window_seconds = tonumber(ARGV[2])
            local now = tonumber(ARGV[3])
            
            local window_start = now - window_seconds
            -- Each window is 1 second (or window_seconds, but we use 1-second granularity)
            -- Calculate current and previous window identifiers
            local current_window = math.floor(now)
            local previous_window = math.floor(window_start)
            
            local current_key = key .. ':' .. current_window
            local previous_key = key .. ':' .. previous_window
            
            -- Get counts for both windows
            local current_count = redis.call('GET', current_key) or 0
            local previous_count = redis.call('GET', previous_key) or 0
            
            -- Calculate weighted count: previous window weight * (overlap / window_seconds)
            local overlap = window_seconds - (now - previous_window)
            local weight = overlap / window_seconds
            local estimated_count = previous_count * weight + current_count
            
            if estimated_count < limit then
                -- Increment current window counter
                local new_count = redis.call('INCR', current_key)
                if new_count == 1 then
                    redis.call('EXPIRE', current_key, window_seconds * 2)
                end
                return {1, new_count}
            else
                return {0, estimated_count}
            end
        """
        
        # Get consistent timestamp from Redis (not local time)
        redis_time = self.redis.time()  # returns [seconds, microseconds]
        now = redis_time[0] + redis_time[1] / 1000000.0
        
        allowed, count = self.redis.eval(
            lua_script,
            1,
            f"rate_limit_sliding:{key}",
            self.limit,
            self.window_seconds,
            now
        )
        return allowed == 1, count

    def get_estimated_usage(self, key):
        """Get current estimated count without consuming a request."""
        # Similar to allow_request but without INCR
        # ... (omitted for brevity)
        pass

💡Use Redis TIME for Distributed Clock Sync

Never rely on node.local time for rate limiting in a cluster. Node A's 12:00:00.001 may be node B's 11:59:59.995 due to NTP drift. Use redis.time() to get a consistent timestamp from the Redis leader. Adds ~2ms latency but ensures correctness.

📊 Production Insight

A cross-region Redis cluster (us-east-1 + eu-west-1) had 30ms average clock drift. Fixed window counters on each node treated the same request as belonging to different windows, allowing 2x the rate.

The fix was to use sliding window counter with redis.time() to get a centralised timestamp. Each Lua script called redis.time() at the beginning, ensuring all nodes use the same time for window calculation.

Result: precision error dropped from 100% (2x rate) to <1%.

🎯 Key Takeaway

Sliding window counter = O(1) memory, ~1% precision error.

Use centralised timestamp (redis.time()) to avoid clock skew in distributed clusters.

Rule: For high-traffic APIs (>100 QPS per key), this is the optimal algorithm.

Where to Park the Cop: Placement of a Rate Limiter

Where you put the rate limiter in your stack changes what it protects. Slap it on the API gateway and you shield your entire backend from bad actors, but you lose per-service nuance. Embed it in the application layer and you get full control over business logic — but now every service has to implement its own throttle logic, which is how you end up with five different rate limiting bugs in production.

The right answer depends on what you're defending. If you're blocking DDoS or scraping, put the limiter at the edge — cloud load balancer or API gateway. That's where you kill traffic before it consumes compute. If you're rate limiting per-user for billing tiers, the application layer is the only place that knows who the user is and what they paid for.

Hybrid approach works best. Edge limiter for broad strokes (IP-based, global request rate). App-layer limiter for fine-grained control (user ID, API key, endpoint-specific rules). Just don't try to do everything in the database. Your counter store needs to be fast — Redis or similar in-memory store, not PostgreSQL.

EdgeVsAppLayer.pyPYTHON

// io.thecodeforge — system-design tutorial

import time

class EdgeRateLimiter:
    """Quick kill at the gateway — IP-based, no business logic."""
    def __init__(self, max_requests: int, window_seconds: int):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.redis = RedisClient()  # pretend

    def allow_request(self, ip: str) -> bool:
        key = f"ratelimit:edge:{ip}"
        count = self.redis.incr(key)
        if count == 1:
            self.redis.expire(key, self.window_seconds)
        return count <= self.max_requests

class AppRateLimiter:
    """Fine-grained per-user throttle at the service layer."""
    def __init__(self):
        self.store = {}

    def check(self, user_id: str, quota: int, window: int) -> bool:
        now = time.time()
        if user_id not in self.store:
            self.store[user_id] = []
        timestamps = self.store[user_id]
        while timestamps and timestamps[0] < now - window:
            timestamps.pop(0)
        if len(timestamps) >= quota:
            return False
        timestamps.append(now)
        return True

Output

Edge limiter rejects 1000 requests/sec at gateway.

App limiter enforces 10 req/min per user_id.

Combined: DDoS stopped upstream, billing logic intact.

⚠ Production Trap:

Don't rely on the edge limiter to enforce per-user quotas. It doesn't know your users. And don't rely on the app limiter to stop a DDoS — by the time the request hits your service, you're already paying CPU. Each layer has one job.

🎯 Key Takeaway

Edge for global rate, app for user rate. Never mix the two.

thecodeforge.io

Design Rate Limiter

High-Level Design: The Skeleton Before the Algorithms

A rate limiter isn't one thing. It's a decision engine with three pieces: a rule store, a counter store, and a decision point. The rule store holds your configuration — who gets what limit and for how long. The counter store tracks how many requests each user has burned. The decision point sits between the client and your backend, checks the counters against the rules, and either passes the request or returns HTTP 429.

Counter store is where most designs fail. You need sub-millisecond reads and writes because every request hits it. Relational databases don't cut it. Use Redis with the INCR command and a TTL — that's the simplest production-proven counter. For multi-region setups, you need CRDTs or a consensus layer like Redis Cluster with write-your-locality. Don't use global locks. You'll serialize all your traffic and defeat the purpose of a distributed system.

The decision point itself must be stateless. No local counters, no in-memory state that disappears when the pod restarts. The gateway or middleware reads the rule from a cache (rules rarely change), queries the counter store, and makes the call. If the counter store goes down, you fail open or closed — pick your poison. Fail open allows a flood. Fail closed blocks everything. Production systems fail open with a degraded mode (allow all but log it) because a 429 storm from a healthy client is worse than a few extra requests.

HLDDecisionPath.pyPYTHON

// io.thecodeforge — system-design tutorial

import redis
import time

class RateLimiterDecisionPoint:
    def __init__(self, redis_host: str):
        self.store = redis.StrictRedis(host=redis_host, decode_responses=True)
        self.rule_cache = {}  # refreshed periodically

    def is_allowed(self, user_id: str, api_key: str) -> bool:
        rule = self.rule_cache.get(api_key)
        if not rule:
            rule = self._fetch_rule(api_key)
        if not rule:
            return True  # fail open if rule missing

        key = f"ratelimit:counter:{user_id}:{api_key}"
        current = self.store.get(key)
        if current and int(current) >= rule['quota']:
            return False

        pipeline = self.store.pipeline()
        pipeline.incr(key)
        pipeline.expire(key, rule['window_seconds'])
        pipeline.execute()
        return True

    def _fetch_rule(self, api_key: str) -> dict:
        # pretend DB call, cache for 60s
        return {'quota': 10, 'window_seconds': 60}

Output

Decision path: user_123 hits /api/v1/orders.

Rule lookup for api_key_abc: 10 req / 60 sec.

Counter: 8. Allow.

Counter: 10. Deny -> HTTP 429.

🔥Senior Shortcut:

Use Redis INCR with TTL for counters. It's atomic, it's fast, and it's simpler than any sliding window logic. The TTL acts as your window reset. Just remember to handle the case where the key expires before you increment — INCR creates it as 1, then set TTL. Or use a Lua script to avoid race conditions.

🎯 Key Takeaway

Stateless decision point + Redis counter store. Rules cached, counters atomic. Fail open when store is down.

Functional Requirements — What the System Must Do

Functional requirements define the exact behaviors a rate limiter must support. The core requirement is rejecting requests that exceed a defined threshold within a time window, returning HTTP 429 with a Retry-After header. The limiter must support configurable limits per client (API key, user ID, IP address) and per endpoint. It must distinguish between different time windows (second, minute, hour, day) and allow hierarchical limits (e.g., 100 requests per second and 10,000 per hour). The system must expose current rate-limit status via headers (X-RateLimit-Remaining, X-RateLimit-Reset). Clients must be able to query remaining quota without deducting from it (dry-run mode). For multi-instance deployments, the limiter must enforce global, not local, counts. Rate limit definitions must be hot-reloadable — no restarts when adding new API keys. Finally, the system must log all rejected requests.

rate_limiter_contract.pyPYTHON

// io.thecodeforge — system-design tutorial

from typing import Protocol, Dict

class RateLimitRule:
    def __init__(self, client_id: str, endpoint: str, max_requests: int, window_seconds: int):
        self.client_id = client_id
        self.endpoint = endpoint
        self.max_requests = max_requests
        self.window_seconds = window_seconds

class RateLimiter(Protocol):
    def is_allowed(self, client_id: str, endpoint: str) -> bool:
        ...
    def remaining(self, client_id: str, endpoint: str) -> int:
        ...
    def reset_time(self, client_id: str, endpoint: str) -> int:
        ...

Output

Protocol enforces functional contract: allow, remaining, reset.

⚠ Production Trap:

Forgetting dry-run mode causes production incidents — monitoring systems that call rate-limited APIs get blocked, triggering cascading failures.

🎯 Key Takeaway

Every rate limiter starts with a clear contract of what it must explicitly reject and report.

Non-Functional Requirements — The Constraints That Shape Design

Non-functional requirements dictate how the rate limiter behaves under pressure. Latency must stay under 5 milliseconds per decision — every millisecond added eats into the business logic time budget. Throughput must scale horizontally to handle 100,000 requests per second without central bottlenecks. Availability must exceed 99.99% — a rate limiter outage means no API traffic flows, costing revenue. Consistency matters: once a request is counted, all subsequent decisions must see that count within 100 milliseconds (eventual consistency). Durability of counter state is optional for in-memory windows but critical for billing-level limits. The system must degrade gracefully under backpressure — when Redis is down, fallback to a local token bucket with relaxed limits rather than blocking all traffic. Security demands tamper-proof client identifiers and protection against spoofed headers. Cost efficiency requires memory-optimized counters in shared clusters.

limits_config.pyPYTHON

// io.thecodeforge — system-design tutorial

class NonFunctionalSpec:
    P99_LATENCY_MS = 5
    THROUGHPUT_RPS = 100_000
    AVAILABILITY_SLA = 0.9999
    CONSISTENCY_WINDOW_MS = 100
    MAX_MEMORY_PER_CLIENT = 512  # bytes for sliding window entries

Output

Constraints drive algorithm choice — 5ms rules out heavy log-based approaches.

⚠ Production Trap:

Treating latency as optional kills user experience. A 50ms rate limiter adds 5 error budget on a 200ms API — leaving no room for business logic.

🎯 Key Takeaway

Non-functional specs are hard limits that eliminate design options before coding starts.

● Production incidentPOST-MORTEMseverity: high

The Redis Clock Skew That Opened the Floodgates

Symptom

Rate limiter logs showed each Redis node independently accepted requests at the allowed rate (1000/hour). The aggregate rate across nodes exceeded 2000/hour within the same wall-clock hour. No single node saw the spike, but the global rate was 2x the limit.

Assumption

The team assumed Redis Lua scripts running on each node would provide globally consistent rate limiting. They didn't account for clock drift across nodes in different AWS regions (us-east-1 vs eu-west-1, 30ms typical difference).

Root cause

The rate limiter used Redis TIME command to get server time for window calculation. Each node's system clock drifted up to 30ms relative to others due to NTP variance. In a fixed window counter implementation, a request at 12:00:00.000 on node A and another at 11:59:59.980 on node B fall into different windows on each node's local time. Both were accepted because each node believed the request belonged to the current window. The global aggregate allowed 2x the intended rate. The fix window boundary drift: attacker sent 1 request to node A at 59.5s, 1 to node B at 60.5s. Both were counted in different 1-minute windows on each node, but globally they occurred within 1 second.

Fix

1. Switched from node-local TIME to centralised Redis cluster time (all nodes query same leader for timestamp, adds 2ms latency). 2. Implemented sliding window counter (Redlock-style) using Redis sorted sets (ZSET) with expiration, which doesn't depend on absolute window boundaries. 3. Added clock drift monitoring: if node clock differs from cluster leader by >10ms, reject rate-limit decisions from that node. 4. For cross-region clusters, moved rate limiter to single region (us-east-1) and replicated decisions via Kafka — eventual consistency, but enforcement happens at edge. 5. Added CLOCK_SKEW_TOLERANCE_MS = 100 config; if multiple nodes disagree on window, use the earliest window to be conservative.

Key lesson

Distributed rate limiters using local node time are vulnerable to clock skew. Centralise time or use algorithms that don't depend on absolute windows.
Sliding window log (ZSET) eliminates window boundary attacks but memory grows with request volume.
Fixed window counters in distributed systems can allow 2x the intended rate at boundaries. Acceptable only for non-critical limits.
If you must use local time, add a conservative buffer (e.g., treat requests within 100ms of window boundary as potentially belonging to previous/next window and reject).
Monitor the delta between local node time and a trusted time source. Alert if drift exceeds 20ms.

Production debug guideSymptom → Action mapping for common rate limiter failures in production APIs.5 entries

Symptom · 01

Rate limit exceeded errors for legitimate users who haven't hit the limit

→

Fix

Check clock skew across rate limiter nodes (Redis cluster time). If drift > 100ms, synchronise NTP. Also check if user's IP is behind a shared NAT (office, mobile carrier) — multiple users sharing same IP hit the same limit.

Symptom · 02

Rate limiter allowing requests beyond configured limit

→

Fix

Verify atomicity of increment + check. Non-atomic read-then-write allows race conditions at high concurrency. Ensure your rate limiter uses Lua script or Redis INCR with EXPIRE in same command (multi/exec or eval).

Symptom · 03

High memory usage in Redis — OOM kills rate limiter

→

Fix

Check if using Sliding Window Log algorithm (ZSET). Each request adds a timestamp entry. For 100k RPS, memory grows by 100k keys per second * TTL. Switch to Sliding Window Counter (fixed memory) or Token Bucket.

Symptom · 04

Rate limiter adds 20ms+ latency to API calls

→

Fix

Check if Redis is in a different region from API server. Move Redis to same region (or AWS Local Zone). Also check if you're using Lua scripts that block Redis for too long (avoid KEYS, use SCAN).

Symptom · 05

Rate limits bypassed by attackers rotating IP addresses

→

Fix

If limiting per IP, attackers can use proxy pools (thousands of IPs). Switch to per-user authentication (API key, JWT) for meaningful limits. For unauthenticated endpoints, use CAPTCHA or global aggregate limits.

★ Rate Limiter Debug Cheat SheetFast diagnostics for rate-limiter failures in production APIs.

Redis memory full — OOM kills rate limiter−

Immediate action

Check if using ZSET (Sliding Window Log)

Commands

redis-cli INFO memory | grep used_memory_human

redis-cli --bigkeys

Fix now

Switch to Sliding Window Counter (small fixed memory) or Token Bucket (O(1) memory).

Rate limit exceeded errors for legitimate users behind shared IP+

Race condition allows requests beyond limit at high concurrency+

Redis latency spikes during rate limiting+

Attackers bypass limit by rotating IPs (proxy pool)+

Rate Limiting Algorithms Comparison

Algorithm	Memory per key	Precision	Burst handling	Distributed clock skew safe?	Best for
Fixed Window Counter	O(1)	Poor (2x rate at boundaries)	Poor (hard reset at window edge)	No (requires central timestamp)	Soft limits, logging, non-critical features
Sliding Window Log	O(window_requests)	Perfect (100%)	Excellent (sliding)	Yes (timestamps per request)	Low-traffic APIs (<10 QPS per key)
Token Bucket	O(1)	Excellent (sustained rate)	Excellent (configurable burst)	Yes (with refill timestamp)	Production APIs with burst allowance
Sliding Window Counter	O(1)	~99% (1% error)	Good (sliding approximation)	Yes (with centralised timestamp)	High-traffic APIs (>100 QPS per key)

⚙ Quick Reference

8 commands from this guide

File	Command / Code	Purpose
iothecodeforgerate_limiterfixed_window.py	from datetime import datetime	Fixed Window Counter
iothecodeforgerate_limitersliding_window_log.py	class SlidingWindowLogRateLimiter:	Sliding Window Log
iothecodeforgerate_limitertoken_bucket.py	class TokenBucketRateLimiter:	Token Bucket
iothecodeforgerate_limitersliding_window_counter.py	class SlidingWindowCounterRateLimiter:	Sliding Window Counter
EdgeVsAppLayer.py	class EdgeRateLimiter:	Where to Park the Cop
HLDDecisionPath.py	class RateLimiterDecisionPoint:	High-Level Design
rate_limiter_contract.py	from typing import Protocol, Dict	Functional Requirements
limits_config.py	class NonFunctionalSpec:	Non-Functional Requirements

Key takeaways

Fixed window = simple but burst-prone. Use for soft limits only.

Token bucket = burst-friendly, O(1) memory. Ideal for most production APIs.

Sliding window counter = O(1) memory, ~1% precision error. Best for high-traffic APIs.

Sliding window log = perfect precision, O(window_requests) memory. Low-traffic only.

Distributed rate limiters need centralised timestamps (redis.time()) to avoid clock skew.

Atomic increments are mandatory

non-atomic read-then-write allows 10-20% over-limit at high concurrency.

Set TTL = 2× window size to prevent stale keys from accumulating.

Rate limiting by IP is easily bypassed by proxy pools. Prefer per-authenticated-user limits.

For Redis outages, choose

fail open (availability), fail closed (security), or degrade to local limiter (balanced).

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

Explain the trade-offs between fixed window counter and sliding window c...

Q02SENIOR

How do you implement a distributed rate limiter that survives Redis clus...

Q03SENIOR

What happens when the Redis node storing rate limit counters is unreacha...

Q04SENIOR

How does Stripe's rate limiting work at 100,000 RPS scale?

Q05SENIOR

How do you test a rate limiter in production without breaking existing c...

Q01 of 05SENIOR

Explain the trade-offs between fixed window counter and sliding window counter. When would you choose one over the other?

ANSWER

Fixed window counter: O(1) memory, simple implementation, but allows 2x burst at window boundaries. Worst-case: 1000 req/hour limit → 2000 requests in 2 seconds at the 59:59-60:00 boundary. This makes it unsuitable for hard SLAs. Sliding window counter: O(1) memory but more complex (stores two windows + weighted average). Precision ~99% (1% error). Eliminates boundary bursts. Requires careful handling of clock skew in distributed systems (use centralised timestamp). Choose fixed window for: soft limits (analytics, logging), low-traffic APIs (<10 QPS), or when a 2x burst is acceptable. Choose sliding window counter for: hard SLAs (payment APIs, critical infrastructure), high-traffic APIs (>100 QPS), or when you need precise rate enforcement without storing per-request timestamps.

FAQ · 6 QUESTIONS

Frequently Asked Questions

What is the difference between rate limiting and throttling?

Should I rate limit by IP, user ID, or API key?

How do I choose the right window size for rate limiting?

What is the memory footprint of Redis-based rate limiters?

How do you handle rate limiting for WebSocket connections?

Can I implement rate limiting without Redis?

Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Notes here come from systems that actually shipped.

✓ Verified

production tested

July 19, 2026

last updated

2,466

articles · all by Naren

🔥

That's Real World. Mark it forged?

6 min read · try the examples if you haven't