Distributed Caching: How to Stop Your Database From Begging for Mercy
Distributed caching explained with production patterns, failure modes, and code.
20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.
Distributed caching stores hot data across a cluster of servers, reducing database load and latency. Use it when a single cache server can't handle your traffic or you need high availability. The go-to tools are Redis and Memcached, with Redis offering richer data structures and persistence.
Imagine a library where the most popular books are kept at a front desk. A single desk can only serve so many people. Distributed caching is like having multiple front desks across the building, each holding copies of the most requested books. If one desk gets crowded, you go to another. If a desk burns down, the others still have the books. The librarian uses a map (consistent hashing) to decide which desk holds which book, so you always know where to go.
Your database is screaming. It's 2 AM, traffic spiked, and your primary DB is at 100% CPU serving the same 10,000 rows over and over. You've got a cache — a single Redis instance — but it's also melting because every request hits it for the same keys. You need distributed caching. Not because it's trendy, but because your architecture is begging for it.
Distributed caching solves a specific problem: your data access pattern has a hot set that exceeds what one machine can handle. It spreads the load across multiple cache nodes, so no single node becomes the bottleneck. It also gives you fault tolerance — when a node dies, you don't lose the entire cache.
By the end of this, you'll be able to design a distributed cache that survives traffic spikes, node failures, and even your own deployment mistakes. You'll know when to use Redis Cluster vs. Memcached, how consistent hashing actually works, and the exact failure modes that will bite you in production.
Why Your Single Redis Instance Won't Cut It
You've outgrown single-node caching. The problem isn't just capacity — it's throughput. A single Redis instance can handle ~100K ops/sec on a good day. When your traffic demands 500K ops/sec, you need to split the load. But naive sharding (like key % N) is a trap. When you add or remove a node, almost every key remaps to a different server. That means a cache storm — all clients miss simultaneously and hit the database. I've seen this take down a production system at 3 AM. The fix is consistent hashing, which minimizes remapping when the cluster changes.
Cache Topologies: Side Cache vs. Read-Through vs. Write-Through
How your application talks to the cache matters as much as the cache itself. The most common pattern is side cache (aka cache-aside): your app checks cache first, on miss it loads from DB and populates cache. Simple, but you have to handle stale data yourself. Read-through cache (like Redis with a custom module or a proxy) hides the DB call behind the cache — your app just asks the cache, and the cache fetches from DB if missing. Write-through cache updates the cache and DB in the same transaction — ensures consistency but adds latency. Write-behind (write-back) updates cache immediately and DB asynchronously — fast writes but risk of data loss on crash. I've used write-behind for a high-throughput analytics pipeline where losing a few seconds of data was acceptable. For a payments system? Never. Use write-through or side cache with strong consistency checks.
Eviction Policies: What Happens When Memory Runs Out
Your cache has finite memory. When it's full, something has to go. The default in Redis is noeviction — it just returns errors on writes. That's a production disaster waiting to happen. You must set maxmemory-policy. The most common is allkeys-lru: evict the least recently used key regardless of TTL. For caches with TTLs, volatile-lru evicts only keys with TTL set. allkeys-lfu (least frequently used) works well for workloads with skewed access patterns. I once saw a team use noeviction on a Redis instance that stored session data. When memory filled up, users couldn't log in. The fix was adding maxmemory 4gb and maxmemory-policy allkeys-lru. Also, monitor eviction rate via INFO command — if it's high, your cache is too small or your TTLs are too long.
Consistency and Stale Data: The CAP Trade-Off
Distributed caches are eventually consistent by nature. When you update data in the DB, the cache might serve stale data for a while. How stale? That depends on your invalidation strategy. The simplest approach is TTL-based expiry: set a reasonable TTL (e.g., 5 minutes) and accept that data can be up to 5 minutes stale. For stricter consistency, use write-through or publish-subscribe invalidation (e.g., Redis Pub/Sub to broadcast cache invalidation events). But even then, there's a window between DB write and cache invalidation where a read could get stale data. In practice, for most systems, a short TTL (seconds to minutes) is good enough. For systems that can't tolerate any staleness (like inventory in an e-commerce checkout), you might skip the cache entirely for that data or use a distributed lock to synchronize reads and writes. I've seen a booking system that cached seat availability with a 30-second TTL — double bookings happened. They switched to a write-through cache with Redis transactions (MULTI/EXEC) to atomically update DB and cache.
When Not to Use Distributed Caching
Distributed caching isn't free. It adds complexity: network latency (even localhost has overhead), serialization costs, and operational burden (monitoring, backups, cluster management). Don't use it if your entire dataset fits in memory on a single node and your traffic is moderate. A single Redis instance with 32GB RAM can handle most small-to-medium applications. Also, don't cache data that changes every millisecond — the cache invalidation overhead will kill you. For example, real-time stock prices: by the time you cache and serve, the price is stale. Better to use a streaming system. Finally, if your read-to-write ratio is close to 1:1, caching might not help much — you're just adding latency to every write. Measure your actual access patterns before adding a cache layer.
The 4GB Container That Kept Dying
- Never use simple modulo for distributed caching.
- Always set an eviction policy.
- And always plan for node failure — it's not if, but when.
redis-cli INFO stats | grep evicted_keys. If high, increase maxmemory or reduce TTLs. 2. Check if keys are expiring too fast: redis-cli INFO stats | grep expires. 3. Verify hash distribution: use redis-cli --cluster check for Redis Cluster.redis-cli CONFIG GET maxmemory. 2. Check eviction policy: redis-cli CONFIG GET maxmemory-policy. 3. If noeviction, change to allkeys-lru. 4. Reduce TTLs or add more nodes.redis-cli INFO stats | grep evicted_keysredis-cli INFO stats | grep keyspace_hitsKey takeaways
Interview Questions on This Topic
How does Redis Cluster handle a network partition? What happens to reads and writes during a split?
Frequently Asked Questions
20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.
That's Components. Mark it forged?
4 min read · try the examples if you haven't