Senior 4 min · June 25, 2026

Distributed Caching: How to Stop Your Database From Begging for Mercy

Distributed caching explained with production patterns, failure modes, and code.

N
Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

Follow
Production
production tested
June 25, 2026
last updated
1,663
articles · all by Naren
 ● Production Incident 🔎 Debug Guide ⚙ Triage Commands
Quick Answer

Distributed caching stores hot data across a cluster of servers, reducing database load and latency. Use it when a single cache server can't handle your traffic or you need high availability. The go-to tools are Redis and Memcached, with Redis offering richer data structures and persistence.

✦ Definition~90s read
What is Distributed Caching?

Distributed caching is a system where cache data is spread across multiple machines, providing low-latency access to frequently used data while offloading backend databases. It uses techniques like consistent hashing to distribute keys and handle node failures gracefully.

Imagine a library where the most popular books are kept at a front desk.
Plain-English First

Imagine a library where the most popular books are kept at a front desk. A single desk can only serve so many people. Distributed caching is like having multiple front desks across the building, each holding copies of the most requested books. If one desk gets crowded, you go to another. If a desk burns down, the others still have the books. The librarian uses a map (consistent hashing) to decide which desk holds which book, so you always know where to go.

Your database is screaming. It's 2 AM, traffic spiked, and your primary DB is at 100% CPU serving the same 10,000 rows over and over. You've got a cache — a single Redis instance — but it's also melting because every request hits it for the same keys. You need distributed caching. Not because it's trendy, but because your architecture is begging for it.

Distributed caching solves a specific problem: your data access pattern has a hot set that exceeds what one machine can handle. It spreads the load across multiple cache nodes, so no single node becomes the bottleneck. It also gives you fault tolerance — when a node dies, you don't lose the entire cache.

By the end of this, you'll be able to design a distributed cache that survives traffic spikes, node failures, and even your own deployment mistakes. You'll know when to use Redis Cluster vs. Memcached, how consistent hashing actually works, and the exact failure modes that will bite you in production.

Why Your Single Redis Instance Won't Cut It

You've outgrown single-node caching. The problem isn't just capacity — it's throughput. A single Redis instance can handle ~100K ops/sec on a good day. When your traffic demands 500K ops/sec, you need to split the load. But naive sharding (like key % N) is a trap. When you add or remove a node, almost every key remaps to a different server. That means a cache storm — all clients miss simultaneously and hit the database. I've seen this take down a production system at 3 AM. The fix is consistent hashing, which minimizes remapping when the cluster changes.

ConsistentHashingExample.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
// io.thecodeforge — System Design tutorial

// Consistent hashing with virtual nodes
// Each physical node gets multiple virtual nodes on the hash ring
// to distribute load evenly.

import java.util.*;

class ConsistentHash<T> {
    private final int numberOfReplicas;
    private final SortedMap<Integer, T> circle = new TreeMap<>();

    public ConsistentHash(int numberOfReplicas, Collection<T> nodes) {
        this.numberOfReplicas = numberOfReplicas;
        for (T node : nodes) {
            add(node);
        }
    }

    public void add(T node) {
        // Add virtual nodes for each physical node
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.put(hash(node.toString() + i), node);
        }
    }

    public void remove(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.remove(hash(node.toString() + i));
        }
    }

    public T get(Object key) {
        if (circle.isEmpty()) return null;
        int hash = hash(key);
        // Find the first node clockwise from the key's hash
        if (!circle.containsKey(hash)) {
            SortedMap<Integer, T> tailMap = circle.tailMap(hash);
            hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
        }
        return circle.get(hash);
    }

    private int hash(Object key) {
        // Use a good hash function like Murmur3 or FNV-1a
        return key.hashCode() & 0x7fffffff; // Simple for demo, use better in prod
    }
}

// Usage: ConsistentHash<String> cacheNodes = new ConsistentHash<>(200, nodes);
Output
No direct output — this is a data structure. When you add/remove a node, only 1/N keys remap (N = number of nodes).
Production Trap: Hash Function Quality
Using Java's hashCode() for consistent hashing is a disaster. It's not uniformly distributed — you'll get hot nodes. Use Murmur3 or FNV-1a. Redis Cluster uses CRC16. Test your hash distribution with a representative key set before going to production.
Sharding Strategy Decision Tree
IfCluster size rarely changes (< 1 change/month)
UseSimple modulo (key % N) — but only if you can tolerate full cache flush on change.
IfCluster size changes frequently or needs auto-scaling
UseConsistent hashing with 100-200 virtual nodes per physical node.
IfNeed replication for high availability
UseRedis Cluster (built-in) or client-side consistent hashing with primary-replica pairs.
Distributed Caching Topologies and Trade-Offs THECODEFORGE.IO Distributed Caching Topologies and Trade-Offs From single-instance limits to eviction, consistency, and anti-patterns Single Redis Bottleneck Memory, throughput, and failover limits Side Cache Application manages cache aside DB Read-Through Cache Cache layer loads missing data Eviction Policies LRU, LFU, TTL when memory fills CAP Trade-Off Consistency vs. availability in stale data When Not to Cache Rarely-read or rapidly-changing data ⚠ Stale data from eventual consistency Use TTLs and cache invalidation patterns THECODEFORGE.IO
thecodeforge.io
Distributed Caching Topologies and Trade-Offs
Distributed Caching
Sharding: Naive vs. ConsistentTHECODEFORGE.IOSharding: Naive vs. ConsistentHow adding nodes affects your cache keysNaive (key % N)Adding a node reshuffles most keysCache hit rate drops to near zeroRequires full cache warmup on scaleSimple but production-unfriendlyConsistent HashingOnly 1/N keys move on node addHigh cache hit rate maintainedMinimal disruption during scalingIndustry standard for distributed cachesUse consistent hashing to avoid cache avalanche on reshardingTHECODEFORGE.IO
thecodeforge.io
Sharding: Naive vs. Consistent
Distributed Caching

Cache Topologies: Side Cache vs. Read-Through vs. Write-Through

How your application talks to the cache matters as much as the cache itself. The most common pattern is side cache (aka cache-aside): your app checks cache first, on miss it loads from DB and populates cache. Simple, but you have to handle stale data yourself. Read-through cache (like Redis with a custom module or a proxy) hides the DB call behind the cache — your app just asks the cache, and the cache fetches from DB if missing. Write-through cache updates the cache and DB in the same transaction — ensures consistency but adds latency. Write-behind (write-back) updates cache immediately and DB asynchronously — fast writes but risk of data loss on crash. I've used write-behind for a high-throughput analytics pipeline where losing a few seconds of data was acceptable. For a payments system? Never. Use write-through or side cache with strong consistency checks.

CacheAsidePattern.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
// io.thecodeforge — System Design tutorial

// Cache-aside (side cache) pattern for a checkout service
// Uses Redis for caching user cart data.

import redis.clients.jedis.Jedis;
import com.google.gson.Gson;

public class CartService {
    private final Jedis cache;
    private final Database db;
    private final Gson gson = new Gson();
    private static final String CART_PREFIX = "cart:";
    private static final int TTL_SECONDS = 3600; // 1 hour

    public CartService(Jedis cache, Database db) {
        this.cache = cache;
        this.db = db;
    }

    public Cart getCart(String userId) {
        String key = CART_PREFIX + userId;
        String cached = cache.get(key);
        if (cached != null) {
            return gson.fromJson(cached, Cart.class);
        }
        // Cache miss — load from DB
        Cart cart = db.loadCart(userId);
        if (cart != null) {
            // Populate cache with TTL to avoid stale data
            cache.setex(key, TTL_SECONDS, gson.toJson(cart));
        }
        return cart;
    }

    public void updateCart(String userId, Cart cart) {
        // Write-through: update DB first, then invalidate cache
        db.saveCart(userId, cart);
        cache.del(CART_PREFIX + userId);
        // Next read will miss and fetch fresh data
    }
}

// Output: When getCart is called, if cache miss, DB is queried and cache is populated.
// On update, cache is invalidated to ensure consistency.
Output
On first call for user 'abc': cache miss → DB query → cache set. Subsequent calls: cache hit. On update: cache deleted → next read fetches fresh.
Senior Shortcut: Cache Invalidation Strategy
Don't try to update cache on writes — just delete the key. Let the next read populate it. This avoids race conditions where a stale write overwrites a fresh one. It's called 'lazy invalidation' and it's the simplest correct strategy for most systems.
Cache-Aside FlowTHECODEFORGE.IOCache-Aside FlowApplication checks cache first, falls back to DBApp RequestCheck cache for keyCache HitReturn cached dataCache MissQuery databasePopulate CacheWrite result to cacheReturn DataServe to client⚠ Stale data risk: you must handle invalidation yourselfTHECODEFORGE.IO
thecodeforge.io
Cache-Aside Flow
Distributed Caching

Eviction Policies: What Happens When Memory Runs Out

Your cache has finite memory. When it's full, something has to go. The default in Redis is noeviction — it just returns errors on writes. That's a production disaster waiting to happen. You must set maxmemory-policy. The most common is allkeys-lru: evict the least recently used key regardless of TTL. For caches with TTLs, volatile-lru evicts only keys with TTL set. allkeys-lfu (least frequently used) works well for workloads with skewed access patterns. I once saw a team use noeviction on a Redis instance that stored session data. When memory filled up, users couldn't log in. The fix was adding maxmemory 4gb and maxmemory-policy allkeys-lru. Also, monitor eviction rate via INFO command — if it's high, your cache is too small or your TTLs are too long.

RedisEvictionConfig.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
# io.thecodeforge — System Design tutorial

# Redis configuration for production cache
# Set in redis.conf or via CONFIG SET

maxmemory 4gb
maxmemory-policy allkeys-lru
# Also consider:
# maxmemory-policy volatile-lru  # only evict keys with TTL
# maxmemory-policy allkeys-lfu   # evict least frequently used

# Monitor evictions:
# redis-cli INFO stats | grep evicted_keys
# If evicted_keys per second > 100, increase maxmemory or reduce TTLs.
Output
No direct output — configuration. Use INFO stats to see evicted_keys.
Never Do This: noeviction in Production
Setting maxmemory-policy to noeviction (the default!) means Redis will reject write commands when memory is full. Your cache becomes read-only. Applications that try to set keys will get OOM errors. Always set an eviction policy. If you need to guarantee certain keys are never evicted, use a separate Redis instance with noeviction for those keys only.

Consistency and Stale Data: The CAP Trade-Off

Distributed caches are eventually consistent by nature. When you update data in the DB, the cache might serve stale data for a while. How stale? That depends on your invalidation strategy. The simplest approach is TTL-based expiry: set a reasonable TTL (e.g., 5 minutes) and accept that data can be up to 5 minutes stale. For stricter consistency, use write-through or publish-subscribe invalidation (e.g., Redis Pub/Sub to broadcast cache invalidation events). But even then, there's a window between DB write and cache invalidation where a read could get stale data. In practice, for most systems, a short TTL (seconds to minutes) is good enough. For systems that can't tolerate any staleness (like inventory in an e-commerce checkout), you might skip the cache entirely for that data or use a distributed lock to synchronize reads and writes. I've seen a booking system that cached seat availability with a 30-second TTL — double bookings happened. They switched to a write-through cache with Redis transactions (MULTI/EXEC) to atomically update DB and cache.

WriteThroughConsistency.systemdesignSYSTEMDESIGN
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
// io.thecodeforge — System Design tutorial

// Write-through with Redis transaction for seat booking
// Ensures cache and DB are updated atomically.

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Transaction;

public class BookingService {
    private final Jedis cache;
    private final Database db;

    public boolean bookSeat(String eventId, String seatId, String userId) {
        String cacheKey = "seat:" + eventId + ":" + seatId;
        // Watch the key for optimistic locking
        cache.watch(cacheKey);
        String currentOwner = cache.get(cacheKey);
        if (currentOwner != null) {
            // Seat already booked (in cache)
            return false;
        }
        // Attempt to book in DB
        boolean dbSuccess = db.bookSeat(eventId, seatId, userId);
        if (!dbSuccess) {
            cache.unwatch();
            return false;
        }
        // Update cache atomically
        Transaction t = cache.multi();
        t.set(cacheKey, userId);
        t.expire(cacheKey, 3600); // 1 hour TTL
        List<Object> results = t.exec();
        if (results == null) {
            // Transaction failed (key was modified by another client)
            // Rollback DB? Or handle conflict
            return false;
        }
        return true;
    }
}

// Output: Returns true if booking succeeded. Cache and DB are consistent.
Output
Returns true if booking succeeded. Cache and DB are consistent.
Interview Gold: CAP Theorem and Caching
In a distributed cache, you typically sacrifice strong consistency for availability and partition tolerance. That's why TTLs exist — they bound the inconsistency window. If an interviewer asks 'how do you handle cache consistency?', say: 'I use TTLs for bounded staleness, and for critical data, I use write-through with atomic transactions or a distributed lock.'

When Not to Use Distributed Caching

Distributed caching isn't free. It adds complexity: network latency (even localhost has overhead), serialization costs, and operational burden (monitoring, backups, cluster management). Don't use it if your entire dataset fits in memory on a single node and your traffic is moderate. A single Redis instance with 32GB RAM can handle most small-to-medium applications. Also, don't cache data that changes every millisecond — the cache invalidation overhead will kill you. For example, real-time stock prices: by the time you cache and serve, the price is stale. Better to use a streaming system. Finally, if your read-to-write ratio is close to 1:1, caching might not help much — you're just adding latency to every write. Measure your actual access patterns before adding a cache layer.

Senior Shortcut: Measure Before You Cache
Add metrics to your application: cache hit rate, latency percentiles, and database query frequency. If your hit rate is below 80%, your cache is likely doing more harm than good. Use tools like Redis's INFO command or a metrics dashboard (Datadog, Grafana).
● Production incidentPOST-MORTEMseverity: high

The 4GB Container That Kept Dying

Symptom
Every 45 minutes, one Redis node in our cluster would OOM-kill. The rest would spike to 200% CPU, then cascade. The entire payments service went down for 3 minutes each time.
Assumption
We assumed a memory leak in our application code — maybe we weren't setting TTLs correctly.
Root cause
We used a simple modulo-based sharding (key_hash % N). When a node died, the hash space shifted, causing a thundering herd as every client recomputed shards and hammered the remaining nodes. Also, we had no maxmemory-policy set, so Redis used default noeviction — it just crashed instead of evicting.
Fix
Switched to consistent hashing with 200 virtual nodes per physical node. Set maxmemory-policy allkeys-lru. Added a second replica for each key range. Pre-warmed cache on restart by replaying last 5 minutes of traffic from a log.
Key lesson
  • Never use simple modulo for distributed caching.
  • Always set an eviction policy.
  • And always plan for node failure — it's not if, but when.
Production debug guideSystematic recovery paths for the failure modes engineers actually hit.3 entries
Symptom · 01
High cache miss rate (>30%) under load
Fix
1. Check eviction rate: redis-cli INFO stats | grep evicted_keys. If high, increase maxmemory or reduce TTLs. 2. Check if keys are expiring too fast: redis-cli INFO stats | grep expires. 3. Verify hash distribution: use redis-cli --cluster check for Redis Cluster.
Symptom · 02
Cache node OOM kill (out-of-memory)
Fix
1. Check maxmemory setting: redis-cli CONFIG GET maxmemory. 2. Check eviction policy: redis-cli CONFIG GET maxmemory-policy. 3. If noeviction, change to allkeys-lru. 4. Reduce TTLs or add more nodes.
Symptom · 03
Thundering herd on DB after cache node failure
Fix
1. Implement consistent hashing with replication. 2. Add circuit breaker on DB calls (e.g., Hystrix). 3. Pre-warm cache on node restart by replaying recent traffic from logs. 4. Use a local cache (L1) to absorb initial misses.
★ Distributed Caching Triage Cheat SheetFirst-response commands for when things go wrong — copy-paste ready.
Cache miss rate spike — `INFO stats` shows `keyspace_misses` high
Immediate action
Check if evictions are happening
Commands
redis-cli INFO stats | grep evicted_keys
redis-cli INFO stats | grep keyspace_hits
Fix now
Increase maxmemory by 20% or reduce TTLs. Set maxmemory-policy allkeys-lru if not set.
Redis OOM error in logs — `OOM command not allowed when used memory > 'maxmemory'`+
Immediate action
Check current memory and maxmemory
Commands
redis-cli INFO memory | grep used_memory_human
redis-cli CONFIG GET maxmemory
Fix now
Set maxmemory-policy allkeys-lru: redis-cli CONFIG SET maxmemory-policy allkeys-lru. Increase maxmemory if possible.
High latency on cache gets — `SLOWLOG` shows many slow commands+
Immediate action
Check for large value sizes or complex commands
Commands
redis-cli SLOWLOG GET 10
redis-cli --bigkeys
Fix now
Reduce value sizes (compress or split). Avoid O(N) commands like KEYS or SMEMBERS on large sets.
Cache cluster split-brain or inconsistent data+
Immediate action
Check cluster nodes and replication
Commands
redis-cli CLUSTER NODES
redis-cli CLUSTER INFO
Fix now
For Redis Cluster, ensure cluster-require-full-coverage no to allow partial availability. Manually failover if needed: redis-cli CLUSTER FAILOVER.
Feature / AspectRedis ClusterMemcached
Data StructuresStrings, hashes, lists, sets, sorted sets, streamsStrings only (key-value)
PersistenceRDB snapshots, AOF logsNone (volatile)
ReplicationBuilt-in (master-replica, cluster)No built-in replication (client-side)
ShardingAutomatic via hash slots (CRC16)Client-side consistent hashing
Eviction PoliciesMultiple (LRU, LFU, TTL, etc.)Only LRU
Use CaseRich data, persistence needed, complex operationsSimple key-value, high throughput, ephemeral data

Key takeaways

1
Distributed caching is about spreading load and surviving failures
not just adding memory. Use consistent hashing to minimize disruption when nodes change.
2
Always set an eviction policy (allkeys-lru) and a TTL. Without them, your cache becomes a memory leak that eventually crashes.
3
Cache consistency is a spectrum. Use TTLs for bounded staleness, write-through for strong consistency, and lazy invalidation (delete on write) for simplicity.
4
Measure before you cache. If your hit rate is below 80%, you're adding complexity without benefit. Profile your access patterns first.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
How does Redis Cluster handle a network partition? What happens to reads...
Q02SENIOR
When would you choose Memcached over Redis Cluster in a production syste...
Q03SENIOR
What happens when a consistent hashing ring is imbalanced? How do you mi...
Q04JUNIOR
What is cache stampede and how do you prevent it?
Q05SENIOR
A production incident: your cache hit rate dropped from 95% to 40% after...
Q06SENIOR
Design a distributed cache for a social media feed that supports 10 mill...
Q01 of 06SENIOR

How does Redis Cluster handle a network partition? What happens to reads and writes during a split?

ANSWER
Redis Cluster uses a gossip protocol to detect partitions. If a majority of master nodes can't reach a minority, the minority stops accepting writes (to prevent split-brain). Reads are still served from replicas if configured. When the partition heals, the minority nodes sync from the majority. This means during a partition, some data may be unavailable for writes, but consistency is preserved.
FAQ · 4 QUESTIONS

Frequently Asked Questions

01
What is the difference between Redis and Memcached for distributed caching?
02
How do I choose between cache-aside and read-through caching?
03
How do I prevent cache stampede in Redis?
04
What happens to a distributed cache when a node fails?
N
Naren Founder & Principal Engineer

20+ years shipping large-scale distributed systems. Drawn from code that ran under real load.

Follow
Verified
production tested
June 25, 2026
last updated
1,663
articles · all by Naren
🔥

That's Components. Mark it forged?

4 min read · try the examples if you haven't

Previous
Cache Eviction Policies
20 / 23 · Components
Next
Multi-Level Caching