Cache Stampede Prevention — Mutex vs Background Refresh
One expired key triggered 20,000 DB queries during a flash sale.
20+ years shipping production code across the stack, with years spent interviewing engineers. Everything here is grounded in real deployments.
- Caching stores frequently accessed data in fast memory (RAM) to reduce database load
- Core components: storage (hash map), eviction policy (LRU), consistency strategy (TTL)
- Distributed caching adds consistent hashing for sharding across nodes
- Performance: in-memory read ~1ms vs database ~10-50ms — a 10-50x improvement
- Production threat: a cache stampede can collapse your DB when a hot key expires under high concurrency
- Biggest mistake: treating cache as a database — no TTL leads to stale data and OOM crashes
Imagine you're a chef who gets asked for the same recipes dozens of times a day. Instead of flipping through your giant cookbook every single time, you write the ten most-requested recipes on a sticky note and pin it to the fridge. That sticky note is your cache — fast to read, always nearby, but limited in space. When someone asks for a new recipe that's not on the note, you look it up in the big book and decide which sticky note to replace. That's literally how a caching system works.
Every millisecond matters at scale. When Netflix serves 200 million subscribers or Twitter surfaces a trending tweet, a database query that takes 50ms repeated ten thousand times per second will buckle your infrastructure and light your AWS bill on fire. Caching is the single highest-leverage tool in a backend engineer's kit, yet most engineers treat it as an afterthought — a Redis call dropped in after the database is already struggling. The engineers who design caches thoughtfully are the ones who build systems that survive virality.
The core problem caching solves is the impedance mismatch between how fast your application needs data and how fast your storage layer can produce it. Disk-based databases are optimised for durability and complex queries, not for microsecond reads of the same user profile record fifty thousand times per minute. A well-placed cache absorbs that repeated read pressure, serves data from memory at nanosecond speed, and lets your database do what it's actually good at — handling writes and complex aggregations.
By the end of this article you'll be able to walk into a system design interview and explain cache placement strategies, eviction policies and their trade-offs, cache invalidation approaches and their consistency guarantees, how distributed caches like Redis Cluster work internally, and the production failure modes (stampedes, poisoned caches, thundering herds) that separate senior engineers from mid-levels. You'll have working Java code you can reason about, and you'll know exactly what the interviewer is fishing for when they ask 'how would you cache this?'
What a Caching System Interview Actually Tests
A caching system interview evaluates your ability to design a layer that stores frequently accessed data in fast memory (e.g., Redis, Memcached) to reduce latency and load on a primary datastore. The core mechanic is a trade-off: you accept eventual consistency for O(1) reads instead of O(n) database queries. The interviewer wants to see you reason about hit ratios, TTLs, eviction policies (LRU, LFU), and failure modes like thundering herds or cache stampedes.
In practice, you must decide between write-through, write-around, and write-back strategies. Write-through ensures consistency but adds write latency; write-back improves write throughput at the risk of data loss. Key properties that matter are cache invalidation granularity (do you evict a single key or a whole partition?) and the staleness window — how long can a stale value be served before the next refresh? A common benchmark: a 95% hit ratio reduces read latency from 50ms to under 5ms, but a stampede can spike database connections to 10x normal.
Use caching when reads dominate writes and the data changes infrequently — think user sessions, product catalogs, or configuration. It's critical in systems where a single database query costs 100ms+ and the cache can serve in under 1ms. Avoid caching for rapidly mutating data (e.g., real-time stock ticks) unless you accept bounded staleness. In production, a poorly designed cache can cause cascading failures: a single key expiry triggers thousands of concurrent recomputations, overwhelming the database.
Anatomy of a Distributed Cache: More Than Just a Key-Value Store
In an interview, you're not just 'using Redis'; you're designing a high-availability, low-latency data layer. A production-grade caching system consists of three pillars: Storage (usually in-memory Hash Maps or B-Trees), an Eviction Policy to handle capacity limits, and a Consistency Strategy to ensure the cache doesn't serve stale data while the database has moved on.
When we talk about distributed caching, we introduce a fourth pillar: Partitioning (Sharding). Since a single node can't hold all the data or handle all the traffic, we use Consistent Hashing to distribute keys across multiple nodes. This minimizes data movement when a node is added or removed — a critical detail that distinguishes a Senior candidate from a Junior one.
Write Strategies: Balancing Speed and Safety
How you update the cache determines your system's consistency.
- Write-Through: Data is written to the cache and the database simultaneously. High consistency, but adds latency to writes.
- Write-Around: Data is written only to the database. The cache is only updated on a 'miss'. This prevents the cache from being flooded with data that is rarely read.
- Write-Back (Write-Behind): Data is written to the cache first, and the database is updated asynchronously. This provides the highest performance but risks data loss if the cache node crashes before the DB is updated.
Cache Invalidation: The Hardest Problem in Computer Science
There are only two hard things in computer science: cache invalidation and naming things. Invalidation ensures that stale data is removed or updated when the underlying source changes. Common approaches:
- TTL-based: Set an expiry time. Simple but can serve stale data within the TTL window.
- Event-based invalidation: Publish a cache invalidation event when the DB is updated (e.g., Redis Pub/Sub, Kafka). Near real-time consistency.
- Write-through (already covered): Synchronously update cache on write — strong consistency but higher write latency.
- Read-repair: In distributed caches, if a stale value is read, the reader triggers an update. Used in Dynamo-style systems.
In practice, most systems use a combination: TTL as a safety net, event-based for critical data.
Distributed Cache Topologies: Replication vs Sharding
When a single node can't handle traffic or memory, you need multiple nodes. Two primary patterns:
Replication: Data is copied to multiple nodes (e.g., Redis Sentinel). High read throughput, but write amplification. Writes go to primary, then async to replicas — eventual consistency.
Sharding: Data is partitioned across nodes (e.g., Redis Cluster, Memcached with client-side hashing). Each node owns a subset of keys. Requires consistent hashing to minimize rehashing.
Trade-off: Replication is simpler but wastes memory (each node stores all data). Sharding is memory-efficient but adds complexity for cross-node operations (e.g., multi-key commands fail if keys are on different nodes).
In interviews, start with replication for read-heavy workloads, then scale to sharding when memory of a single node becomes the bottleneck.
Cache Failure Modes: What Breaks Your Cache in Production
Three classic failure modes every senior engineer must explain:
Cache Stampede (Thundering Herd): A hot key expires, thousands of requests try to repopulate it simultaneously, overloading the database. Fix: Use a mutex/lease to allow only one thread to rebuild; others wait. Or use probabilistic early expiration (like Twitter's 'cache roulette').
Cache Penetration: Requests for non-existent keys (e.g., user IDs that don't exist) bypass cache and hit the DB every time. If these are common (e.g., DDOS attacks), the DB gets hammered. Fix: Cache 'null' results with a short TTL (e.g., 1 minute) or use a Bloom filter upfront.
Cache Avalanche: A large number of keys expire at the same time (e.g., all users cached at midnight with same TTL). DB gets a sudden load spike. Fix: Add random jitter to TTLs so they expire at different times. Or use a secondary local cache to absorb the initial misses.
Cache Eviction: Why Your "Hot" Data Stays Cold
You've got a cache. It's full. Something has to go. The eviction policy you choose decides which data survives and which gets nuked. This isn't academic — it's the difference between a snappy API and a thundering herd of cache misses that crush your database.
LRU (Least Recently Used) is the default for most production caches. It kicks out the item nobody touched longest. Works great for workloads with temporal locality — think news feeds or session data. But watch out: one-off batch jobs scanning your cache can poison LRU by touching old entries, making them look "recent" while your real hot data gets evicted.
LFU (Least Frequently Used) tracks access frequency instead of recency. Better for content that's popular consistently but accessed infrequently — like a config file read once per hour. The downside? LFU burns memory tracking counters for every key. Implement it with a min-heap or approximate counts (count-min sketch) to keep overhead sane.
Don't overthink this. Start with LRU. Set a TTL. Monitor your miss ratio. If you see hot keys cycling out, switch to LFU or add a small second-level cache. Your cache isn't smart — you have to feed it the right strategy.
Types of Cache: Where You Park Your Hot Data Matters
Every cache lives somewhere. Pick the wrong spot and you're adding latency instead of removing it. Here's the real breakdown.
Application Server Cache lives in-process — right inside your Java heap or Python dict. Fastest possible read (microseconds) because there's no network hop. Problem? You scale horizontally and now each node has its own cache. A request hits node A, caches a user profile. Same user hits node B — cache miss. You burn database queries warming up every new server.
Distributed Cache like Redis or Memcached sits outside your app servers. All nodes share one logical cache. Slower reads (milliseconds) but consistent hit rates. The trade-off: network calls add latency, and your cache becomes a single point of failure unless you shard or replicate. Every production system I've inherited had a Redis cluster that went down at 3 AM — plan for it.
CDN / Edge Cache is for static or semi-static content — images, CSS, API responses with long TTLs. Your users are global but your database is in us-east-1. Edge nodes cache at 200+ locations worldwide. User in Tokyo gets the response from Tokyo edge node — no cross-Pacific round trip. Use this for anything that doesn't change every second.
Global Cache is a hybrid — one logical cache but physically distributed. Usually a write-through to a central node with read replicas close to users. Complex to set up, but necessary for systems like Google or Netflix where consistency and speed both matter.
Rule of thumb: start with application cache for local hot data. Add distributed cache when you need consistency across servers. Add CDN when your users are global and your data is mostly read.
Black Friday Cache Stampede: How a Single Expired Key Took Down the Checkout DB
- Always protect cache-miss paths with a mutex or lease to prevent stampedes.
- Hot keys should use a shorter TTL with background refresh, not a single long TTL.
- Monitor cache hit rate as a leading indicator of DB load.
redis-cli -h <host> INFO stats | grep -E 'keyspace_(hits|misses)'redis-cli --bigkeys — find keys consuming most spaceKey takeaways
Common mistakes to avoid
6 patternsNot setting a TTL
EXPIRE in Redis or @Cacheable TTL configurations.Ignoring Cache Penetration
Hardcoding cache logic in Business Services
Over-caching small objects
Using `key % N` for distributed sharding
Not setting maxmemory on Redis
maxmemory (e.g., 80% of RAM) and maxmemory-policy (default: noeviction — change to allkeys-lru).Interview Questions on This Topic
Design a Least Frequently Used (LFU) cache with O(1) get and put operations. (Requires a nested doubly linked list or a frequency map + linked list.)
Frequently Asked Questions
20+ years shipping production code across the stack, with years spent interviewing engineers. Everything here is grounded in real deployments.
That's System Design Interview. Mark it forged?
7 min read · try the examples if you haven't