Multi-Level Caching: Stop Thundering Herds and Slash Latency in Production
Multi-level caching reduces latency and prevents cache stampedes.
20+ years shipping large-scale distributed systems. Lessons pulled from things that broke in production.
Multi-level caching places a small, fast cache (like an in-memory map) in front of a larger, slower cache (like Redis). This reduces load on the slow cache and database, and prevents thundering herds when a popular key expires.
Think of a busy coffee shop. The barista keeps the most popular drinks ready on a small rack (the L1 cache). If a customer orders something that isn't on the rack, the barista checks the fridge (the L2 cache), which is bigger but slower. If it's not there either, the barista brews it fresh (the database). This way, most orders are served instantly, and the fridge isn't slammed every time someone wants a latte.
You've got a Redis cluster that handles 100k ops/s. Then a viral tweet hits. Your cache miss rate spikes. Redis CPU goes to 100%. Your database connection pool is exhausted. The site goes down. The classic rookie mistake? A single-level cache. One layer to rule them all — and fail them all. Multi-level caching is the pattern that stops this madness. It's not just about speed; it's about survival under load. After this article, you'll be able to design a multi-level cache that handles traffic spikes without melting your backend, and you'll know exactly which consistency model to use so you don't serve stale data to paying customers.
Why One Cache Isn't Enough: The Thundering Herd Problem
A single cache layer (say Redis) works fine until a popular key expires and thousands of requests miss at the same moment. They all hit the database, which falls over. This is the thundering herd. Multi-level caching blunts it with a small, very fast L1 cache (in-process memory) in front of the shared cache. The L1 cache has a short TTL (seconds), and when an entry expires, a per-key mutex lets only one request per instance go to L2 or the DB to repopulate it, while the others block briefly (or get a stale value). The worst case is then one rebuild per application instance instead of one per request. Without this, your database dies. I've seen it happen to a payment service at 3am: the cache TTL was 1 hour, and when it expired, 500 concurrent requests hit the DB. The fix: add an L1 cache with a 5-second TTL and use a mutex to prevent concurrent rebuilds.
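Here is a minimal sketch of that fix: an in-process L1 with a short TTL and one rebuild lock per key. The class name SingleFlightL1 and the loader function (standing in for the L2 or DB read) are assumptions for illustration, not part of any library.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

/** Sketch: in-process L1 cache with a short TTL and per-key rebuild locks. */
final class SingleFlightL1<K, V> {
    private static final class Timed<T> {
        final T value;
        final long expiresAtNanos;
        Timed(T value, long expiresAtNanos) {
            this.value = value;
            this.expiresAtNanos = expiresAtNanos;
        }
    }

    private final ConcurrentHashMap<K, Timed<V>> entries = new ConcurrentHashMap<>();
    // One lock per key. This sketch never prunes the lock map.
    private final ConcurrentHashMap<K, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final long ttlNanos;
    private final Function<K, V> loader; // hypothetical: reads from L2 or the DB

    SingleFlightL1(long ttlMillis, Function<K, V> loader) {
        this.ttlNanos = ttlMillis * 1_000_000L;
        this.loader = loader;
    }

    private boolean isFresh(Timed<V> t) {
        return t != null && System.nanoTime() - t.expiresAtNanos < 0;
    }

    V get(K key) {
        Timed<V> t = entries.get(key);
        if (isFresh(t)) {
            return t.value; // fresh hit, no locking
        }
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock(); // only requests for the same key wait here
        try {
            t = entries.get(key); // re-check: another thread may have rebuilt it
            if (isFresh(t)) {
                return t.value;
            }
            V value = loader.apply(key); // one rebuild per key at a time
            entries.put(key, new Timed<>(value, System.nanoTime() + ttlNanos));
            return value;
        } finally {
            lock.unlock();
        }
    }
}
```

With a 5-second TTL, something like `new SingleFlightL1<String, String>(5_000, key -> loadFromL2OrDb(key))` (where `loadFromL2OrDb` is a placeholder you would supply) means 500 concurrent misses on one key cost one backend call per instance. Waiters block in this version; serving the stale value instead is a separate design choice.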
A single global mutex serializes every rebuild. Use per-key locks (for example, a ConcurrentHashMap<String, Lock>) to avoid this. Otherwise, a miss on key 'A' blocks key 'B' requests.
L1 Cache: In-Process vs. Distributed - Pick Your Poison
Your L1 cache lives in the same process as your application. That means it's fast (nanoseconds) but it's also per-instance. If you have 10 instances, each has its own copy. That's fine for read-heavy workloads where consistency isn't critical. But if you need strong consistency, you'll need to invalidate L1 across all instances — which is a distributed invalidation problem. The alternative is a distributed L1 (like Redis with read-replicas), but that adds latency. My rule of thumb: if your read-to-write ratio is >100:1 and you can tolerate seconds of staleness, use in-process L1. If writes are frequent or you need near-real-time consistency, skip L1 and go straight to L2 with a shorter TTL.
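To make the distributed invalidation problem concrete, here is a rough sketch that simulates several instances inside one JVM. InvalidationBus is a hypothetical stand-in for whatever pub/sub channel you would actually use, and sharedStore stands in for L2 or the database; both names are assumptions.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical stand-in for a pub/sub channel shared by all instances. */
final class InvalidationBus {
    private final List<Consumer<String>> subscribers = new CopyOnWriteArrayList<>();

    void subscribe(Consumer<String> onInvalidate) { subscribers.add(onInvalidate); }

    void publish(String key) { subscribers.forEach(s -> s.accept(key)); }
}

/** One application instance: a private L1 map in front of a shared store. */
final class CachedInstance {
    private final Map<String, String> l1 = new ConcurrentHashMap<>();
    private final Map<String, String> sharedStore; // stand-in for L2 or the DB
    private final InvalidationBus bus;

    CachedInstance(Map<String, String> sharedStore, InvalidationBus bus) {
        this.sharedStore = sharedStore;
        this.bus = bus;
        bus.subscribe(key -> l1.remove(key)); // drop the local copy on any write
    }

    String read(String key) {
        String value = l1.get(key);
        if (value == null) {
            value = sharedStore.get(key);
            if (value != null) {
                l1.put(key, value); // populate this instance's L1 only
            }
        }
        return value;
    }

    void write(String key, String value) {
        sharedStore.put(key, value); // update the shared source first
        bus.publish(key);            // then every instance evicts its copy
    }
}
```

A read that races with an invalidation can still re-cache the old value, so in practice this kind of broadcast is paired with a short L1 TTL as a backstop.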
Enable cache statistics (recordStats()) in production. Monitor hit rate, miss rate, and load time. If L1 hit rate is below 80%, your TTL is too short or your cache size is too small. If it's above 99%, you might be caching too aggressively, which raises the risk of serving stale data.
L2 Cache: Redis, Memcached, or Something Else?
Your L2 cache is the shared, durable cache. Redis is the default choice for most: it's fast, supports data structures, and has built-in replication. Memcached is simpler and more memory-efficient for pure key-value, but lacks persistence and replication. For most production systems, Redis wins because, with persistence enabled, you can recover from a restart without a cold cache. But don't use Redis as a primary database. I've seen teams store everything in Redis and then lose data on failover. Use Redis as a cache with a TTL, and always have a database behind it. Also, watch out for hot keys: a single key that gets hammered can saturate a Redis shard. Sharding by key doesn't fix this on its own, because one key still lives on one shard. Solution: spread reads for that key across replicas or several copies of the key, or use local L1 to absorb the heat.
Consistency Models: Write-Through vs. Write-Behind vs. Cache-Aside
You have three main patterns for keeping your multi-level cache consistent with the database. Cache-Aside: application reads from cache, on miss loads from DB and populates cache. Writes go to DB, then invalidate cache. This is the most common and simplest. Write-Through: every write goes through cache to DB — cache is always consistent but adds latency. Write-Behind: writes go to cache first, then asynchronously to DB — fast writes but risk data loss if cache crashes. For multi-level, I recommend Cache-Aside with L1 invalidation on writes. Write-Through is too slow for L1 (you'd block on every write). Write-Behind is acceptable only if you can tolerate losing the last few seconds of writes.
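As a rough illustration of Cache-Aside across two levels, the sketch below uses plain maps as stand-ins for Redis (l2) and the database (db). The class and field names are assumptions, and TTLs are left out to keep the read and write paths visible.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: Cache-Aside reads and writes across L1, L2, and a database. */
final class TwoLevelCacheAside {
    private final Map<String, String> l1 = new ConcurrentHashMap<>(); // in-process
    private final Map<String, String> l2; // stand-in for Redis
    private final Map<String, String> db; // stand-in for the database

    TwoLevelCacheAside(Map<String, String> l2, Map<String, String> db) {
        this.l2 = l2;
        this.db = db;
    }

    String read(String key) {
        String value = l1.get(key);
        if (value != null) {
            return value; // L1 hit
        }
        value = l2.get(key);
        if (value == null) {
            value = db.get(key); // miss on both levels: load from the DB
            if (value == null) {
                return null; // nothing to cache
            }
            l2.put(key, value); // populate L2 on the way back
        }
        l1.put(key, value); // populate L1 on the way back
        return value;
    }

    void write(String key, String value) {
        db.put(key, value); // 1. write the source of truth
        l2.remove(key);     // 2. invalidate the shared cache
        l1.remove(key);     // 3. invalidate this instance's copy
    }
}
```

Step 3 only clears the L1 of the instance that handled the write. Other instances keep their copies until their TTL runs out or an invalidation message reaches them.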
Handling Cache Stampedes with Probabilistic Early Expiration
Even with multi-level caching, a cache stampede can still happen if a key expires simultaneously across all L1 instances. The fix: add jitter to TTLs (for example, pass a randomized TTL to Redis SETEX) so entries don't all expire together, or use probabilistic early expiration, where each read may trigger a recompute slightly before expiry, with a probability that rises as expiry approaches. Better yet, implement 'recompute before expiry': when a key is in the last 10% of its TTL, allow one thread to refresh it while others get the stale value. This is called 'cache refresh-ahead'. Caffeine supports this natively with refreshAfterWrite. Use it.
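Both ideas fit in a few lines. The sketch below assumes the commonly described form of probabilistic early expiration, where a caller recomputes once now + recomputeCost * beta * -ln(u) reaches the expiry time, with u uniform in (0, 1]. The class name, method names, and parameters are illustrative assumptions.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Sketch: TTL jitter and probabilistic early expiration. */
final class ExpiryPolicies {
    private ExpiryPolicies() {}

    /** TTL spread by +/- jitterFraction; u is a uniform sample in [0, 1). */
    static long jitteredTtlMillis(long baseTtlMillis, double jitterFraction, double u) {
        double offset = (u * 2.0 - 1.0) * jitterFraction; // in [-f, +f)
        return Math.round(baseTtlMillis * (1.0 + offset));
    }

    static long jitteredTtlMillis(long baseTtlMillis, double jitterFraction) {
        return jitteredTtlMillis(baseTtlMillis, jitterFraction,
                ThreadLocalRandom.current().nextDouble());
    }

    /**
     * Returns true if this caller should recompute now, before the hard expiry.
     * -ln(u) is usually small, so early refreshes are rare far from expiry
     * and become likely as expiry approaches. beta > 1 refreshes earlier.
     */
    static boolean shouldRefreshEarly(long nowMillis, long expiresAtMillis,
                                      long recomputeCostMillis, double beta, double u) {
        double earlyByMillis = recomputeCostMillis * beta * -Math.log(u);
        return nowMillis + earlyByMillis >= expiresAtMillis;
    }

    static boolean shouldRefreshEarly(long nowMillis, long expiresAtMillis,
                                      long recomputeCostMillis, double beta) {
        double u = 1.0 - ThreadLocalRandom.current().nextDouble(); // in (0, 1]
        return shouldRefreshEarly(nowMillis, expiresAtMillis, recomputeCostMillis, beta, u);
    }
}
```

Jitter spreads expiry times across keys and instances. The early-refresh check spreads recomputation of a single key across time, so they complement each other.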
With refreshAfterWrite alone, the cache never expires; it just refreshes. That means stale data lives forever if the refresh fails. Always pair with expireAfterWrite as a safety net.
When Multi-Level Caching Is Overkill - And What to Use Instead
Multi-level caching adds complexity: you now have two caches to manage, invalidate, and monitor. For low-traffic services (<100 req/s), a single Redis cache with a reasonable TTL is fine. For read-heavy workloads with strict consistency, skip L1 and use Redis with read-replicas. For write-heavy workloads, caching might not help at all — consider a write-optimized database instead. My rule: if your L1 hit rate is below 50%, the overhead of maintaining L1 isn't worth it. Measure first, then optimize.
The 4GB Container That Kept Dying
- Never use a plain ConcurrentHashMap as a cache.
- Always set a maximum size and TTL, even in dev.
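If you can't pull in a caching library, the two rules above can still be met with the JDK alone. This is a minimal sketch, assuming a LinkedHashMap in access order for LRU eviction and an injectable clock for testing; BoundedTtlCache and its fields are illustrative names, and a real cache library will do this better.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Sketch: a size-bounded LRU cache with a per-entry TTL, JDK only. */
final class BoundedTtlCache<K, V> {
    private static final class Timed<T> {
        final T value;
        final long expiresAtMillis;
        Timed(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final long ttlMillis;
    private final LongSupplier clock; // e.g. System::currentTimeMillis
    private final LinkedHashMap<K, Timed<V>> map;

    BoundedTtlCache(int maxEntries, long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        // accessOrder=true keeps the least recently used entry first.
        this.map = new LinkedHashMap<K, Timed<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Timed<V>> eldest) {
                return size() > maxEntries; // evict once over the size cap
            }
        };
    }

    synchronized V get(K key) {
        Timed<V> t = map.get(key);
        if (t == null) {
            return null;
        }
        if (clock.getAsLong() >= t.expiresAtMillis) {
            map.remove(key); // expired: drop it so memory is reclaimed
            return null;
        }
        return t.value;
    }

    synchronized void put(K key, V value) {
        map.put(key, new Timed<>(value, clock.getAsLong() + ttlMillis));
    }

    synchronized int size() {
        return map.size();
    }
}
```

Unlike a plain ConcurrentHashMap, this can never hold more than maxEntries values, so heap usage has a ceiling. The single lock makes it a poor fit for very hot paths.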
1. ... --hotkeys option.
2. Add L1 cache for those keys.
3. Shard Redis or increase replicas.
4. Reduce TTL to spread load.
curl localhost:8080/actuator/metrics/cache.gets?tag=name:l1
jstat -gcutil <pid> 1000
Key takeaways
Interview Questions on This Topic
How does multi-level caching handle concurrent cache misses for the same key? What failure mode does it prevent?
Frequently Asked Questions