Thrashing in OS – A Java App's Cache That Tripped 80% RAM
A single scheduled job pushed cache to 80% RAM, triggering thrashing: CPU at 100%, iowait >80%, DB timeouts.
- Thrashing is a death spiral: the OS spends more time swapping pages than executing user code
- Root cause: combined working sets of all processes exceed physical RAM
- Detection: high iowait + high page faults + low throughput = thrashing
- Fix: reduce multiprogramming, add RAM, or enforce per-process memory limits
- Biggest mistake: treating high CPU as compute-bound when it's actually I/O wait
Imagine you're cooking five dishes at once in a tiny kitchen with only two burners. You keep moving pots on and off the stove so frantically that nothing actually cooks — you spend all your time shuffling pots, not cooking. That's thrashing: the OS is so busy swapping memory pages in and out of RAM that it never gets any real work done. The 'pots' are memory pages, the 'burners' are RAM slots, and 'cooking' is executing your actual program instructions.
Thrashing is one of those OS phenomena that sounds academic right up until it silently kills a production server at 3 AM. You'll see CPU usage pinned at 100%, but application throughput drops to near zero. Disk I/O goes through the roof. Users see timeouts. Engineers stare at dashboards wondering why a machine that 'should' handle the load is completely falling apart. The culprit is almost never the application logic — it's the memory subsystem in full meltdown mode.
What is Thrashing in OS?
Thrashing occurs when the virtual memory subsystem is in a constant state of paging. This happens when the sum of the 'Working Sets' of all active processes exceeds the available physical RAM. The Operating System attempts to maintain high CPU utilization by increasing the degree of multiprogramming; however, as more processes are added, the memory available to each decreases. Eventually, processes spend more time waiting for the pager to swap memory in and out of disk than they do executing instructions.
At this tipping point, CPU utilization collapses. The OS sees the idle CPU and mistakenly tries to start even more processes to 'fix' the low utilization, which accelerates the death spiral.
The core mechanism: page fault handling triggers disk I/O. Disk I/O is thousands of times slower than RAM access. When page faults happen too frequently, the CPU spends most of its time context-switching and waiting for I/O completions, rather than executing user code.
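To see why even a small fault rate is enough to wreck performance, it helps to put numbers on it. The sketch below computes the average cost of one memory access for a given page fault probability. The 100 ns RAM access and 8 ms fault service time are illustrative assumptions, not measurements, and the function names are made up for this example.

```python
def effective_access_time_ns(fault_rate, memory_ns=100.0, fault_service_ns=8_000_000.0):
    """Average cost of one memory access, given the probability that it page-faults.

    memory_ns and fault_service_ns are assumed, textbook-style figures.
    """
    if not 0.0 <= fault_rate <= 1.0:
        raise ValueError("fault_rate must be between 0 and 1")
    return (1.0 - fault_rate) * memory_ns + fault_rate * fault_service_ns


def slowdown(fault_rate, memory_ns=100.0, fault_service_ns=8_000_000.0):
    """How many times slower an average access is compared with a fault-free run."""
    return effective_access_time_ns(fault_rate, memory_ns, fault_service_ns) / memory_ns


if __name__ == "__main__":
    for rate in (0.0, 1e-6, 1e-4, 1e-3):
        print(f"fault rate {rate:g}: {slowdown(rate):.2f}x slower")
```

With these assumed figures, one fault per thousand accesses already makes the average access roughly 80 times slower, which is why throughput falls off a cliff rather than degrading gently.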
Run sar -B regularly: a sudden jump in page faults per second is your early warning.
The Death Spiral: Why CPU Utilization Collapses
Here's what happens step by step when thrashing takes hold:
- The OS runs out of free page frames.
- Every page fault now requires evicting a page to disk.
- The paging disk becomes a bottleneck. Disk queues fill up.
- CPU utilization drops because the CPU is waiting for I/O completions.
- The OS scheduler sees a low CPU utilization percentage.
- It assumes the CPU is underutilized and starts more processes.
- New processes allocate more memory, increasing the total working set.
- More page faults, more disk I/O, even less CPU for actual work.
- Throughput collapses to near zero. The system is effectively stalled: every process is still alive, but almost none of them makes progress.
This self-reinforcing cycle was first studied formally in the late 1960s, in Peter Denning's work on the working set model, but it still kills production servers today. The root cause is always a mismatch between the total memory demand and the physical memory available.
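The spiral described in the steps above can be reproduced with a deliberately simple toy model. This is a sketch under stated assumptions, not how any real scheduler works: all processes are identical, RAM is split evenly between them, the fault probability grows linearly once a process holds fewer frames than its working set, and a single paging disk serves one fault at a time. Every parameter value is hypothetical.

```python
def cpu_utilization(n_procs, total_frames=1000, working_set=200,
                    cpu_us=1.0, other_wait_us=4.0, fault_us=8000.0,
                    max_fault_prob=0.01):
    """Toy estimate of CPU utilization (0..1) with n_procs identical processes.

    Each memory reference costs cpu_us of CPU time plus other_wait_us of
    non-paging blocking, plus paging time when the process is short of frames.
    """
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    frames_each = total_frames / n_procs
    # Fraction of the working set that does not fit in this process's share of RAM.
    shortfall = max(0.0, 1.0 - frames_each / working_set)
    fault_prob = max_fault_prob * shortfall
    paging_us = fault_prob * fault_us          # average paging delay per reference
    per_ref_us = cpu_us + other_wait_us + paging_us
    # Upper bounds: the CPU itself, and the demand n_procs can generate.
    limits = [1.0, n_procs * cpu_us / per_ref_us]
    if paging_us > 0:
        # The single paging disk is saturated: it caps references per microsecond.
        limits.append(cpu_us / paging_us)
    return min(limits)


if __name__ == "__main__":
    for n in range(1, 11):
        print(f"{n:2d} processes -> {cpu_utilization(n):6.1%} CPU utilization")
```

With these assumed defaults, utilization climbs until five processes exactly fill memory, then drops sharply as soon as a sixth is admitted. That knee in the curve is the point where adding work reduces throughput.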
- Processes = chefs, each with a recipe (working set).
- RAM = the counter space where chefs can prep ingredients.
- Disk = the refrigerator — takes 100x longer to fetch ingredients.
- When too many chefs work at once, the counter overflows. Chefs keep running to the fridge (page faults).
- The stove (CPU) sits idle while chefs wait for ingredients. The head chef (OS) hires more chefs to 'fix' the idle stove — making it worse.
Detecting Thrashing in Production
In a production environment, you don't wait for a crash; you watch the metrics. The tell-tale sign of thrashing is high disk wait (iowait) coupled with a high major page fault rate. If your CPU 'Wait' metric spikes while application throughput (requests per second) flatlines, you are likely thrashing. 'Steal' time is a different signal: it reflects contention from the hypervisor, not paging.
- iowait (from top or /proc/stat): the percentage of time the CPU is idle while waiting for disk I/O. Above 20% is a red flag.
- Page faults per second (sar -B): fault/s counts all faults (minor plus major) and majflt/s counts major faults only. Major faults cause disk reads.
- Swap in/out rates (vmstat columns si/so): any non-zero value means active paging.
- Memory pressure (/proc/meminfo): if Active(anon) + Inactive(anon) is near total RAM, you're at the edge.
- Application throughput (RPS): a sudden drop while CPU stays high is a classic thrashing signature.
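The first two signals in the list can be combined into a simple check. The sketch below takes two snapshots of /proc/stat and /proc/vmstat as text and reports whether both iowait and the major fault rate crossed a threshold over the interval. The default thresholds mirror this article's alerting suggestion (iowait above 15%, more than 100 major faults per second); the function names are hypothetical, and the parsing assumes the standard Linux layout of the aggregate cpu line and the pgmajfault counter.

```python
def parse_cpu_times(proc_stat_text):
    """Return (iowait_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # user, nice, system, idle, iowait, irq, softirq, steal.
            # Guest time is skipped because it is already counted in user.
            values = [int(v) for v in fields[1:9]]
            if len(values) < 5:
                raise ValueError("cpu line has no iowait field")
            return values[4], sum(values)
    raise ValueError("no aggregate 'cpu' line found")


def parse_major_faults(proc_vmstat_text):
    """Return the cumulative major page fault counter from /proc/vmstat."""
    for line in proc_vmstat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "pgmajfault":
            return int(value)
    raise ValueError("pgmajfault not found")


def looks_like_thrashing(stat_before, stat_after, vmstat_before, vmstat_after,
                         interval_s, max_iowait_pct=15.0, max_majflt_per_s=100.0):
    """Compare two snapshots; return (is_thrashing, iowait_pct, majflt_per_s)."""
    io0, total0 = parse_cpu_times(stat_before)
    io1, total1 = parse_cpu_times(stat_after)
    elapsed_ticks = total1 - total0
    iowait_pct = 100.0 * (io1 - io0) / elapsed_ticks if elapsed_ticks > 0 else 0.0
    faults = parse_major_faults(vmstat_after) - parse_major_faults(vmstat_before)
    majflt_per_s = faults / interval_s
    is_thrashing = iowait_pct > max_iowait_pct and majflt_per_s > max_majflt_per_s
    return is_thrashing, iowait_pct, majflt_per_s
```

To use it on a live host, read both files, sleep for the sampling interval, read them again, and pass the four strings plus the interval in seconds. Requiring both conditions is a design choice: high iowait alone can come from ordinary disk-heavy work, and a burst of major faults alone can be a cold start.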
Alert when major_faults > 100 per second and iowait > 15%, averaged over 5 minutes.
Non-zero si/so in vmstat is a warning. Zero swap doesn't rule out thrashing: the system may be evicting page cache instead.
Prevention: The Working Set and Locality Principle
To prevent thrashing, the OS relies on the Locality Principle. Temporal locality suggests that if a memory location is referenced, it will likely be referenced again soon. Spatial locality suggests that nearby memory locations will be referenced soon. Thrashing happens when a process's execution pattern lacks locality, forcing the OS to jump all over the disk.
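The effect of locality is easy to demonstrate with a small simulation. The sketch below counts page faults under LRU replacement for two access patterns of the same length: one that works through small groups of pages in phases, and one that jumps around uniformly at random. LRU is used here only as a convenient stand-in for a real kernel's replacement policy, and the trace shapes and sizes are assumptions chosen for illustration.

```python
import random
from collections import OrderedDict


def count_lru_faults(references, frames):
    """Count page faults for a reference string under LRU with `frames` page frames."""
    resident = OrderedDict()  # keys are resident pages, ordered oldest to newest use
    faults = 0
    for page in references:
        if page in resident:
            resident.move_to_end(page)        # mark as most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict the least recently used page
            resident[page] = True
    return faults


def phased_trace(phases=10, pages_per_phase=4, repeats=100):
    """Good locality: each phase loops over its own small set of pages."""
    return [phase * pages_per_phase + offset
            for phase in range(phases)
            for _ in range(repeats)
            for offset in range(pages_per_phase)]


def scattered_trace(length=4000, distinct_pages=40, seed=0):
    """Poor locality: every reference picks a page uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(distinct_pages) for _ in range(length)]


if __name__ == "__main__":
    print("phased:   ", count_lru_faults(phased_trace(), frames=8))
    print("scattered:", count_lru_faults(scattered_trace(), frames=8))
```

Both traces touch the same 40 pages with the same 8 frames. The phased trace faults only when it first touches each page, while the scattered trace faults on most references.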
- Working Set Model: Track each process's active page set. If the total working set exceeds RAM, block new processes or suspend one.
- Page Fault Frequency (PFF) control: Set a threshold for acceptable page fault rate. If a process exceeds it, allocate more frames (if available) or swap it out.
- Memory cgroups: In Linux, use memory.max to cap per-process memory. In Docker, use --memory and --memory-swap.
- Swappiness tuning: Set vm.swappiness=1 to discourage swapping unless absolutely necessary.
- Avoid memory overcommit: Overcommitting RAM makes thrashing more likely under pressure.
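The first two items in the list, the working set model and page fault frequency control, can be sketched in a few lines. This is a simplified illustration, not a kernel implementation: the working set is approximated as the distinct pages in the last `window` references, the policy of suspending the largest process first is an assumption, and the PFF thresholds and step size are hypothetical values.

```python
def working_set_size(references, window):
    """Number of distinct pages among the most recent `window` references."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return len(set(references[-window:]))


def admit_or_suspend(traces, total_frames, window):
    """Suspend processes until the combined working sets fit in RAM.

    traces maps a process name to its recent page reference list.
    Returns (active, suspended): active maps names to working set sizes.
    Assumed policy: suspend the process with the largest working set first.
    """
    active = {name: working_set_size(refs, window) for name, refs in traces.items()}
    suspended = []
    while active and sum(active.values()) > total_frames:
        victim = max(active, key=active.get)
        suspended.append(victim)
        del active[victim]
    return active, suspended


def pff_adjust(frames, faults_per_s, free_frames, low=5.0, high=50.0, step=8):
    """Page fault frequency control: return the new frame allocation for a process.

    Above `high`, grant up to `step` free frames. If nothing is free the
    allocation is unchanged, and the caller should consider swapping the process out.
    Below `low`, reclaim `step` frames. Thresholds are hypothetical.
    """
    if faults_per_s > high:
        return frames + min(step, free_frames)
    if faults_per_s < low:
        return max(1, frames - step)
    return frames
```

For example, with `total_frames=5` and `window=5`, three processes whose working sets are 3, 5, and 1 pages cannot all run; suspending the 5-page process leaves the other two comfortably within RAM.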
Run stress-ng --vm 1 --vm-bytes 90% in staging to verify your limits work before thrashing hits production.
Effective Mitigation When Thrashing Starts
When you confirm thrashing in production, you need immediate action and then a structural fix.
Immediate (buy time)
- Kill the largest memory consumer: ps aux --sort=-%mem | head -5 then kill -9 <pid>.
- Drop page caches: echo 3 > /proc/sys/vm/drop_caches (only if you have clean file cache to reclaim).
- Reduce swappiness: sysctl vm.swappiness=1 (may not help immediately if already swapping).
Medium-term (stabilize)
- Temporarily stop non-critical services to reduce the degree of multiprogramming.
- Add more RAM if hardware allows. In the cloud, move to a memory-optimized instance type.
- Adjust JVM heap sizes: reduce -Xmx to keep the total working set below physical RAM.
Long-term (prevent recurrence)
- Implement memory cgroups for all processes. In Kubernetes, set resource limits on every container.
- Use page fault frequency as a scaling metric for batch jobs.
- Review data structures for locality: choose between array-of-structs and struct-of-arrays layouts based on how the data is accessed, and pack hot fields together.
- Test under memory pressure: use stress-ng in staging to validate your memory limits.
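For the JVM heap adjustment, a quick budget calculation shows how large -Xmx can be before the combined footprint exceeds physical RAM. The sketch below is back-of-the-envelope arithmetic under loud assumptions: every JVM uses its full heap, non-heap memory (metaspace, thread stacks, direct buffers) is modeled as a flat 25% on top of the heap, and 1 GiB is reserved for the OS. Measure your own overhead before relying on numbers like these; the function names are hypothetical.

```python
def max_heap_per_jvm_mib(physical_ram_mib, jvm_count,
                         os_reserve_mib=1024, non_heap_overhead=0.25):
    """Largest -Xmx (in MiB) per JVM so that all JVMs plus the OS reserve fit in RAM.

    Assumes each JVM's footprint is heap * (1 + non_heap_overhead).
    """
    if jvm_count < 1:
        raise ValueError("jvm_count must be at least 1")
    usable_mib = physical_ram_mib - os_reserve_mib
    if usable_mib <= 0:
        raise ValueError("OS reserve leaves no memory for the JVMs")
    return int(usable_mib / (jvm_count * (1.0 + non_heap_overhead)))


def fits_in_ram(heap_sizes_mib, physical_ram_mib,
                os_reserve_mib=1024, non_heap_overhead=0.25):
    """True if the given -Xmx values, plus assumed overhead, fit in physical RAM."""
    footprint = sum(heap * (1.0 + non_heap_overhead) for heap in heap_sizes_mib)
    return footprint + os_reserve_mib <= physical_ram_mib


if __name__ == "__main__":
    heap = max_heap_per_jvm_mib(16384, jvm_count=3)
    print(f"Suggested upper bound: -Xmx{heap}m per JVM")
```

On a 16 GiB host running three JVMs, these assumptions give an upper bound of 4096 MiB per heap. Treat it as a ceiling to stay under, not a target.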
The 3 AM Pager: A Java App That Collapsed Under Memory Pressure
- Thrashing can be triggered by a single process expanding its working set unexpectedly.
- Always cap per-process memory limits in production — JVM flags alone aren't enough without a cgroup boundary.
- CPU at 100% does not mean the CPU is computing. Check iowait and page fault rates first.
Key takeaways
Common mistakes to avoid
Misinterpreting high CPU usage as heavy computation
Trying to solve thrashing by adding more processes or threads
Ignoring locality of reference in data structures
Relying on swap space as a cheap alternative to RAM
Interview Questions on This Topic
Explain the relationship between the 'Degree of Multiprogramming' and CPU utilization. At what point does the curve drop?
Frequently Asked Questions