
JVM Memory Issues in Production: Debugging Guide (OOM, GC, Leaks)

πŸ“ Part of: Advanced Java β†’ Topic 13 of 28
JVM memory issues debugging guide with real production examples.
βš™οΈ Intermediate β€” basic Java knowledge assumed
In this tutorial, you'll learn:
  • Five OOM types, five different causes, five different diagnostics. Match the error message to the correct tool before debugging.
  • Two heap dumps taken 30-60 minutes apart reveal leaks that a single dump cannot. Compare histograms to find the fastest-growing object type.
  • GC tuning is about trade-offs: throughput vs latency vs memory. Batch = Parallel GC. Real-time = ZGC. Everything else = G1GC.
✦ Plain-English analogy ✦ Real code with output ✦ Interview questions
⚡ Quick Answer
  • OOM errors have 5 types: Heap, Metaspace, Direct Memory, Stack Overflow, and GC Overhead — each has a different root cause and fix.
  • Always capture heap dumps on OOM with -XX:+HeapDumpOnOutOfMemoryError — without it, you are guessing.
  • GC pauses above 200ms in latency-sensitive services indicate tuning problems — switch collectors or adjust generation ratios.
  • Memory leaks show as a sawtooth pattern that never returns to baseline after GC — analyze the dominator tree in a heap dump to find the root object.
  • Metaspace OOM usually means classloader leaks in hot-redeploy environments — not insufficient Metaspace size.
  • Production rule: set -Xmx to 70-75% of container memory — the remaining 25-30% covers off-heap, thread stacks, and OS page cache.
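The 70-75% sizing rule is just arithmetic, but it is easy to get backwards at 2 AM. A minimal sketch of both directions of the rule (the class and method names are mine, not a standard API):

```java
// Sketch of the container-memory rule from the bullet above.
// Assumption: the names and the exact 70% factor are illustrative.
public class HeapSizing {

    // Recommended -Xmx (MB) for a given container limit: 70% of the limit,
    // leaving ~30% for metaspace, thread stacks, direct buffers, page cache.
    static long maxHeapMb(long containerLimitMb) {
        return (long) (containerLimitMb * 0.70);
    }

    // The inverse check used later in the cheat sheet: the container limit
    // should be roughly 1.43x of -Xmx (1 / 0.70 is about 1.43).
    static long containerLimitMb(long xmxMb) {
        return (long) Math.ceil(xmxMb * 1.43);
    }

    public static void main(String[] args) {
        System.out.println("4096MB container -> -Xmx" + maxHeapMb(4096) + "m");
        System.out.println("-Xmx4g -> container >= " + containerLimitMb(4096) + "MB");
    }
}
```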
🚨 START HERE
Quick Debug Cheat Sheet — Start Here When It Is 2 AM
You are on-call. The service is down. Use this to triage in under 60 seconds before diving deeper.
🔴 Pod killed (exit code 137) — no JVM error in logs
Immediate Action: Container OOM killer — total memory exceeded the container limit
Commands
kubectl describe pod <pod> | grep -A5 "Last State"
kubectl top pod <pod> --containers
Fix Now: Increase the container memory limit to 1.43x your -Xmx, or reduce -Xmx to 70% of the current container limit
🔴 java.lang.OutOfMemoryError: Java heap space
Immediate Action: Heap is full — find what is consuming memory
Commands
jmap -histo:live <pid> | head -30
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
Fix Now: Analyze heap.hprof in Eclipse MAT — check the dominator tree for leak suspects
🟠 Response times spiking — service is slow but not crashed
Immediate Action: GC pauses are likely killing latency
Commands
jstat -gcutil <pid> 1000
tail -100 /var/log/jvm/gc.log
Fix Now: If the Full GC count is rising, you have old gen pressure — a leak or an undersized heap
🟠 CPU at 100% — service is thrashing
Immediate Action: Excessive GC, or an allocation rate exceeding GC throughput
Commands
jstat -gcutil <pid> 1000
top -Hp <pid>
Fix Now: If GC time exceeds 5% of wall time, GC is the bottleneck — fix the leak or increase the heap
🔴 java.lang.OutOfMemoryError: Metaspace
Immediate Action: Classloader leak — usually from hot-redeploy
Commands
jcmd <pid> VM.classloader_stats
jcmd <pid> GC.class_stats
Fix Now: Restart the JVM on redeploy. Check for static references to dynamic classloaders
🔴 java.lang.OutOfMemoryError: Direct buffer memory
Immediate Action: NIO/Netty buffer allocation failed
Commands
jcmd <pid> VM.native_memory summary
Check the -XX:MaxDirectMemorySize setting
Fix Now: Enable Netty's ResourceLeakDetector at PARANOID level. Increase MaxDirectMemorySize or fix the buffer leak
🟡 java.lang.StackOverflowError
Immediate Action: Infinite recursion, or a call stack that is too deep
Commands
jstack <pid> | grep -A 50 "REPEATING_METHOD"
Check the stack trace for repeating method signatures
Fix Now: Convert recursion to iteration, or increase -Xss (costs more memory per thread)
🔴 Service runs for hours then OOMs — slow leak
Immediate Action: Memory leak — objects accumulating over time
Commands
jstat -gcutil <pid> 60000 (monitor every minute for 30 min)
jmap -histo:live <pid> > /tmp/histo1.txt (repeat after 1 hour)
Fix Now: Diff the two histograms — the object type with the fastest-growing count is the leak source
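The histogram diff above can be scripted. A hypothetical sketch that parses two jmap -histo snapshots and ranks classes by instance-count growth (the sample input is made up; the column layout matches jmap's "num: #instances #bytes classname" format):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the "diff two histograms" step: parse two jmap -histo snapshots
// and report per-class instance-count growth. Sample data is illustrative;
// real input is the full command output saved to a file.
public class HistoDiff {

    // Parses lines like "   1:     50000    4000000  java.util.HashMap$Node"
    static Map<String, Long> parse(String histo) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : histo.split("\n")) {
            String[] f = line.trim().split("\\s+");
            if (f.length == 4 && f[0].endsWith(":")) {
                counts.put(f[3], Long.parseLong(f[1]));
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String before = "   1:   50000   4000000  java.util.HashMap$Node\n"
                      + "   2:   10000    240000  java.lang.String";
        String after  = "   1:  900000  72000000  java.util.HashMap$Node\n"
                      + "   2:   10500    252000  java.lang.String";
        Map<String, Long> t0 = parse(before), t1 = parse(after);
        // The class with the largest positive delta is the leak suspect.
        t1.forEach((cls, n) ->
            System.out.printf("%-28s +%d instances%n", cls, n - t0.getOrDefault(cls, 0L)));
    }
}
```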
Production Incident
The Slow Leak That Killed Black Friday: HashMap Growth Under Concurrent Load
An e-commerce checkout service leaked 200MB/hour during peak traffic. The team misdiagnosed it as insufficient heap and doubled -Xmx from 4GB to 8GB. The service ran for 6 more hours before OOMing at the worst possible time — during the Black Friday peak.
Symptom: The checkout service crashed with an OOM at 8:47 PM on Black Friday. The heap dump showed 7.8GB of the 8GB heap consumed by a single ConcurrentHashMap inside io.thecodeforge.service.CheckoutSessionManager. The map had 14.2 million entries. The normal baseline was 50,000 entries.
Assumption: The team initially assumed the heap was simply too small for Black Friday traffic. They doubled -Xmx from 4GB to 8GB and redeployed. The service ran for 6 hours before crashing again. The second heap dump showed the same pattern — CheckoutSessionManager holding 14+ million entries.
Root cause: CheckoutSessionManager stored session objects in a ConcurrentHashMap keyed by user_id. A session cleanup thread was supposed to evict expired sessions every 60 seconds. Under high load, the cleanup thread was starved — it ran on a thread pool shared with the request handlers. During peak traffic, the request threads consumed all CPU and the cleanup thread never got scheduled. Sessions accumulated indefinitely. Each session object held references to the full cart, user profile, and payment token — approximately 500 bytes per entry. At 14.2 million entries, that was 7.1GB.
Fix: Replaced the cleanup-thread pattern with a Caffeine cache using expireAfterAccess(30, TimeUnit.MINUTES). Caffeine handles eviction internally without a separate thread. Set maximumSize(500_000) as a hard cap. Added a monitoring alert for when the session count exceeds 100,000. The fix reduced steady-state memory from 4GB to 800MB and eliminated the leak entirely.
Key Lesson
  • Doubling the heap without understanding the leak just delays the crash and makes the heap dump twice as large to analyze. Find the leak first, then right-size the heap.
  • Never use a plain Map for session storage with manual cleanup. Use a cache library (Caffeine, Guava) with built-in TTL eviction.
  • Background cleanup threads on shared thread pools get starved under load. If eviction is critical, give the cleanup thread a dedicated pool or use a library that does not need one.
  • Monitor object counts, not just heap usage. A service using 60% of the heap with 14 million Map entries is in worse shape than one using 80% with 50,000 entries.
  • Set a hard maximumSize on any unbounded collection that receives data from external sources. Unbounded growth is the root cause of most production OOMs.
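The incident fix used Caffeine. As a JDK-only illustration of the same two ideas (a hard size cap plus least-recently-accessed eviction), here is a hypothetical sketch built on LinkedHashMap. It is not thread-safe and has no true TTL, which is exactly why a library like Caffeine is the better production choice:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// JDK-only sketch of the two ideas behind the fix: a hard size cap and
// access-order (LRU) eviction. Illustrative only; the real fix used Caffeine,
// which adds true TTL eviction and thread safety.
public class BoundedSessionStore<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    public BoundedSessionStore(int maxEntries) {
        super(16, 0.75f, true);      // accessOrder=true: iteration order is LRU
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;  // evict the least-recently-accessed entry
    }

    public static void main(String[] args) {
        BoundedSessionStore<String, String> sessions = new BoundedSessionStore<>(2);
        sessions.put("u1", "cart1");
        sessions.put("u2", "cart2");
        sessions.get("u1");          // touch u1 so u2 becomes the eldest
        sessions.put("u3", "cart3"); // evicts u2; size never exceeds the cap
        System.out.println(sessions.keySet()); // prints [u1, u3]
    }
}
```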
Production Debug Guide
Symptom-to-action guide for the memory issues you will actually encounter at 2 AM
java.lang.OutOfMemoryError: Java heap space — service crashes with OOM
→ Check if -XX:+HeapDumpOnOutOfMemoryError was set. If yes, analyze the heap dump with Eclipse MAT or jhat. Look at the dominator tree — the top object consuming memory is usually the leak source. If no heap dump was captured, add the flag immediately and wait for the next occurrence. In the short term, check jstat -gcutil to see if old gen is at 100% and not collecting.
java.lang.OutOfMemoryError: Metaspace — service crashes after multiple redeployments
→ Metaspace stores class metadata. A leak here means classloaders are not being garbage collected. Common in application servers (Tomcat, JBoss) with hot-redeploy. Check if your deployment pipeline redeploys without restarting the JVM. Fix: restart the JVM on redeploy, or investigate why old classloaders are still referenced. Increase -XX:MaxMetaspaceSize only as a temporary mitigation.
java.lang.OutOfMemoryError: Direct buffer memory — NIO or Netty service crashes
→ Direct memory is allocated outside the heap via ByteBuffer.allocateDirect(). The JVM tracks this separately. Check -XX:MaxDirectMemorySize (it defaults to the -Xmx value). Common cause: Netty ByteBuf not released, or NIO channels not closed. Use Native Memory Tracking (NMT) with -XX:NativeMemoryTracking=detail to profile direct memory allocation. In Netty, enable ResourceLeakDetector at PARANOID level temporarily.
java.lang.StackOverflowError — thread crashes with deep recursion
→ Each thread has a fixed stack size set by -Xss (default 512KB-1MB depending on OS). The error means the call stack exceeded this size. Common cause: infinite recursion, or very deep recursive algorithms. Check the stack trace for repeating method signatures. Fix: convert recursion to iteration, or increase -Xss (costs more memory per thread — 1000 threads × 2MB = 2GB extra).
GC overhead limit exceeded — service becomes unresponsive, eventually OOMs
→ The JVM spent more than 98% of recent time doing GC and recovered less than 2% of the heap. This means GC cannot free enough memory. The root cause is almost always a memory leak — objects are still referenced and cannot be collected. Analyze a heap dump for leak suspects. Temporary mitigation: increase -Xmx, but this only buys time. The fix is finding and eliminating the leak.
Service response time degrades gradually over hours, eventually OOMs — no single leak object visible
→ This is a generational leak pattern. Objects promoted to old gen are never collected, but they are not a single large object — they are thousands of small objects from different code paths. Use jmap -histo:live periodically to track object count growth. Compare histograms over time. The object type with the fastest-growing count is the leak source. Check for unbounded caches, connection pools without limits, or thread-local variables not cleaned up.
Container killed by the OOM killer (exit code 137) — no JVM OOM error logged
→ The OS killed the process because total memory (heap + off-heap + native) exceeded the container memory limit. The JVM did not OOM — the container did. Check if -Xmx is set to more than 75% of container memory. The remaining 25% covers thread stacks, metaspace, direct memory, JNI native memory, and OS page cache. Use Native Memory Tracking to profile total JVM memory usage. Adjust the container limit or reduce -Xmx.
GC pause times exceed 1 second — service has high tail latency
→ Full GC is pausing all application threads. Check which collector is active (-XX:+PrintCommandLineFlags). If using Serial or Parallel GC, switch to G1GC (JDK 11+) or ZGC (JDK 15+). If already using G1GC, tune -XX:MaxGCPauseMillis (default 200ms), -XX:G1HeapRegionSize, and -XX:InitiatingHeapOccupancyPercent. Check if humongous allocations are causing premature GC — objects larger than 50% of a G1 region (region size defaults to 1-32MB depending on heap size) are humongous.
Heap usage spikes to 90% then drops to 30% — normal or leak?
→ This is the expected sawtooth pattern IF the drop happens after a full GC cycle and returns to the same baseline. A leak shows as a baseline that increases over time. Track old gen usage after each full GC. If post-GC old gen usage grows over hours/days, you have a leak. Use jstat -gcutil <pid> 1000 to monitor. The key metric is old gen usage after full GC, not peak usage.
Service runs fine at 100 TPS but OOMs at 1000 TPS — not a leak, just load?
→ Possibly, but verify. Check if heap usage after GC is the same at both load levels. If the post-GC baseline is the same, the issue is allocation rate exceeding GC throughput. Options: increase the heap, switch to a lower-latency GC (ZGC/Shenandoah), or reduce the allocation rate through object pooling or caching. If the post-GC baseline is higher at 1000 TPS, you have a load-dependent leak — objects are referenced longer under concurrency.
Memory usage grows slowly over days — no OOM yet, but trending upward
→ Early leak detection. Take periodic heap histograms with jmap -histo:live and compare. Use JFR (Java Flight Recorder) with -XX:StartFlightRecording to capture allocation patterns over time. Look for object types whose count increases monotonically. Set up monitoring alerts on the old gen growth rate — alert if post-GC old gen grows more than 5% per hour.
OOM happens only in production, never in staging — same code, same -Xmx
→ Production data profiles differ from staging. Common causes: production has more unique users (larger session caches), more unique query patterns (larger query caches), or different traffic patterns (more concurrent connections). Compare object counts between environments using jmap -histo. The environment with higher counts reveals the data-dependent leak.
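The "old gen usage after full GC" baseline can also be sampled in-process. A sketch using the standard MemoryPoolMXBean API; note that pool names vary by collector ("G1 Old Gen", "PS Old Gen", "Tenured Gen") and that getCollectionUsage() reflects usage as of the pool's last collection:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Sketch of reading the "old gen after GC" leak metric in-process.
// Pool names vary by collector; this lookup assumes a generational
// collector such as G1 (the JDK 11+ default).
public class OldGenBaseline {

    static MemoryPoolMXBean oldGenPool() {
        return ManagementFactory.getMemoryPoolMXBeans().stream()
                .filter(p -> p.getName().contains("Old") || p.getName().contains("Tenured"))
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no old gen pool found"));
    }

    public static void main(String[] args) {
        MemoryPoolMXBean oldGen = oldGenPool();
        System.gc();  // force a collection so getCollectionUsage() is populated
        MemoryUsage afterGc = oldGen.getCollectionUsage();
        // A 'used' value here that rises across hours of post-GC samples is
        // the leak signature described above; a flat value is the healthy
        // sawtooth returning to baseline.
        System.out.printf("%s after GC: %d / %d bytes%n",
                oldGen.getName(), afterGc.getUsed(), afterGc.getMax());
    }
}
```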

JVM memory failures are the most common cause of unplanned downtime in Java-based production systems. An OOM kill at 2 AM takes down the service, triggers alerts, and forces on-call engineers to diagnose under pressure.

Most OOM errors are preventable. The JVM provides extensive diagnostics — heap dumps, GC logs, JFR recordings — but teams rarely configure them before the incident. By the time the OOM fires, the evidence is already gone unless you captured it proactively.

This guide covers the five OOM types, GC tuning trade-offs, memory leak detection patterns, and the production configurations that prevent most memory-related outages. Every pattern comes from systems running at scale — not textbook examples.

Start with the Quick Debug Cheat Sheet above if you are actively debugging an incident. Use the sections below for deep understanding and prevention.

Production Debugging Quick Map — Symptom to Tool

When a memory incident fires, you need to go from symptom to correct diagnostic tool in seconds. This map is designed to be printed and taped to your monitor.

The key insight: each symptom points to a specific memory region and a specific tool. Using the wrong tool wastes hours. A heap dump does not help with direct memory OOM. GC logs do not help with stack overflow. Match the symptom to the tool.

Decision flow:
  1. Read the error message or symptom.
  2. Find the matching row in the table below.
  3. Run the diagnostic command.
  4. Apply the fix.

Severity triage:
  • Service crashed (OOM) → critical — capture diagnostics immediately
  • Service degraded (slow) → high — capture diagnostics within 15 minutes
  • Service trending toward OOM → medium — schedule diagnostics within 1 hour
  • No symptoms, proactive check → low — run diagnostics during a maintenance window

The table below covers the 12 most common production memory scenarios. Each row maps symptom → what to check → which tool → immediate action. This is the scan-first view — use it before reading any section in detail.

Production insight: the most time-consuming part of memory debugging is choosing the right tool. Engineers waste hours running jmap when they should be reading GC logs, or analyzing heap dumps when the issue is off-heap. This table eliminates that wasted time by mapping symptoms directly to tools.

symptom_tool_map.txt · TEXT
SYMPTOM                          | WHAT TO CHECK                  | TOOL                           | IMMEDIATE ACTION
---------------------------------+--------------------------------+--------------------------------+-------------------------------------------
Exit code 137 (no JVM error)     | Container memory vs heap       | kubectl top + jcmd VM.native   | Increase container to 1.43x heap
OOM: Java heap space             | Heap contents (what is big?)   | jmap -histo + Eclipse MAT      | Find leak via dominator tree
OOM: Metaspace                   | Classloader count              | jcmd VM.classloader_stats      | Restart JVM, fix classloader leak
OOM: Direct buffer memory        | Direct buffer allocation       | jcmd VM.native_memory summary  | Fix buffer leak, set MaxDirectMemorySize
OOM: GC overhead limit           | GC frequency + old gen usage   | jstat -gcutil + GC logs        | Fix memory leak (GC cannot free enough)
StackOverflowError               | Call stack depth               | jstack <pid>                   | Convert recursion to iteration
Latency spikes (no OOM)          | GC pause times                 | GC logs (-Xlog:gc)             | Tune GC or switch collector
CPU high + slow response         | GC time percentage             | jstat -gcutil <pid> 1s         | If GC time > 5%, fix leak or increase heap
Memory grows over hours          | Old gen trend (post-GC)        | jstat -gcutil + jmap -histo    | Compare histograms, find growing types
OOM only at high traffic         | Allocation rate                | JFR (settings=profile)         | Reduce allocation rate or increase heap
OOM only in production           | Object count comparison        | jmap -histo (prod vs staging)  | Find data-dependent leak
OOM after code deploy            | Code diff (new caches/threads) | git diff + heap dump           | Check for removed eviction logic
Mental Model
The 60-Second Triage Rule
The symptom tells you the memory region. The region tells you the tool. The tool tells you the root cause. Do not skip steps.
  • Error message contains 'heap space' → heap region → jmap + MAT → leak or undersized heap.
  • Error message contains 'Metaspace' → class metadata → jcmd classloader_stats → classloader leak.
  • Error message contains 'Direct buffer' → off-heap NIO → jcmd native_memory → buffer leak.
  • Error message contains 'GC overhead' → GC cannot free memory → heap dump → memory leak confirmed.
  • No error message, just exit code 137 → container limit → kubectl top → off-heap exceeded the container limit.
  • No crash, just slow → GC pauses → GC logs → collector tuning or leak.
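The triage rules above are mechanical enough to encode. A hypothetical classifier (the enum and comment labels are illustrative, loosely mirroring the OomType enum used later in this guide):

```java
// Sketch of the triage list as code: map the error text to the memory region
// and therefore the first tool to reach for. Illustrative only; the mapping
// mirrors the mental-model bullets above.
public class OomTriage {

    enum Region { HEAP, METASPACE, DIRECT, GC_OVERHEAD, CONTAINER, UNKNOWN }

    static Region classify(String errorMessage) {
        String m = errorMessage == null ? "" : errorMessage;
        if (m.contains("Java heap space")) return Region.HEAP;        // jmap + MAT
        if (m.contains("Metaspace"))       return Region.METASPACE;   // jcmd VM.classloader_stats
        if (m.contains("Direct buffer"))   return Region.DIRECT;      // jcmd VM.native_memory
        if (m.contains("GC overhead"))     return Region.GC_OVERHEAD; // heap dump: leak confirmed
        if (m.contains("exit code 137"))   return Region.CONTAINER;   // kubectl top
        return Region.UNKNOWN;                                        // start with GC logs
    }

    public static void main(String[] args) {
        System.out.println(classify("java.lang.OutOfMemoryError: Java heap space")); // HEAP
        System.out.println(classify("java.lang.OutOfMemoryError: Metaspace"));       // METASPACE
    }
}
```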
📊 Production Insight
An on-call engineer received an OOM alert at 3 AM. The error was 'java.lang.OutOfMemoryError: Java heap space.' The engineer ran jcmd VM.native_memory summary (wrong tool — that is for off-heap). The output showed nothing unusual. Then they ran kubectl top pod (wrong tool — that is for container-level issues). Still nothing. Then they ran jstat -gcutil (useful but not sufficient). After 45 minutes of wrong tools, they finally ran jmap -histo:live and found a HashMap with 8 million entries in 30 seconds.
Cause: mismatched symptom-to-tool mapping. Effect: 45 minutes of wasted debugging time during a 3 AM incident. Impact: extended outage, delayed root cause identification. Action: printed the symptom-to-tool map and taped it to every engineer's monitor. Result: subsequent incidents triaged in under 60 seconds.
The lesson: having the right tools is not enough. You need the right tool for the right symptom. A cheat sheet that maps symptoms to tools eliminates the most common source of debugging delays.
🎯 Key Takeaway
Symptom determines the memory region. Memory region determines the tool. Tool determines the root cause. Print the symptom-to-tool map and eliminate 45 minutes of wrong-tool debugging during incidents.
Which Tool to Use for Each Memory Region
If: Suspected heap leak (heap space OOM, growing old gen)
→ Use: jmap -histo:live for object counts, jmap -dump for heap dump analysis in MAT. Use jstat -gcutil to confirm the old gen growth trend.
If: Suspected off-heap issue (container OOM, direct buffer OOM)
→ Use: jcmd VM.native_memory summary for a breakdown. kubectl top pod for total container usage. Check -XX:MaxDirectMemorySize.
If: Suspected GC problem (latency spikes, GC overhead OOM)
→ Use: GC logs (-Xlog:gc*) for pause times and frequency. jstat -gcutil for real-time GC activity. Check the collector type with -XX:+PrintCommandLineFlags.
If: Suspected classloader leak (Metaspace OOM)
→ Use: jcmd VM.classloader_stats for classloader counts. Check for hot-redeploy without a JVM restart.
If: Suspected thread issue (StackOverflowError, high thread count)
→ Use: jstack for a thread dump. ThreadMXBean.getThreadCount() for the thread count. Check the -Xss setting.
If: Suspected allocation rate issue (OOM only at high traffic)
→ Use: JFR with settings=profile for allocation hotspots. jstat -gcutil for allocation rate estimation. Check GC logs for the promotion rate.

Essential JVM Debug Commands — Complete Reference

Every production JVM memory incident requires specific commands. This section is the complete reference β€” categorized by tool, with exact syntax and what to look for in the output.

These commands assume JDK 11+ syntax. For JDK 8, some flags differ (noted where applicable).

Critical rule: always run diagnostic commands as the same user that owns the JVM process. In containers, exec into the container: kubectl exec -it <pod> -- /bin/bash.

jcmd — the Swiss Army knife. It replaces jinfo, jmap, jstack, and jstat for most operations and is available on all JDK 11+ installations. One tool, many functions.

jmap — heap dump and histogram. The primary tool for heap analysis. jmap -histo:live forces a full GC before counting, showing only live objects. jmap -dump:live creates a heap dump file for MAT analysis.

jstat — real-time GC monitoring. Shows GC activity without stopping the JVM. The -gcutil flag shows usage percentages for each generation. Run it with a 1-second interval for live debugging.

jstack — thread dump. Shows all threads and their stack traces. Essential for StackOverflowError and thread-related memory issues (ThreadLocal accumulation).

JFR — Java Flight Recorder. Low-overhead continuous profiling. Captures allocation patterns, GC events, and lock contention. Can run in production with <2% overhead.
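JFR can also be driven from inside the service rather than via jcmd. A sketch using the standard jdk.jfr API (JDK 11+); the size and age caps mirror the jcmd example that appears in the script below, and the file name is arbitrary:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

// Sketch of starting JFR programmatically with the jdk.jfr API (JDK 11+),
// useful when the service should self-record around suspicious events.
// The "default" configuration is the low-overhead profile shipped with the JDK.
public class JfrSelfRecord {

    static long recordTo(Path out) throws Exception {
        try (Recording recording = new Recording(Configuration.getConfiguration("default"))) {
            recording.setMaxSize(100L * 1024 * 1024);           // like maxsize=100M
            recording.setMaxAge(java.time.Duration.ofHours(1)); // like maxage=1h
            recording.start();
            // ... application work happens here ...
            recording.stop();
            recording.dump(out);                                // like jcmd JFR.dump
        }
        return Files.size(out);
    }

    public static void main(String[] args) throws Exception {
        Path out = Files.createTempFile("debug", ".jfr");
        System.out.println("recording bytes: " + recordTo(out));
    }
}
```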

Production insight: the most commonly confused commands are jmap -histo (object counts, fast) and jmap -dump (full heap dump, slow, pauses JVM). Use -histo first to get a quick overview. Only use -dump when you need the full object graph for MAT analysis. Dumping a 16GB heap pauses the JVM for 10-30 seconds.

Edge case: in Kubernetes, the JVM process PID is usually 1 (the container entrypoint). If your container runs a wrapper script, the JVM PID may be different. Use ps aux | grep java to find the actual PID. Some commands require JAVA_HOME to be set β€” verify with echo $JAVA_HOME before running.

jvm_debug_commands.sh · BASH
#!/bin/bash
# ============================================================
# JVM Debug Commands — Production Reference
# Run from inside the container or on the host with JVM access
# ============================================================

PID=$(pgrep -f 'java.*-Xmx')  # Find JVM PID

# ============================================================
# JCMD — Swiss Army Knife (JDK 11+)
# ============================================================

# List all JVM processes
jcmd

# JVM summary (uptime, arguments, heap config)
jcmd $PID VM.info

# Native memory breakdown (heap, thread, class, GC, direct)
jcmd $PID VM.native_memory summary
jcmd $PID VM.native_memory summary.diff  # Since last baseline
jcmd $PID VM.native_memory baseline      # Set baseline for diff

# Classloader statistics (class count, classloader count)
jcmd $PID VM.classloader_stats

# GC class statistics (instance count and size by class)
jcmd $PID GC.class_stats | head -20

# Force full GC
jcmd $PID GC.run

# Print all VM flags
jcmd $PID VM.flags -all | grep -E '(HeapDump|GC|Metaspace|DirectMemory|ThreadStackSize)'

# Print command line flags (shows effective GC settings)
jcmd $PID VM.command_line

# Thread dump (replaces jstack)
jcmd $PID Thread.print

# Heap dump
jcmd $PID GC.heap_dump /tmp/heap.hprof

# Heap histogram (live objects only, forces GC)
jcmd $PID GC.class_histogram | head -30

# JFR: start recording
jcmd $PID JFR.start name=debug settings=profile maxsize=100M maxage=1h

# JFR: dump recording
jcmd $PID JFR.dump name=debug filename=/tmp/recording.jfr

# JFR: stop recording
jcmd $PID JFR.stop name=debug

# ============================================================
# JMAP — Heap Dump and Histogram
# ============================================================

# Histogram of live objects (top 30 by count)
jmap -histo:live $PID | head -30

# Histogram of all objects (including unreachable — faster, no GC)
jmap -histo $PID | head -30

# Full heap dump (live objects only — forces GC first)
jmap -dump:live,format=b,file=/tmp/heap.hprof $PID

# Full heap dump (all objects — faster but larger file)
jmap -dump:format=b,file=/tmp/heap_all.hprof $PID

# ============================================================
# JSTAT — Real-Time GC Monitoring
# ============================================================

# GC utilization every 1 second, 10 samples
jstat -gcutil $PID 1000 10

# Output columns:
# S0   — Survivor 0 usage %
# S1   — Survivor 1 usage %
# E    — Eden usage %
# O    — Old gen usage %   ← KEY METRIC for leak detection
# M    — Metaspace usage %
# CCS  — Compressed class space usage %
# YGC  — Young GC count
# YGCT — Young GC total time (seconds)
# FGC  — Full GC count      ← SHOULD BE 0 in a healthy service
# FGCT — Full GC total time (seconds)
# GCT  — Total GC time (seconds)

# Key diagnostics:
# If O (old gen) keeps growing after GC → memory leak
# If FGC > 0 and increasing → old gen pressure
# If GCT/uptime > 5% → GC overhead problem

# ============================================================
# JSTACK — Thread Dump
# ============================================================

# Full thread dump
jstack $PID > /tmp/threads.txt

# Thread dump with lock information
jstack -l $PID > /tmp/threads_locked.txt

# Count threads by state (useful for thread leak detection)
jstack $PID | grep "java.lang.Thread.State" | sort | uniq -c | sort -rn

# ============================================================
# KUBERNETES / CONTAINER COMMANDS
# ============================================================

# Pod memory usage
kubectl top pod <pod-name> --containers

# Pod memory limits and usage
kubectl describe pod <pod-name> | grep -A 10 "Limits\|Requests"

# Container OOM kill events
kubectl get events --field-selector reason=OOMKilling

# Exec into running container
kubectl exec -it <pod-name> -- /bin/bash

# Check container memory limit from inside container
cat /sys/fs/cgroup/memory/memory.limit_in_bytes  # cgroup v1
cat /sys/fs/cgroup/memory.max                     # cgroup v2

# Check container memory usage from inside container
cat /sys/fs/cgroup/memory/memory.usage_in_bytes   # cgroup v1
cat /sys/fs/cgroup/memory.current                  # cgroup v2

# ============================================================
# QUICK DIAGNOSTIC SEQUENCE (Run this for any OOM)
# ============================================================

echo "=== Quick JVM Memory Diagnostic ==="
echo "PID: $PID"
echo ""
echo "--- 1. JVM Flags ---"
jcmd $PID VM.flags -all | grep -E '(MaxHeap|MaxMetaspace|MaxDirect|ThreadStack|GC)'
echo ""
echo "--- 2. Native Memory Summary ---"
jcmd $PID VM.native_memory summary
echo ""
echo "--- 3. Heap Histogram (top 15) ---"
jmap -histo:live $PID | head -15
echo ""
echo "--- 4. GC Status ---"
jstat -gcutil $PID 1000 5
echo ""
echo "--- 5. Thread Count ---"
jcmd $PID Thread.print | grep "java.lang.Thread.State" | wc -l
echo ""
echo "=== Diagnostic Complete ==="
Mental Model
The Five Commands You Need at 2 AM
Print these five commands. Tape them to your monitor. When the alert fires, run them in order.
  • jcmd $PID VM.native_memory summary — shows where all JVM memory is going (heap, threads, metaspace, direct).
  • jmap -histo:live $PID | head -30 — shows the top 30 object types by count and size. Fast, no heap dump needed.
  • jstat -gcutil $PID 1000 — shows GC activity in real time. Old gen growing = leak. Full GC count rising = pressure.
  • jcmd $PID GC.heap_dump /tmp/heap.hprof — full heap dump for MAT analysis. Pauses the JVM — use only when needed.
  • jstack $PID — thread dump for StackOverflowError and ThreadLocal leak detection.
📊 Production Insight
A team had no standardized debugging process for memory incidents. Each engineer used different commands in a different order. One engineer spent 20 minutes trying to find the JVM PID. Another ran jmap -dump (slow, pauses the JVM) before running jmap -histo (fast, no pause) — the dump took 3 minutes on a 16GB heap and the service became unresponsive.
The team created a standardized diagnostic script that runs the five essential commands in the correct order: flags (5 seconds), native memory (5 seconds), histogram (10 seconds), GC status (5 seconds), thread count (5 seconds). Total time: 30 seconds. The script runs automatically when an OOM alert fires.
Cause: no standardized diagnostic process. Effect: 20+ minutes of ad-hoc debugging per incident, with the wrong command order causing service disruption. Impact: extended outages, on-call burnout. Action: created an automated diagnostic script and printed a command cheat sheet. Result: a 30-second diagnostic baseline and consistent debugging across all engineers.
Key insight: the order matters. Run fast commands first (flags, histogram, GC status). Run slow commands only if the fast ones do not reveal the issue. Never run jmap -dump before jmap -histo — the histogram often reveals the problem without the full dump.
🎯 Key Takeaway
Five commands cover 95% of memory incidents: native_memory summary, jmap -histo, jstat -gcutil, GC.heap_dump, and Thread.print. Run fast commands first. Never dump before you histogram. Print the cheat sheet.
Which Command to Run First
If: Just received an OOM alert — need quick triage
→ Use: Run jcmd VM.native_memory summary (5 sec) + jmap -histo:live | head -30 (10 sec). Total: 15 seconds. This covers 80% of incidents.
If: The histogram shows no dominant object — need full analysis
→ Use: Run jcmd GC.heap_dump /tmp/heap.hprof. Analyze in Eclipse MAT. Check the dominator tree and a histogram comparison.
If: The service is slow but not crashed — suspect GC
→ Use: Run jstat -gcutil $PID 1000 for 30 seconds. If old gen is full and the Full GC count is rising, you have old gen pressure.
If: StackOverflowError or a thread-related issue
→ Use: Run jstack $PID. Look for repeating method signatures in the stack trace. Count threads by state.
If: Container OOM kill (exit 137) — no JVM error
→ Use: Run kubectl describe pod + kubectl top pod. Then run jcmd VM.native_memory summary inside the container to profile off-heap usage.
If: Need continuous profiling without stopping the service
→ Use: Start JFR: jcmd $PID JFR.start settings=profile maxage=1h. Dump on demand: jcmd $PID JFR.dump filename=/tmp/rec.jfr. Overhead <2%.

Understanding the Five OOM Types

Most developers treat OOM as a single error. It is not. The JVM has five distinct OOM conditions, each with different causes, diagnostics, and fixes. Treating them interchangeably leads to misdiagnosis.

Java heap space — the most common. The heap (young gen + old gen) is full and GC cannot free enough space. Almost always a memory leak or an undersized heap.

Metaspace — class metadata storage is full. Common in hot-redeploy environments where classloaders accumulate. Rarely a sizing issue — almost always a classloader leak.

Direct buffer memory — off-heap NIO buffer allocation failed. Common in Netty, gRPC, and NIO-based services. Usually a buffer leak or insufficient MaxDirectMemorySize.

GC overhead limit exceeded — GC is running continuously and recovering almost nothing. The JVM's way of saying 'I tried GC, it did not help, you have a leak.' This is a leak indicator, not a sizing issue.

Stack overflow — the thread call stack exceeded -Xss. Not a memory leak — it is a recursion depth problem. But it manifests as an OOM in monitoring.
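The standard fix for this type, converting recursion to iteration, can be shown on a deliberately trivial example (the summing function is illustrative, not from any incident in this guide):

```java
// Sketch of the recursion-to-iteration fix for StackOverflowError:
// move the work from the thread stack (bounded by -Xss) to a loop.
public class RecursionToIteration {

    // Recursive version: each call consumes a stack frame, so large n
    // exhausts the -Xss-bounded thread stack.
    static long sumRecursive(long n) {
        return n == 0 ? 0 : n + sumRecursive(n - 1);
    }

    // Iterative version: depth is no longer bounded by -Xss.
    static long sumIterative(long n) {
        long total = 0;
        for (long i = 1; i <= n; i++) total += i;
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumIterative(10_000_000L)); // fine at any -Xss
        try {
            sumRecursive(10_000_000L);                 // typically overflows
        } catch (StackOverflowError e) {
            System.out.println("StackOverflowError at default -Xss");
        }
    }
}
```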

The critical insight: each OOM type requires a different diagnostic approach. A heap dump does not help with Metaspace OOM. Increasing -Xmx does not fix direct buffer memory OOM. Matching the OOM type to the correct diagnostic tool is the first step.

Production edge case: some OOM types are caught by the JVM (heap space, metaspace), while others kill the process externally. The container OOM killer (exit code 137) bypasses the JVM entirely: no heap dump, no error message, just a dead process. This is why container memory limits must account for off-heap usage.

Performance implication: each OOM type has different latency characteristics. Heap OOM causes gradual degradation (GC pauses increase). Metaspace OOM is sudden (class loading fails). Direct memory OOM is sudden (buffer allocation fails). Stack overflow is immediate (thread dies). Understanding the failure mode helps you detect it earlier.

oom_type_detector.java · JAVA
package io.thecodeforge.monitoring;

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.util.Map;
import java.util.HashMap;

/**
 * OOM Type Detector: identifies which memory region is at risk
 * before an OOM occurs.
 */
public class OomTypeDetector {

    private static final double HEAP_WARNING_THRESHOLD = 0.80;
    private static final double HEAP_CRITICAL_THRESHOLD = 0.90;
    private static final double METASPACE_WARNING_THRESHOLD = 0.80;

    public enum RiskLevel {
        HEALTHY, WARNING, CRITICAL, IMMINENT
    }

    public enum OomType {
        HEAP_SPACE,
        METASPACE,
        DIRECT_BUFFER,
        GC_OVERHEAD,
        STACK_OVERFLOW,
        CONTAINER_LIMIT
    }

    public static class MemoryRiskReport {
        public RiskLevel heapRisk;
        public RiskLevel metaspaceRisk;
        public RiskLevel gcOverheadRisk;
        public Map<OomType, String> recommendations;
        public long heapUsedMB;
        public long heapMaxMB;
        public long metaspaceUsedMB;
        public long metaspaceMaxMB;
        public double gcTimePercent;

        public MemoryRiskReport() {
            recommendations = new HashMap<>();
            // Default every region to HEALTHY so callers never see null
            // (e.g. when no Metaspace pool is found or its max is undefined).
            heapRisk = RiskLevel.HEALTHY;
            metaspaceRisk = RiskLevel.HEALTHY;
            gcOverheadRisk = RiskLevel.HEALTHY;
        }
    }

    public static MemoryRiskReport analyze() {
        MemoryRiskReport report = new MemoryRiskReport();
        MemoryMXBean memBean = ManagementFactory.getMemoryMXBean();

        // Heap analysis
        MemoryUsage heapUsage = memBean.getHeapMemoryUsage();
        report.heapUsedMB = heapUsage.getUsed() / (1024 * 1024);
        report.heapMaxMB = heapUsage.getMax() / (1024 * 1024);
        // MemoryUsage.getMax() can return -1 when the max is undefined; guard the division
        double heapPercent = heapUsage.getMax() > 0
            ? (double) heapUsage.getUsed() / heapUsage.getMax() : 0.0;

        if (heapPercent >= HEAP_CRITICAL_THRESHOLD) {
            report.heapRisk = RiskLevel.CRITICAL;
            report.recommendations.put(OomType.HEAP_SPACE,
                "Heap at " + (int)(heapPercent * 100)
                + "% - capture heap dump and analyze dominator tree.");
        } else if (heapPercent >= HEAP_WARNING_THRESHOLD) {
            report.heapRisk = RiskLevel.WARNING;
            report.recommendations.put(OomType.HEAP_SPACE,
                "Heap at " + (int)(heapPercent * 100) + "% - monitor growth rate.");
        } else {
            report.heapRisk = RiskLevel.HEALTHY;
        }

        // Metaspace analysis
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Metaspace")) {
                MemoryUsage usage = pool.getUsage();
                report.metaspaceUsedMB = usage.getUsed() / (1024 * 1024);
                report.metaspaceMaxMB = usage.getMax() > 0
                    ? usage.getMax() / (1024 * 1024) : -1;
                if (report.metaspaceMaxMB > 0) {
                    double metaPercent = (double) usage.getUsed() / usage.getMax();
                    if (metaPercent >= METASPACE_WARNING_THRESHOLD) {
                        report.metaspaceRisk = RiskLevel.WARNING;
                        report.recommendations.put(OomType.METASPACE,
                            "Metaspace at " + (int)(metaPercent * 100)
                            + "% - check for classloader leaks.");
                    } else {
                        report.metaspaceRisk = RiskLevel.HEALTHY;
                    }
                }
            }
        }

        // GC overhead analysis
        long totalGcTimeMs = 0;
        long totalGcCount = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            totalGcTimeMs += gc.getCollectionTime();
            totalGcCount += gc.getCollectionCount();
        }
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        report.gcTimePercent = (double) totalGcTimeMs / uptimeMs * 100;

        if (report.gcTimePercent > 5.0) {
            report.gcOverheadRisk = RiskLevel.CRITICAL;
            report.recommendations.put(OomType.GC_OVERHEAD,
                "GC consuming " + String.format("%.1f", report.gcTimePercent)
                + "% of uptime - likely memory leak. Capture heap dump.");
        } else if (report.gcTimePercent > 2.0) {
            report.gcOverheadRisk = RiskLevel.WARNING;
        } else {
            report.gcOverheadRisk = RiskLevel.HEALTHY;
        }

        return report;
    }
}
Mental Model
The Five OOM Types: Each Needs a Different Diagnostic
Match the OOM type to the correct diagnostic tool before spending hours debugging the wrong thing.
  • Heap space: heap dump (jmap, -XX:+HeapDumpOnOutOfMemoryError). Look at dominator tree for leak suspects.
  • Metaspace: classloader analysis (jcmd VM.classloader_stats). Look for classloaders with high class count that should have been unloaded.
  • Direct buffer: Native Memory Tracking (-XX:NativeMemoryTracking=detail, jcmd VM.native_memory). Look for buffer allocation without corresponding release.
  • GC overhead: heap dump + GC log analysis. The leak is in old gen; look for objects that survive full GC.
  • Stack overflow: thread dump (jstack). Look for repeating method signatures indicating infinite recursion.
📊 Production Insight
A microservices team spent 3 days debugging a Metaspace OOM by increasing MaxMetaspaceSize from 256MB to 1GB. The OOM returned after 2 days. The real issue was a classloader leak caused by a reflection-based plugin system that cached Class objects in a static HashMap. Each redeployment loaded new classes but the old Class references were never released. The static HashMap grew indefinitely.
Cause: static HashMap caching Class objects from dynamically loaded classloaders. Effect: old classloaders could not be GC'd because the static map held references. Metaspace grew by 50MB per redeployment. Impact: service crashed every 2-3 days. Action: replaced static HashMap with WeakHashMap, added classloader leak detection using -verbose:class. Result: Metaspace stabilized at 80MB, no further OOMs.
Trade-off: WeakHashMap entries can be GC'd at any time, which means cached Class lookups may return null. Added a fallback path that reloads the class if the WeakHashMap entry was collected. Performance impact: ~0.1ms per cache miss, acceptable for a plugin system.
🎯 Key Takeaway
Five OOM types, five different causes, five different diagnostics. Match the error message to the correct tool before debugging. Heap dump for heap space, classloader stats for metaspace, NMT for direct memory, GC logs for overhead, thread dump for stack overflow.
Which OOM Type Are You Dealing With?

Heap Dump Analysis: Finding the Leak

A heap dump is a snapshot of every object in the JVM heap at a point in time. It is the single most important diagnostic artifact for heap OOM. Without it, you are guessing. With it, you can identify the exact object, its reference chain to GC root, and its retained size.

The key concept is the dominator tree. In a heap dump, object A dominates object B if every path from GC roots to B goes through A. The dominator tree shows which objects retain the most memory. The top entries in the dominator tree are your leak suspects.

Eclipse MAT (Memory Analyzer Tool) is the standard tool for heap dump analysis. The three reports that matter most: Leak Suspects Report (automated analysis), Dominator Tree (manual exploration), and Histogram (object count by type).

The Leak Suspects Report is the starting point. It identifies objects with unusually high retained size and shows the reference chain from GC root. If the report identifies a single suspect consuming 60%+ of heap, you have found the leak.

But the automated report does not always find the leak. Some leaks are distributed: no single object dominates, but thousands of small objects accumulate. In this case, use the Histogram to find object types with unexpectedly high counts. Compare with a second heap dump taken 1 hour later. The type with the fastest-growing count is the leak source.

Production insight: always take at least two heap dumps, 30-60 minutes apart. A single dump shows the current state. Two dumps show the trend. The trend is what reveals leaks.
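The histogram comparison is mechanical enough to script. A sketch that diffs two exported histograms (type to instance count, e.g. from two runs of jmap -histo); the class name is mine:

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: diff two class histograms and rank types by instance-count growth. */
class HistogramDiff {

    /** Returns type -> count growth, fastest-growing first; flat or shrinking types dropped. */
    static Map<String, Long> fastestGrowing(Map<String, Long> earlier, Map<String, Long> later) {
        Map<String, Long> result = new LinkedHashMap<>();
        later.entrySet().stream()
            .map(e -> Map.entry(e.getKey(), e.getValue() - earlier.getOrDefault(e.getKey(), 0L)))
            .filter(e -> e.getValue() > 0) // only growth matters for leak hunting
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
            .forEach(e -> result.put(e.getKey(), e.getValue()));
        return result;
    }
}
```

The first entry of the result is the prime leak suspect, exactly the manual procedure described above.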

Heap dump caveat: taking a heap dump pauses the JVM (full stop-the-world) for the duration of the dump. For a 4GB heap, this can be 10-30 seconds. For a 32GB heap, it can be several minutes. Never take a heap dump on a production system during peak traffic without understanding the pause impact. Use jmap -dump:live,format=b,file=heap.hprof <pid> to force a full GC first and capture only live objects, reducing dump size.

Alternative for large heaps: use JFR allocation profiling (-XX:StartFlightRecording=settings=profile) to capture allocation patterns without a full heap dump. JFR adds less than 2% overhead and can run continuously in production. It does not show object graphs, but it shows which code is allocating the most memory.

Performance trade-off: heap dump pause time is proportional to live object count, not heap size. A 16GB heap with 2GB live objects dumps faster than an 8GB heap with 6GB live objects. Use -XX:+HeapDumpOnOutOfMemoryError (auto-dump on OOM) and -XX:HeapDumpPath=/var/log/jvm/ to ensure dumps are captured even during unattended failures.
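The same live-only dump that jmap produces can also be triggered from inside the JVM via com.sun.management.HotSpotDiagnosticMXBean, which helps when jmap is not installed in the container image. A minimal sketch; the class name is mine:

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

/** Sketch: in-process equivalent of jmap -dump:live,format=b,file=... */
class HeapDumper {

    static void dumpLive(String filePath) throws IOException {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live=true forces a full GC first, so the dump contains only reachable objects
        bean.dumpHeap(filePath, true);
    }
}
```

Note that dumpHeap fails if the target file already exists, so generate a fresh path (e.g. with a timestamp) per dump.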

heap_dump_analyzer.java · JAVA
package io.thecodeforge.diagnostics;

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Proactive heap monitor that captures dumps when memory
 * growth rate indicates a leak, before OOM occurs.
 */
public class ProactiveHeapMonitor {

    private final ScheduledExecutorService scheduler;
    private final List<Snapshot> history;
    private final long maxHeapMB;
    private final double growthRateThresholdMBPerHour;
    private final String dumpDirectory;

    public ProactiveHeapMonitor(
            long maxHeapMB,
            double growthRateThresholdMBPerHour,
            String dumpDirectory
    ) {
        this.maxHeapMB = maxHeapMB;
        this.growthRateThresholdMBPerHour = growthRateThresholdMBPerHour;
        this.dumpDirectory = dumpDirectory;
        this.history = new ArrayList<>();
        this.scheduler = Executors.newSingleThreadScheduledExecutor(
            r -> {
                Thread t = new Thread(r, "heap-monitor");
                t.setDaemon(true);
                return t;
            }
        );
    }

    public void start(long intervalSeconds) {
        scheduler.scheduleAtFixedRate(
            this::checkMemory,
            intervalSeconds,
            intervalSeconds,
            TimeUnit.SECONDS
        );
    }

    private void checkMemory() {
        MemoryMXBean memBean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heapUsage = memBean.getHeapMemoryUsage();
        long usedMB = heapUsage.getUsed() / (1024 * 1024);
        Instant now = Instant.now();

        history.add(new Snapshot(now, usedMB));

        // Keep only last 24 hours of snapshots
        Instant cutoff = now.minusSeconds(86400);
        history.removeIf(s -> s.timestamp.isBefore(cutoff));

        // Check absolute threshold
        double usagePercent = (double) usedMB / maxHeapMB;
        if (usagePercent > 0.85) {
            logWarning("Heap usage at " + (int)(usagePercent * 100)
                + "% (" + usedMB + "MB / " + maxHeapMB + "MB)");
            if (usagePercent > 0.90) {
                captureHeapDump("high-usage-" + now.getEpochSecond());
            }
        }

        // Check growth rate (leak detection)
        if (history.size() >= 2) {
            Snapshot oldest = history.get(0);
            Snapshot newest = history.get(history.size() - 1);
            double hoursElapsed = (newest.timestamp.toEpochMilli()
                - oldest.timestamp.toEpochMilli()) / 3_600_000.0;
            if (hoursElapsed > 0.5) {
                double growthRateMBPerHour = (newest.usedMB - oldest.usedMB)
                    / hoursElapsed;
                if (growthRateMBPerHour > growthRateThresholdMBPerHour) {
                    logWarning("Heap growth rate: " + growthRateMBPerHour
                        + " MB/hour - possible leak");
                    captureHeapDump("leak-suspect-" + now.getEpochSecond());
                }
            }
        }
    }

    private void captureHeapDump(String label) {
        String filename = dumpDirectory + "/heap-" + label + ".hprof";
        try {
            String pid = ManagementFactory.getRuntimeMXBean().getName()
                .split("@")[0];
            ProcessBuilder pb = new ProcessBuilder(
                "jmap", "-dump:live,format=b,file=" + filename, pid
            );
            pb.redirectErrorStream(true);
            Process p = pb.start();
            int exitCode = p.waitFor();
            if (exitCode == 0) {
                logWarning("Heap dump captured: " + filename);
            } else {
                logWarning("Heap dump failed with exit code: " + exitCode);
            }
        } catch (Exception e) {
            logWarning("Heap dump failed: " + e.getMessage());
        }
    }

    private void logWarning(String message) {
        System.err.println("[HeapMonitor] " + Instant.now() + " " + message);
    }

    private static class Snapshot {
        final Instant timestamp;
        final long usedMB;
        Snapshot(Instant timestamp, long usedMB) {
            this.timestamp = timestamp;
            this.usedMB = usedMB;
        }
    }
}
Mental Model
Two Dumps Beat One: Trend Analysis Reveals Leaks
Object count growing between dumps = leak. Object count stable between dumps = right-sizing issue. Always compare two dumps.
  • Single dump: shows what is in the heap now. Useful for finding large objects. Cannot distinguish leak from legitimate usage.
  • Two dumps: shows what is growing. The object type with the fastest-growing count is the leak source.
  • Dominator tree: shows which objects retain the most memory. Top entries are leak suspects.
  • Leak Suspects Report: automated MAT analysis. Good starting point. Fails on distributed leaks (many small objects).
  • Histogram comparison: export histograms from both dumps, diff them. The type with the largest count increase is the leak.
📊 Production Insight
A recommendation engine service used 12GB of its 16GB heap. The team took a single heap dump and found no single object dominating memory; the largest retained object was 200MB. They concluded the heap was simply too small and requested 32GB from infrastructure.
A senior engineer took two dumps 45 minutes apart and compared histograms. The count of io.thecodeforge.model.CachedRecommendation objects grew from 8.2 million to 8.7 million in 45 minutes (666,000 new objects/hour, each ~1.2KB). The leak was distributed across millions of small objects, invisible in a single dump's dominator tree.
Cause: recommendation cache had no eviction policy. Each unique user+product combination created a CachedRecommendation that was never removed. Effect: 666K new objects/hour, ~800MB/hour growth. Impact: OOM every 20 hours. Action: added Caffeine cache with expireAfterWrite(1, TimeUnit.HOURS) and maximumSize(5_000_000). Result: steady-state heap dropped to 4GB, no OOM.
Key insight: single dump analysis missed this leak entirely because no single object dominated. Two-dump histogram comparison revealed it in minutes.
🎯 Key Takeaway
Two heap dumps 30-60 minutes apart reveal leaks that a single dump cannot. Compare histograms to find the fastest-growing object type. The dominator tree finds large objects; histogram comparison finds distributed leaks.
Heap Dump Analysis Strategy

GC Tuning: Collector Selection and Parameter Optimization

GC tuning is about trade-offs: throughput vs latency, pause time vs frequency, memory efficiency vs allocation speed. There is no universal best setting; the right configuration depends on your workload profile.

The four production GC collectors:

G1GC (default since JDK 9): balanced throughput and latency. Good default for most services. Tuning targets: -XX:MaxGCPauseMillis (default 200ms), -XX:G1HeapRegionSize, -XX:InitiatingHeapOccupancyPercent.

ZGC (JDK 15+): sub-millisecond pause times regardless of heap size. Best for latency-sensitive services (trading, real-time). Trade-off: slightly lower throughput, higher CPU usage for concurrent GC threads.

Shenandoah (JDK 12+): similar to ZGC, with low pause times and concurrent compaction. Trade-off: same as ZGC. Choose based on JDK vendor support.

Parallel GC: highest throughput, longest pauses. Best for batch processing where latency does not matter. Not recommended for interactive services.

The most common GC tuning mistake: switching collectors without understanding the workload. A team switched from G1GC to ZGC because they read it was 'faster.' Their service was a batch ETL pipeline that did not care about pause times. ZGC's extra CPU overhead reduced throughput by 8% for zero benefit.

Rule of thumb: if your service is latency-sensitive (p99 < 100ms), use ZGC or Shenandoah. If throughput matters more than latency, use Parallel GC. For everything else, G1GC is the right default.

Humongous allocations are a G1GC-specific problem. Objects larger than 50% of a G1 region are classified as humongous (default region size ranges from 1MB to 32MB depending on heap size, so the threshold is typically 512KB to 16MB). They are allocated in contiguous regions; older JDKs reclaimed them only at full GC, and even on modern JDKs that can reclaim them eagerly they still fragment the heap. If your service allocates many large byte arrays or StringBuilders, humongous allocations cause premature old gen promotion and full GC storms.

Fix: increase -XX:G1HeapRegionSize to reduce humongous threshold, or refactor code to avoid large contiguous allocations. Check GC logs for 'Humongous allocation' lines.
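To see roughly where the humongous threshold sits for a given heap, here is a sketch of HotSpot's default region sizing heuristic (an approximation: heap divided by 2048, rounded down to a power of two, clamped to 1-32MB; verify the actual value on your JVM with -Xlog:gc+heap):

```java
/** Sketch of G1's default region sizing and the humongous-allocation threshold. */
class G1Regions {

    /** Approximation of the default region size HotSpot would pick for this max heap. */
    static long defaultRegionSizeBytes(long maxHeapBytes) {
        // aim for ~2048 regions, power of two, clamped to [1MB, 32MB]
        long target = Long.highestOneBit(Math.max(1, maxHeapBytes / 2048));
        return Math.min(32L << 20, Math.max(1L << 20, target));
    }

    /** An allocation of at least half a region is humongous. */
    static boolean isHumongous(long allocationBytes, long regionSizeBytes) {
        return allocationBytes >= regionSizeBytes / 2;
    }
}
```

For a 4GB heap this gives 2MB regions, so any single allocation of 1MB or more (a large byte[] buffer, for example) goes down the humongous path.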

GC log analysis is essential. Enable GC logging with -Xlog:gc*:file=gc.log:time,uptime,level,tags:filecount=5,filesize=100M (JDK 11+). Key metrics to monitor: GC pause duration (max, p99, p95), GC frequency (pauses per minute), allocation rate (MB/sec), promotion rate (young gen to old gen MB/sec), and old gen usage after GC.

Production insight: the most impactful GC parameter is often not the collector itself, but the heap size relative to live data. If your live data set is 2GB and your heap is 8GB, GC has plenty of room to work. If your live data set is 6GB and your heap is 8GB, GC is constantly under pressure. Right-sizing the heap matters more than collector selection.
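The sizing rules used in this guide (heap at roughly 2x the live data set, and -Xmx capped at 70-75% of the container limit, per the quick answer above) reduce to two one-liners. A sketch; the multipliers are this guide's rules of thumb, not JVM constants:

```java
/** Sketch of the two heap-sizing rules of thumb from this guide. */
class HeapSizing {

    /** Heap ~2x the live data set leaves GC room to work without constant pressure. */
    static long recommendedHeapMB(long liveDataSetMB) {
        return liveDataSetMB * 2;
    }

    /** Cap -Xmx at ~75% of the container limit; the rest covers Metaspace,
     *  thread stacks, direct buffers, code cache, and OS page cache. */
    static long maxXmxForContainerMB(long containerLimitMB) {
        return (long) (containerLimitMB * 0.75);
    }
}
```

Example: a service with a 2GB live data set in a 6GB container would get -Xmx4g, comfortably under the 4.5GB cap.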

Edge case: containerized JVMs with cgroup memory limits. Prior to JDK 10, the JVM did not respect cgroup limits and would size the heap from host memory. JDK 10+ respects cgroup limits. Always verify with java -XX:+PrintFlagsFinal -version | grep MaxHeapSize that the JVM sees the correct memory limit.

gc_analyzer.java · JAVA
package io.thecodeforge.monitoring;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * GC Log Analyzer: parses GC logs and extracts key metrics
 * for production tuning decisions.
 */
public class GcLogAnalyzer {

    // JDK 11+ unified GC log format
    private static final Pattern GC_PAUSE_PATTERN = Pattern.compile(
        "\\[(?<timestamp>[\\d-T:.]+)\\]\\[(?<uptime>[\\d.]+)s\\]\\[(?<level>\\w+)\\]"
        + ".*GC\\((?<gcId>\\d+)\\) Pause (?<type>Young|Full|Mixed)"
        + ".*?(?<durationMs>[\\d.]+)ms"
    );

    private static final Pattern HEAP_PATTERN = Pattern.compile(
        "(?<used>\\d+)K->(?<after>\\d+)K\\((?<total>\\d+)K\\)"
    );

    public static class GcMetrics {
        public int totalGcPauses;
        public int youngGcCount;
        public int fullGcCount;
        public int mixedGcCount;
        public double maxPauseMs;
        public double p99PauseMs;
        public double p95PauseMs;
        public double avgPauseMs;
        public double totalPauseMs;
        public double gcTimePercent;
        public long maxHeapUsedKB;
        public long minHeapAfterGcKB;
        public List<Double> pauseTimes = new ArrayList<>();
    }

    public static GcMetrics analyze(String gcLogFile) throws IOException {
        GcMetrics metrics = new GcMetrics();

        try (BufferedReader reader = new BufferedReader(
                new FileReader(gcLogFile))) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher pauseMatcher = GC_PAUSE_PATTERN.matcher(line);
                if (pauseMatcher.find()) {
                    double duration = Double.parseDouble(
                        pauseMatcher.group("durationMs"));
                    String type = pauseMatcher.group("type");

                    metrics.totalGcPauses++;
                    metrics.pauseTimes.add(duration);
                    metrics.totalPauseMs += duration;

                    switch (type) {
                        case "Young": metrics.youngGcCount++; break;
                        case "Full":  metrics.fullGcCount++; break;
                        case "Mixed": metrics.mixedGcCount++; break;
                    }

                    if (duration > metrics.maxPauseMs) {
                        metrics.maxPauseMs = duration;
                    }
                }

                Matcher heapMatcher = HEAP_PATTERN.matcher(line);
                if (heapMatcher.find()) {
                    long used = Long.parseLong(heapMatcher.group("used"));
                    long after = Long.parseLong(heapMatcher.group("after"));
                    if (used > metrics.maxHeapUsedKB) {
                        metrics.maxHeapUsedKB = used;
                    }
                    if (metrics.minHeapAfterGcKB == 0
                            || after < metrics.minHeapAfterGcKB) {
                        metrics.minHeapAfterGcKB = after;
                    }
                }
            }
        }

        // Calculate percentiles
        if (!metrics.pauseTimes.isEmpty()) {
            metrics.pauseTimes.sort(Double::compareTo);
            int size = metrics.pauseTimes.size();
            metrics.avgPauseMs = metrics.totalPauseMs / size;
            metrics.p95PauseMs = metrics.pauseTimes.get((int)(size * 0.95));
            metrics.p99PauseMs = metrics.pauseTimes.get((int)(size * 0.99));
        }

        return metrics;
    }

    public static String generateReport(GcMetrics m) {
        StringBuilder sb = new StringBuilder();
        sb.append("=== GC Analysis Report ===\n");
        sb.append("Total GC pauses: ").append(m.totalGcPauses).append("\n");
        sb.append("Young GC: ").append(m.youngGcCount).append("\n");
        sb.append("Full GC: ").append(m.fullGcCount).append("\n");
        sb.append("Mixed GC: ").append(m.mixedGcCount).append("\n");
        sb.append("Max pause: ").append(m.maxPauseMs).append(" ms\n");
        sb.append("P99 pause: ").append(m.p99PauseMs).append(" ms\n");
        sb.append("P95 pause: ").append(m.p95PauseMs).append(" ms\n");
        sb.append("Avg pause: ").append(String.format("%.2f", m.avgPauseMs)).append(" ms\n");
        sb.append("Max heap used: ").append(m.maxHeapUsedKB / 1024).append(" MB\n");
        sb.append("Min heap after GC: ").append(m.minHeapAfterGcKB / 1024).append(" MB\n");

        // Warnings
        if (m.fullGcCount > 0) {
            sb.append("WARNING: Full GC detected - investigate old gen pressure\n");
        }
        if (m.p99PauseMs > 200) {
            sb.append("WARNING: P99 pause > 200ms - consider ZGC or Shenandoah\n");
        }
        if (m.minHeapAfterGcKB > 0) {
            long liveDataMB = m.minHeapAfterGcKB / 1024;
            sb.append("INFO: Live data set ~").append(liveDataMB).append(" MB\n");
            sb.append("INFO: Recommended heap (2x live data): ")
                .append(liveDataMB * 2).append(" MB\n");
        }

        return sb.toString();
    }
}
Mental Model
The GC Trade-off Triangle
Batch jobs want throughput. Real-time services want latency. Cost-sensitive systems want memory efficiency. The collector choice follows from the priority.
  • Throughput (Parallel GC): minimize time spent in GC relative to application work. Best for batch processing. Long pauses are acceptable.
  • Latency (ZGC/Shenandoah): minimize individual GC pause times. Best for real-time services. Higher CPU overhead is acceptable.
  • Memory efficiency (G1GC): balance between throughput and latency with moderate memory overhead. Best default for most services.
  • Humongous allocations: objects >50% of G1 region size cause full GC. Increase region size or refactor large allocations.
  • Container awareness: JDK 10+ respects cgroup limits. Always verify with PrintFlagsFinal. Pre-JDK 10 ignores container memory limits.
📊 Production Insight
A trading platform used G1GC with 32GB heap. During market open, GC pauses reached 400ms, causing order processing delays and regulatory violations. The team tuned G1GC parameters for 3 weeks, reducing pauses to 250ms. Still not good enough.
Switching to ZGC reduced pauses to 0.8ms consistently. The trade-off: ZGC used 15% more CPU for concurrent GC threads. The platform had spare CPU capacity, so this was acceptable.
Cause: G1GC stop-the-world pauses during concurrent marking. Effect: 400ms pauses during peak allocation rate. Impact: order processing delays, regulatory SLA violations. Action: switched to ZGC with -XX:+UseZGC -Xmx32g -XX:ConcGCThreads=4. Result: 0.8ms p99 pauses, 15% CPU increase, zero SLA violations.
Trade-off: if the platform had been CPU-bound, ZGC's overhead would have been unacceptable. The fix worked because CPU was the cheaper resource to trade for latency. Always profile CPU usage before switching collectors.
🎯 Key Takeaway
GC tuning is about trade-offs: throughput vs latency vs memory. Batch = Parallel GC. Real-time = ZGC. Everything else = G1GC. Full GC is always a problem; find the root cause.
GC Collector Selection
If: Service requires p99 latency < 50ms
→
Use: ZGC (JDK 15+) or Shenandoah (JDK 12+). Sub-millisecond pauses. Accept higher CPU overhead.
If: Service is a batch job or ETL pipeline
→
Use: Parallel GC. Highest throughput. Long pauses are acceptable since there is no user waiting.
If: General-purpose web service or API
→
Use: G1GC (JDK 9+ default). Tune MaxGCPauseMillis to your SLA. Good balance of throughput and latency.

Memory Leak Patterns and Detection

Memory leaks in Java are objects that are no longer needed but remain referenced, preventing garbage collection. Unlike C/C++ leaks (freed memory), Java leaks are reachable objects that should be unreachable.

The five most common leak patterns in production:

Unbounded collections: Maps, Lists, or Sets that grow without limit. The #1 cause of heap OOM. Fix: use bounded caches (Caffeine, Guava) with TTL and maximumSize.
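If pulling in Caffeine or Guava is not an option, the JDK alone gives size-based eviction (though no TTL) via LinkedHashMap. A minimal sketch:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Minimal size-bounded LRU cache using only the JDK.
 *  No TTL; for time-based expiry use a real cache library. */
class BoundedCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true -> least-recently-used eviction
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict instead of growing without limit
    }
}
```

Note this class is not thread-safe; wrap it with Collections.synchronizedMap or confine it to one thread.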

Listener/callback registration without deregistration: registering event listeners that hold references to the subscriber object. When the subscriber should be GC'd, the listener reference keeps it alive. Fix: always deregister in close()/destroy() methods.

ThreadLocal without cleanup: ThreadLocal values persist for the lifetime of the thread. In thread pools, threads live forever, so ThreadLocal values accumulate indefinitely. Fix: call threadLocal.remove() in a finally block after use.
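The remove()-in-finally fix looks like this in practice (class and method names are mine):

```java
/** Sketch: ThreadLocal state cleared in finally so pooled threads never
 *  accumulate stale values. */
class ThreadLocalCleanup {

    private static final ThreadLocal<StringBuilder> BUFFER = new ThreadLocal<>();

    static String render(String name) {
        BUFFER.set(new StringBuilder());
        try {
            return BUFFER.get().append("hello, ").append(name).toString();
        } finally {
            BUFFER.remove(); // without this, the value lives as long as the pool thread
        }
    }

    /** Exposed for verification: true if the current thread holds no value. */
    static boolean isClear() {
        return BUFFER.get() == null;
    }
}
```

In web frameworks this pattern usually lives in a servlet filter or interceptor so the cleanup runs even when the request handler throws.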

ClassLoader leaks: in hot-redeploy environments, old classloaders remain referenced by static fields or thread-locals. The classloader cannot be GC'd, and neither can all the classes it loaded. Fix: avoid static references to classes from dynamic classloaders. Use WeakReference or ServiceLoader patterns.

String.intern() abuse: String.intern() stores strings in the string pool (PermGen before JDK 7, the main heap from JDK 7 onward). Interning user-generated strings creates an unbounded pool. Fix: never intern user input. Use a bounded cache with eviction instead.

Detection strategy: the sawtooth test. Monitor heap usage over time. A healthy JVM shows a sawtooth pattern: heap rises during allocation, drops after GC, returns to the same baseline. A leak shows the same sawtooth, but the baseline after GC increases over time. The post-GC baseline is the key metric.
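The post-GC baseline check reduces to a least-squares slope over post-GC heap samples; a clearly positive slope means the baseline is climbing, i.e. a leak. A sketch (class name is mine):

```java
import java.util.List;

/** Sketch of the sawtooth test: linear-regression slope of post-GC heap usage. */
class PostGcBaseline {

    /** hours[i] is the sample time in hours; postGcMB[i] is heap used right after GC.
     *  Returns the least-squares slope in MB/hour. */
    static double slopeMBPerHour(List<Double> hours, List<Double> postGcMB) {
        int n = hours.size();
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += hours.get(i);
            sy += postGcMB.get(i);
            sxx += hours.get(i) * hours.get(i);
            sxy += hours.get(i) * postGcMB.get(i);
        }
        return (n * sxy - sx * sy) / (n * sxx - sx * sx);
    }
}
```

A healthy service trends near zero; a steady positive slope (say, tens of MB/hour) is the numeric form of the "baseline never returns" symptom.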

Production tool: Java Flight Recorder (JFR) with allocation profiling. JFR records every significant allocation with the call stack. Enable with -XX:StartFlightRecording=settings=profile,duration=60s,filename=alloc.jfr. Analyze with JDK Mission Control (JMC): the 'Allocation by Thread' and 'Allocation by Class' views show where memory is being allocated.

Edge case: soft reference accumulation. The JVM collects SoftReferences only when heap pressure is high. If your cache uses SoftReferences, it will consume all available heap before releasing entries. This is by design, but it makes heap appear full even when it is not leaking. Switch to WeakReference or use a proper cache library with size-based eviction.

Performance consideration: leak detection tools (JFR, MAT) add overhead. JFR adds <2% CPU overhead and can run continuously. MAT analysis requires a heap dump, which pauses the JVM. Use JFR for continuous monitoring and MAT for post-mortem analysis.

leak_detector.java · JAVA
package io.thecodeforge.diagnostics;

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.GarbageCollectorMXBean;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Memory Leak Detector: monitors old gen growth rate
 * to detect leaks before OOM occurs.
 *
 * Core insight: a leak shows as increasing old gen usage
 * after each full GC. The post-GC baseline is the key metric.
 */
public class MemoryLeakDetector {

    private final ScheduledExecutorService scheduler;
    private final List<OldGenSnapshot> snapshots;
    private final double alertThresholdMBPerHour;
    private final LeakAlertHandler alertHandler;

    public interface LeakAlertHandler {
        void onLeakDetected(double growthRateMBPerHour,
                           long currentOldGenMB,
                           String recommendation);
    }

    public MemoryLeakDetector(
            double alertThresholdMBPerHour,
            LeakAlertHandler alertHandler
    ) {
        this.alertThresholdMBPerHour = alertThresholdMBPerHour;
        this.alertHandler = alertHandler;
        this.snapshots = new ArrayList<>();
        this.scheduler = Executors.newSingleThreadScheduledExecutor(
            r -> {
                Thread t = new Thread(r, "leak-detector");
                t.setDaemon(true);
                return t;
            }
        );
    }

    public void start(long intervalSeconds) {
        scheduler.scheduleAtFixedRate(
            this::sampleOldGen,
            intervalSeconds,
            intervalSeconds,
            TimeUnit.SECONDS
        );
    }

    private void sampleOldGen() {
        long oldGenUsedMB = getOldGenUsedMB();
        Instant now = Instant.now();

        snapshots.add(new OldGenSnapshot(now, oldGenUsedMB));

        // Keep only last 6 hours
        Instant cutoff = now.minusSeconds(21600);
        snapshots.removeIf(s -> s.timestamp.isBefore(cutoff));

        // Need several samples before estimating a trend
        if (snapshots.size() < 6) return;

        // Calculate growth rate
        OldGenSnapshot oldest = snapshots.get(0);
        OldGenSnapshot newest = snapshots.get(snapshots.size() - 1);
        double hoursElapsed = (newest.timestamp.toEpochMilli()
            - oldest.timestamp.toEpochMilli()) / 3_600_000.0;

        if (hoursElapsed < 0.5) return;

        double growthRateMBPerHour = (newest.usedMB - oldest.usedMB)
            / hoursElapsed;

        if (growthRateMBPerHour > alertThresholdMBPerHour) {
            String recommendation = buildRecommendation(
                growthRateMBPerHour, newest.usedMB);
            alertHandler.onLeakDetected(
                growthRateMBPerHour, newest.usedMB, recommendation);
        }
    }

    private long getOldGenUsedMB() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Old") || name.contains("Tenured")) {
                return pool.getUsage().getUsed() / (1024 * 1024);
            }
        }
        // Fallback: use heap usage
        MemoryMXBean memBean = ManagementFactory.getMemoryMXBean();
        return memBean.getHeapMemoryUsage().getUsed() / (1024 * 1024);
    }

    private String buildRecommendation(
            double growthRateMBPerHour, long currentOldGenMB
    ) {
        StringBuilder sb = new StringBuilder();
        sb.append("Memory leak detected. ");
        sb.append("Growth rate: ").append(String.format("%.1f", growthRateMBPerHour));
        sb.append(" MB/hour. ");
        sb.append("Current old gen: ").append(currentOldGenMB).append(" MB. ");
        sb.append("Actions: ");
        sb.append("1) Capture heap dump (jmap -dump:live,format=b,file=heap.hprof). ");
        sb.append("2) Analyze with MAT β€” check dominator tree and histogram. ");
        sb.append("3) Compare with previous histogram to find growing object types.");
        return sb.toString();
    }

    private static class OldGenSnapshot {
        final Instant timestamp;
        final long usedMB;
        OldGenSnapshot(Instant timestamp, long usedMB) {
            this.timestamp = timestamp;
            this.usedMB = usedMB;
        }
    }
}
Mental Model
The Sawtooth Test β€” Is It a Leak or Just Load?
The post-GC baseline is the only metric that reveals a leak. Peak usage is irrelevant β€” it depends on allocation rate and GC timing.
  • Healthy pattern: heap rises to 4GB, GC brings it back to 1.5GB. Next cycle: rises to 4GB, back to 1.5GB. Baseline is stable.
  • Leak pattern: heap rises to 4GB, GC brings it to 1.5GB. Next cycle: rises to 4GB, back to 1.8GB. Next: back to 2.1GB. Baseline is rising.
  • Key metric: old gen usage after full GC. Monitor this, not peak heap usage.
  • Detection: take snapshots every 30 seconds. Calculate growth rate of post-GC baseline. Alert if >5% per hour.
  • False positive: legitimate cache growth (new data being cached) looks like a leak. Distinguish by checking if the growth stabilizes.
πŸ“Š Production Insight
A session management service showed stable memory usage for 6 months. After a feature release, the team noticed heap usage after GC growing at 100MB/hour. They suspected a leak but could not find it in the heap dump β€” no single object dominated.
The leak was a ThreadLocal in a request filter that stored user context. The filter was called on every request, and the ThreadLocal was set but never removed. In a thread pool, threads live forever, so ThreadLocal values accumulated indefinitely. Each user context was ~2KB. At 50,000 unique users per hour, that was 100MB/hour.
Cause: ThreadLocal.set() without ThreadLocal.remove() in a request filter. Effect: each thread accumulated user contexts for every user it served. Impact: 100MB/hour growth, OOM every 10 hours. Action: added threadLocal.remove() in a finally block after request processing. Result: memory growth dropped to zero.
Why the heap dump did not help: ThreadLocal values are stored in the Thread object's threadLocals map, not in a global collection. The dominator tree showed many Thread objects, each holding a small map. Without knowing to look at ThreadLocal, the dump appeared healthy.
🎯 Key Takeaway
Monitor post-GC old gen baseline, not peak heap usage. A rising baseline confirms a leak. Take two dumps 30-60 minutes apart and compare histograms. ThreadLocal and unbounded caches are the most common production leak sources.
Memory Leak Detection Strategy

Production JVM Configuration: Flags That Matter

JVM configuration is where most memory incidents are prevented β€” or caused. The wrong flags make debugging impossible. The right flags make it trivial.

Non-negotiable production flags:

-XX:+HeapDumpOnOutOfMemoryError β€” captures a heap dump when OOM occurs. Without this, you have no diagnostic data after the crash. Set -XX:HeapDumpPath to a persistent directory (not /tmp in containers β€” /tmp is often tmpfs and too small).

-Xlog:gc*:file=gc.log:time,uptime,level,tags:filecount=5,filesize=100M β€” enables GC logging with rotation. Essential for diagnosing GC issues. JDK 11+ syntax. For JDK 8, use -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log.

-XX:+ExitOnOutOfMemoryError β€” kills the JVM immediately on OOM instead of leaving it in an undefined state. In containerized environments, this ensures the container restarts via the orchestrator. Without this, the JVM may continue running in a degraded state, accepting requests it cannot process.

-XX:MaxRAMPercentage=70.0 β€” sets max heap as a percentage of container memory. Alternative to -Xmx for containerized deployments. Automatically adjusts when container limits change. Use 70-75% to leave room for off-heap.

Container memory calculation: Container memory = heap (Xmx) + metaspace + thread stacks (Xss Γ— thread count) + direct memory (MaxDirectMemorySize) + native memory (JNI) + OS overhead.

Rule of thumb: set container memory limit to 1.3-1.5x your -Xmx value. For a 4GB heap, set container limit to 5.2-6GB. This covers metaspace (~100-200MB), thread stacks (200 threads Γ— 1MB = 200MB), direct memory (~256MB), and OS overhead (~500MB).
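The rule of thumb above can be written down as a tiny helper β€” the method name and the 256MB rounding granularity are illustrative, not from the original text:

```java
public class ContainerSizing {
    /**
     * Container memory limit from -Xmx: multiply by a safety factor
     * (1.3-1.5x) and round up to the nearest 256 MB boundary.
     */
    public static long containerLimitMB(long xmxMB, double factor) {
        long raw = (long) Math.ceil(xmxMB * factor);
        return ((raw + 255) / 256) * 256; // round up to a 256 MB boundary
    }
}
```

For a 4096 MB heap, a 1.3x factor yields 5376 MB (β‰ˆ5.25GB) and 1.5x yields 6144 MB (6GB), matching the 5.2-6GB guidance above.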

Thread stack sizing: -Xss sets stack size per thread. Default is 512KB-1MB depending on OS. For services with many threads, this matters. 500 threads Γ— 1MB = 500MB of stack memory. If your call depth is shallow, reduce to -Xss256k. If you have deep recursion, increase to -Xss2m.

Metaspace sizing: -XX:MaxMetaspaceSize limits metaspace growth. Without this limit, metaspace can consume all available native memory. Set it to a reasonable value (256MB-512MB for most services). If you hit the limit, it indicates a classloader leak, not insufficient space.

JFR continuous recording: -XX:StartFlightRecording=settings=default,maxsize=100M,maxage=1h β€” enables continuous JFR recording with rolling buffer. When an incident occurs, dump the recording with jcmd <pid> JFR.dump. This gives you allocation, GC, and lock profiling data without restarting the service.

Edge case: -XX:+UseCompressedOops is enabled by default for heaps <32GB. It compresses object pointers from 8 bytes to 4 bytes, saving ~20% heap. Above 32GB, compressed oops are disabled and each object pointer costs 8 bytes. This means a 34GB heap may perform worse than a 31GB heap due to pointer size increase. Either stay under 32GB or go significantly above (40GB+).

production_jvm_flags.sh Β· BASH
1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465666768697071727374757677787980818283
#!/bin/bash
# Production JVM flags for containerized Java services
# Tested on JDK 17 with G1GC and ZGC configurations
# NOTE: the inline '#' comments below are documentation only β€” strip them
# (or use a JVM @argfile) before passing the flags to java, since '#' is
# not a comment character once the variable is expanded on a command line

# ============================================================
# BASELINE CONFIGURATION (G1GC β€” suitable for most services)
# ============================================================

JVM_BASE_FLAGS="
  # Memory
  -XX:MaxRAMPercentage=70.0
  -XX:InitialRAMPercentage=50.0
  -XX:MaxMetaspaceSize=256m
  -Xss512k

  # GC β€” G1GC
  -XX:+UseG1GC
  -XX:MaxGCPauseMillis=200
  -XX:G1HeapRegionSize=4m
  -XX:InitiatingHeapOccupancyPercent=45

  # Diagnostics (non-negotiable)
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/var/log/jvm/heapdump.hprof
  -XX:+ExitOnOutOfMemoryError   # or +CrashOnOutOfMemoryError for a core dump β€” set one, not both

  # GC Logging (JDK 11+)
  -Xlog:gc*:file=/var/log/jvm/gc.log:time,uptime,level,tags:filecount=5,filesize=100m

  # JFR continuous recording
  -XX:StartFlightRecording=settings=default,maxsize=100M,maxage=1h,name=continuous,filename=/var/log/jvm/recording.jfr

  # Compressed oops (auto-enabled <32GB heap)
  -XX:+UseCompressedOops
  -XX:+UseCompressedClassPointers
"

# ============================================================
# LOW-LATENCY CONFIGURATION (ZGC β€” for p99 < 10ms services)
# ============================================================

JVM_ZGC_FLAGS="
  # Memory
  -XX:MaxRAMPercentage=70.0
  -XX:MaxMetaspaceSize=256m
  -Xss512k

  # GC β€” ZGC
  -XX:+UseZGC
  -XX:+ZGenerational          # JDK 21+ generational ZGC
  -XX:ConcGCThreads=4
  -XX:ParallelGCThreads=8

  # Diagnostics
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/var/log/jvm/heapdump.hprof
  -XX:+ExitOnOutOfMemoryError
  -Xlog:gc*:file=/var/log/jvm/gc.log:time,uptime,level,tags:filecount=5,filesize=100m
  -XX:StartFlightRecording=settings=default,maxsize=100M,maxage=1h,name=continuous
"

# ============================================================
# CONTAINER MEMORY CALCULATION
# ============================================================
#
# For a 4GB heap (-XX:MaxRAMPercentage=70.0 on a 5.7GB container):
#
#   Heap:           4000 MB  (70% of 5700MB)
#   Metaspace:       256 MB  (MaxMetaspaceSize)
#   Thread stacks:   200 MB  (400 threads Γ— 512KB)
#   Direct memory:   256 MB  (default = Xmx)
#   GC overhead:     200 MB  (G1GC bookkeeping)
#   Native/JNI:      300 MB  (JNI libraries, socket buffers)
#   OS overhead:     500 MB  (page cache, file descriptors)
#   ----------------------------------------
#   Total:          5712 MB  (round the container limit up to 6GB for headroom)
#
# Formula: Container = Xmx Γ— 1.43 (round up to nearest 256MB)
# ============================================================

echo "JVM flags configured for production deployment"
echo "Container memory recommendation: Xmx Γ— 1.43"
Mental Model
Container Memory Budget β€” Every Byte Counts
If you set -Xmx equal to container memory, the container OOM killer will strike before the JVM OOM handler runs. No heap dump, no error message, just a dead process.
  • Heap (Xmx): 70% of container memory. This is your working memory for objects.
  • Thread stacks: 512KB-1MB per thread. 400 threads = 200-400MB. Reduce with -Xss256k if call depth is shallow.
  • Metaspace: 100-256MB for most services. Set MaxMetaspaceSize to prevent runaway growth.
  • Direct memory: default equals Xmx. Set MaxDirectMemorySize explicitly if using NIO/Netty.
  • OS overhead: 300-500MB for page cache, file descriptors, socket buffers. Never allocate 100% of container memory to JVM.
πŸ“Š Production Insight
A Kubernetes deployment set container memory limit to 4GB and -Xmx to 4GB. The service ran fine during normal traffic. During a traffic spike, the container was OOM-killed (exit code 137) every 2-3 hours. No JVM OOM error was logged β€” the OS killed the process before the JVM could detect the issue.
The team added NativeMemoryTracking and discovered the JVM was using 4.8GB total: 4GB heap + 300MB metaspace + 200MB thread stacks + 300MB direct memory. The container limit was 4GB, so the OS killed the process when total usage exceeded the limit.
Cause: -Xmx set equal to container memory limit with no room for off-heap. Effect: container OOM killer terminated the process. Impact: 3-5 restarts per day during peak traffic. Action: increased container limit to 6GB (4GB Γ— 1.5), kept -Xmx at 4GB. Result: zero OOM kills.
Lesson: container memory limit must be 1.3-1.5x the heap size. The extra 30-50% covers off-heap usage that the JVM does not track against -Xmx.
🎯 Key Takeaway
Set container memory to 1.43x your heap size. Always enable heap dump on OOM, GC logging, and JFR. These three flags turn production memory incidents from guesswork into diagnosis. Without them, you are flying blind.
JVM Flag Configuration Decisions

Off-Heap Memory: Direct Buffers, Native Memory, and Thread Stacks

Most JVM memory guides focus exclusively on heap. In production, off-heap memory causes at least 30% of OOM incidents. The container OOM killer does not care whether the memory is heap or off-heap β€” it kills when total usage exceeds the limit.

Direct ByteBuffer β€” allocated via ByteBuffer.allocateDirect(). Lives outside the heap in native memory. Used by NIO channels, Netty, gRPC, and file I/O. The JVM tracks direct buffer usage against -XX:MaxDirectMemorySize (default = -Xmx). If direct buffer allocation exceeds this limit, you get OOM: Direct buffer memory.

The insidious part: direct buffers are freed by a ReferenceQueue-based cleaner, not immediately when the buffer is GC'd. If the application allocates direct buffers faster than the GC and cleaner can reclaim them, you get OOM even though the buffers are technically unreachable. This is a rate problem, not a leak problem.
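One way to watch allocation outrun the cleaner is the JDK's public BufferPoolMXBean, which reports live direct and mapped buffer counts and bytes (the class name below is illustrative):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

/** Prints current direct and mapped buffer pool usage. */
public class BufferPoolProbe {
    public static void main(String[] args) {
        for (BufferPoolMXBean pool : ManagementFactory
                .getPlatformMXBeans(BufferPoolMXBean.class)) {
            // Pools are named "direct" and "mapped"
            System.out.printf("%s: count=%d used=%d bytes capacity=%d bytes%n",
                pool.getName(), pool.getCount(),
                pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
```

Sampling this periodically and alerting when `used` approaches MaxDirectMemorySize catches the allocation-rate problem before it surfaces as OOM: Direct buffer memory.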

Thread stacks β€” each thread has a stack of size -Xss. Default is 512KB-1MB. 500 threads Γ— 1MB = 500MB. This memory is allocated at thread creation and never shrinks. In services with dynamic thread pools, thread count can grow under load, consuming more stack memory.

Metaspace β€” class metadata storage. Replaced PermGen in JDK 8. Grows as classes are loaded. Bounded by -XX:MaxMetaspaceSize. Unbounded by default β€” can consume all available native memory if not limited.

JNI native memory β€” memory allocated by native libraries via JNI. The JVM does not track this. Common sources: database drivers (OCI, native JDBC), compression libraries (zlib, snappy), and cryptographic providers. Use NativeMemoryTracking to estimate.

MappedByteBuffer β€” file-backed memory mapping via FileChannel.map(). Maps file contents directly into process address space. Not counted against heap or MaxDirectMemorySize. Large memory-mapped files can trigger container OOM.

Diagnosis tool: NativeMemoryTracking (NMT). Enable with -XX:NativeMemoryTracking=detail. Query with jcmd <pid> VM.native_memory summary. NMT shows memory breakdown by category: Java Heap, Class (metaspace), Thread, Code, GC, Internal, Symbol, Malloc, and Mapped.

Performance caveat: NMT adds 5-10% overhead in detail mode. Use -XX:NativeMemoryTracking=summary for production (1-2% overhead). Switch to detail mode only during active debugging.

Edge case: Netty's PooledByteBufAllocator recycles direct buffers to avoid allocation overhead. If the pool grows under load, it retains memory even after the buffers are released. Monitor Netty's pool metrics (PooledByteBufAllocator.metric()) to detect pool bloat.

OffHeapMonitor.java Β· JAVA
package io.thecodeforge.monitoring;

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

/**
 * Off-Heap Memory Monitor β€” tracks memory usage outside
 * the JVM heap that contributes to container OOM kills.
 */
public class OffHeapMonitor {

    public static class OffHeapReport {
        public long metaspaceUsedMB;
        public long metaspaceMaxMB;
        public long threadStackMB;
        public int threadCount;
        public long directMemoryMaxMB;
        public long compressedClassSpaceMB;
        public long codeCacheMB;
        public Map<String, String> recommendations = new HashMap<>();

        public long totalOffHeapMB() {
            return metaspaceUsedMB + threadStackMB
                + compressedClassSpaceMB + codeCacheMB;
        }
    }

    public static OffHeapReport analyze() {
        OffHeapReport report = new OffHeapReport();

        // Metaspace
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            MemoryUsage usage = pool.getUsage();

            if (name.contains("Metaspace")) {
                report.metaspaceUsedMB = usage.getUsed() / (1024 * 1024);
                report.metaspaceMaxMB = usage.getMax() > 0
                    ? usage.getMax() / (1024 * 1024) : -1;
            } else if (name.contains("Compressed Class Space")) {
                report.compressedClassSpaceMB = usage.getUsed() / (1024 * 1024);
            } else if (name.contains("Code Cache")) {
                report.codeCacheMB = usage.getUsed() / (1024 * 1024);
            }
        }

        // Thread stacks
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        report.threadCount = threadBean.getThreadCount();
        // Estimate: each thread uses -Xss (default ~1MB)
        // More accurate: check -XX:ThreadStackSize via VM flags
        report.threadStackMB = report.threadCount; // Rough estimate: 1MB per thread

        // Direct memory currently in use β€” via the public BufferPoolMXBean
        // API (sun.misc.VM was removed from the JDK in Java 9, so this
        // field reports current usage, not the configured limit)
        report.directMemoryMaxMB = -1;
        for (java.lang.management.BufferPoolMXBean bp : ManagementFactory
                .getPlatformMXBeans(java.lang.management.BufferPoolMXBean.class)) {
            if ("direct".equals(bp.getName())) {
                report.directMemoryMaxMB = bp.getMemoryUsed() / (1024 * 1024);
            }
        }

        // Recommendations
        if (report.metaspaceUsedMB > 200) {
            report.recommendations.put("metaspace",
                "Metaspace using " + report.metaspaceUsedMB
                + "MB β€” check for classloader leaks");
        }
        if (report.threadCount > 300) {
            report.recommendations.put("threads",
                report.threadCount + " threads active β€” "
                + report.threadStackMB + "MB in stacks. "
                + "Consider reducing thread pool size or -Xss.");
        }
        long totalOffHeap = report.totalOffHeapMB();
        if (totalOffHeap > 1024) {
            report.recommendations.put("total",
                "Total off-heap: " + totalOffHeap + "MB. "
                + "Ensure container memory limit accounts for this.");
        }

        return report;
    }

    /**
     * Direct buffer memory currently in use, in MB, read from the
     * public BufferPoolMXBean API. Call this periodically to detect
     * direct memory pressure. Returns -1 if the pool is not found.
     */
    public static long getDirectMemoryUsedEstimate() {
        for (java.lang.management.BufferPoolMXBean bp : ManagementFactory
                .getPlatformMXBeans(java.lang.management.BufferPoolMXBean.class)) {
            if ("direct".equals(bp.getName())) {
                return bp.getMemoryUsed() / (1024 * 1024);
            }
        }
        return -1;
    }
}
Mental Model
The Hidden 30% β€” Off-Heap Memory Budget
The container OOM killer measures total memory β€” heap + off-heap. If you only monitor heap, you are watching 70% of the problem.
  • Thread stacks: 512KB-1MB per thread. 400 threads = 200-400MB. This grows if thread pool scales up under load.
  • Metaspace: 100-256MB typical. Unbounded by default. Set MaxMetaspaceSize to prevent runaway growth.
  • Direct buffers: tracked by MaxDirectMemorySize. Default equals Xmx. Netty pools can retain memory even after release.
  • Native memory: JNI libraries, socket buffers, file descriptors. Not tracked by JVM. Use NMT for estimates.
  • MappedByteBuffer: file-backed mapping. Not counted against heap or direct memory. Large files can trigger container OOM.
πŸ“Š Production Insight
A gRPC service using Netty experienced container OOM kills despite heap usage never exceeding 60%. The team was baffled β€” heap monitoring showed no pressure.
NativeMemoryTracking revealed the issue: Netty's PooledByteBufAllocator had grown to 1.8GB of direct buffers during a traffic spike. The pool retained these buffers even after the gRPC calls completed, waiting for reuse. The container had 4GB limit, 2.4GB heap, 1.8GB Netty pool, 400MB other off-heap = 4.6GB total. Container OOM killer struck.
Cause: Netty pooled allocator retained 1.8GB of direct buffers. Effect: total memory exceeded 4GB container limit. Impact: 4-6 container OOM kills per day during peak traffic. Action: set Netty's PooledByteBufAllocator maxOrder=8 (reduced pool size) and added -XX:MaxDirectMemorySize=512m. Result: direct buffer usage stabilized at 400MB, no OOM kills.
Lesson: Netty's buffer pool is off-heap and invisible to heap monitoring. Always monitor total JVM memory (heap + off-heap), not just heap.
🎯 Key Takeaway
Off-heap memory is invisible to heap monitoring but visible to the container OOM killer. Enable NativeMemoryTracking, monitor thread count, and set explicit limits for direct memory and metaspace. The container measures total memory, not just heap.
Off-Heap Memory Troubleshooting

Building a Production Memory Monitoring Stack

Memory incidents are preventable with the right monitoring. The goal is to detect problems hours before they cause OOM β€” not after.

Layer 1 β€” JVM metrics (Prometheus/JMX): Expose heap usage, GC pause times, GC count, thread count, and metaspace usage via JMX. Use Micrometer or JMX Exporter for Prometheus integration. Key alerts:
  β€’ Old gen usage after GC > 70% for 10 minutes β†’ warning
  β€’ Old gen usage after GC > 85% for 5 minutes β†’ critical
  β€’ GC pause p99 > 500ms β†’ warning
  β€’ GC pause p99 > 2s β†’ critical
  β€’ Thread count > 80% of max pool size β†’ warning
  β€’ Full GC count > 0 in last hour β†’ investigate

Layer 2 β€” Container metrics (cAdvisor/Kubernetes): Monitor container memory usage (not just JVM heap). Key alerts:
  β€’ Container memory > 85% of limit β†’ warning
  β€’ Container memory > 95% of limit β†’ critical (OOM imminent)
  β€’ Container restart count > 0 in last hour β†’ investigate

Layer 3 β€” Application-level metrics: Track object counts for known leak-prone structures: session cache size, connection pool size, thread-local count. These are domain-specific and catch leaks that JVM metrics miss.

Alerting philosophy: Alert on trends, not thresholds. A heap at 80% is fine if it returns to 40% after GC. A heap at 60% is a problem if it never drops below 55% after GC. The post-GC baseline trend is the most important metric.

Automated remediation: For containerized services, configure liveness probes that check heap usage. If heap exceeds 90%, the probe fails and Kubernetes restarts the pod. This is a safety net, not a fix β€” but it prevents the service from running in a degraded state while you investigate.
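A minimal sketch of that probe's decision logic, assuming a 90% threshold β€” class and method names are illustrative, and a real probe would expose this over an HTTP endpoint for Kubernetes to poll:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Heap-based liveness check: fail the probe above a usage threshold. */
public class HeapLivenessCheck {
    static final double THRESHOLD = 0.90; // assumed 90% cutoff

    /** Pure decision logic, separated from the MXBean read for testability. */
    static boolean isHealthy(long usedBytes, long maxBytes) {
        // An unbounded heap (max == -1) never fails the probe
        return maxBytes <= 0 || (double) usedBytes / maxBytes <= THRESHOLD;
    }

    /** What the HTTP probe handler would evaluate on each request. */
    static boolean probe() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean()
            .getHeapMemoryUsage();
        return isHealthy(heap.getUsed(), heap.getMax());
    }
}
```

Keeping the threshold comparison in its own method makes the cutoff easy to unit-test and to tune without touching the HTTP wiring.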

Retention and analysis: Keep GC logs and heap dumps for at least 7 days. Memory leaks can take days to manifest. If you only keep 24 hours of logs, you lose the trend data needed for diagnosis. Store dumps in object storage (S3, GCS) with lifecycle policies.

Production insight: the monitoring stack itself must not consume significant memory. A common mistake is running a heavy APM agent (100-200MB overhead) alongside the JVM. In a 2GB heap container, the agent consumes 5-10% of total memory. Use lightweight agents (JMX Exporter <20MB) or expose metrics via an HTTP endpoint without an agent.

MemoryMetricsExporter.java Β· JAVA
package io.thecodeforge.monitoring;

import com.sun.management.OperatingSystemMXBean;
import java.lang.management.*;
import java.util.HashMap;
import java.util.Map;

/**
 * Memory Metrics Exporter β€” exposes JVM memory metrics
 * for Prometheus/monitoring integration.
 *
 * Lightweight alternative to heavy APM agents.
 * Estimated overhead: <5MB heap, <0.1% CPU.
 */
public class MemoryMetricsExporter {

    public static class MemoryMetrics {
        // Heap
        public long heapUsedMB;
        public long heapMaxMB;
        public long heapCommittedMB;
        public double heapUsagePercent;

        // Young gen
        public long youngGenUsedMB;
        public long youngGenMaxMB;

        // Old gen
        public long oldGenUsedMB;
        public long oldGenMaxMB;
        public double oldGenUsagePercent;

        // Off-heap
        public long metaspaceUsedMB;
        public long metaspaceMaxMB;
        public long threadCount;
        public long threadStackEstimateMB;
        public long directMemoryMaxMB;

        // GC
        public long youngGcCount;
        public long youngGcTimeMs;
        public long fullGcCount;
        public long fullGcTimeMs;
        public double gcTimePercent;

        // Container
        public long containerMemoryLimitMB;
        public long processPhysicalMemoryMB;
        public double containerUsagePercent;
    }

    public static MemoryMetrics collect() {
        MemoryMetrics m = new MemoryMetrics();
        MemoryMXBean memBean = ManagementFactory.getMemoryMXBean();

        // Heap
        MemoryUsage heap = memBean.getHeapMemoryUsage();
        m.heapUsedMB = heap.getUsed() / (1024 * 1024);
        m.heapMaxMB = heap.getMax() / (1024 * 1024);
        m.heapCommittedMB = heap.getCommitted() / (1024 * 1024);
        m.heapUsagePercent = (double) heap.getUsed() / heap.getMax() * 100;

        // Memory pools
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            MemoryUsage usage = pool.getUsage();
            if (name.contains("Eden") || name.contains("Survivor")) {
                m.youngGenUsedMB += usage.getUsed() / (1024 * 1024);
                if (usage.getMax() > 0) {
                    m.youngGenMaxMB += usage.getMax() / (1024 * 1024);
                }
            } else if (name.contains("Old") || name.contains("Tenured")) {
                m.oldGenUsedMB = usage.getUsed() / (1024 * 1024);
                m.oldGenMaxMB = usage.getMax() > 0
                    ? usage.getMax() / (1024 * 1024) : 0;
                m.oldGenUsagePercent = m.oldGenMaxMB > 0
                    ? (double) m.oldGenUsedMB / m.oldGenMaxMB * 100 : 0;
            } else if (name.contains("Metaspace")) {
                m.metaspaceUsedMB = usage.getUsed() / (1024 * 1024);
                m.metaspaceMaxMB = usage.getMax() > 0
                    ? usage.getMax() / (1024 * 1024) : -1;
            }
        }

        // GC stats
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        // Name matching below covers the Parallel and G1 collectors; ZGC
        // reports "ZGC Cycles"/"ZGC Pauses" and would need extra cases
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            String name = gc.getName();
            if (name.contains("Young") || name.contains("Scavenge")
                    || name.contains("G1 Young")) {
                m.youngGcCount = gc.getCollectionCount();
                m.youngGcTimeMs = gc.getCollectionTime();
            } else if (name.contains("Old") || name.contains("MarkSweep")
                    || name.contains("G1 Old")) {
                m.fullGcCount = gc.getCollectionCount();
                m.fullGcTimeMs = gc.getCollectionTime();
            }
        }
        m.gcTimePercent = (double)(m.youngGcTimeMs + m.fullGcTimeMs)
            / uptimeMs * 100;

        // Threads
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        m.threadCount = threadBean.getThreadCount();
        m.threadStackEstimateMB = m.threadCount; // ~1MB per thread estimate

        // Container / OS memory
        try {
            OperatingSystemMXBean osBean = (OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
            // JDK 14+ API β€” the get*PhysicalMemorySize variants are deprecated
            long totalPhysical = osBean.getTotalMemorySize();
            long freePhysical = osBean.getFreeMemorySize();
            // System-wide used memory; with JDK 10+ container awareness
            // this reflects the cgroup limit and usage inside a container
            m.processPhysicalMemoryMB = (totalPhysical - freePhysical)
                / (1024 * 1024);
            m.containerMemoryLimitMB = totalPhysical / (1024 * 1024);
            m.containerUsagePercent = (double) m.processPhysicalMemoryMB
                / m.containerMemoryLimitMB * 100;
        } catch (Exception e) {
            // Not available on all JVMs
        }

        return m;
    }

    public static String toPrometheusFormat(MemoryMetrics m) {
        StringBuilder sb = new StringBuilder();
        sb.append("# HELP jvm_memory_heap_used_bytes JVM heap used\n");
        sb.append("# TYPE jvm_memory_heap_used_bytes gauge\n");
        sb.append("jvm_memory_heap_used_bytes ")
            .append(m.heapUsedMB * 1024 * 1024).append("\n\n");

        sb.append("# HELP jvm_memory_old_gen_usage_percent Old gen usage\n");
        sb.append("# TYPE jvm_memory_old_gen_usage_percent gauge\n");
        sb.append("jvm_memory_old_gen_usage_percent ")
            .append(String.format("%.2f", m.oldGenUsagePercent)).append("\n\n");

        sb.append("# HELP jvm_gc_full_count Full GC count\n");
        sb.append("# TYPE jvm_gc_full_count counter\n");
        sb.append("jvm_gc_full_count ").append(m.fullGcCount).append("\n\n");

        sb.append("# HELP jvm_memory_container_usage_percent Container memory usage\n");
        sb.append("# TYPE jvm_memory_container_usage_percent gauge\n");
        sb.append("jvm_memory_container_usage_percent ")
            .append(String.format("%.2f", m.containerUsagePercent)).append("\n");

        return sb.toString();
    }
}
Mental Model
Three-Layer Memory Monitoring
Each layer catches what the others miss. Layer 1 catches heap leaks. Layer 2 catches container OOM. Layer 3 catches domain-specific growth.
  • Layer 1 (JVM): heap usage, GC pauses, GC count, metaspace, thread count. Catches heap leaks and GC problems.
  • Layer 2 (Container): total memory usage, restart count, OOM kill count. Catches off-heap issues that JVM metrics miss.
  • Layer 3 (Application): session cache size, connection pool size, custom object counts. Catches domain-specific leaks.
  • Alert on trends: post-GC old gen baseline rising = leak. Post-GC old gen stable = right-sizing issue.
  • Keep GC logs and heap dumps for 7+ days. Memory leaks take days to manifest. 24-hour retention loses trend data.
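Layer 1's most important signal, the post-GC old gen baseline, is available from the standard java.lang.management API. A minimal sketch, assuming a HotSpot JVM (old gen pool names vary by collector, so the substring match below is a heuristic):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class OldGenBaseline {

    // Returns old gen usage in bytes as of the end of the last collection,
    // or -1 if no collection has been recorded yet or no pool matched.
    public static long postGcOldGenBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName(); // e.g. "G1 Old Gen", "PS Old Gen", "Tenured Gen"
            if (name.contains("Old Gen") || name.contains("Tenured")) {
                MemoryUsage afterGc = pool.getCollectionUsage(); // snapshot taken at end of last GC
                return afterGc == null ? -1 : afterGc.getUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.gc(); // trigger one collection so a post-GC snapshot exists
        System.out.println("post-GC old gen bytes: " + postGcOldGenBytes());
    }
}
```

Sampling this value after each collection and alerting on a rising trend is the Layer 1 leak check described above.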
📊 Production Insight
A team had comprehensive JVM monitoring (heap, GC, threads) but no container-level monitoring. They experienced intermittent OOM kills that their JVM metrics did not predict. The issue was off-heap growth from a native compression library that consumed 800MB during peak traffic.
Adding container memory monitoring (cAdvisor + Prometheus) immediately revealed the pattern: container memory grew to 95% of limit while heap stayed at 60%. The team added a container memory alert at 85% and got 30 minutes of warning before each OOM kill.
Cause: native compression library allocated 800MB outside JVM heap. Effect: container OOM kills with no JVM-level warning. Impact: 2-3 unexpected restarts per week. Action: added container memory alerting, reduced compression buffer size, increased container limit. Result: zero OOM kills, 30+ minute early warning on memory pressure.
Monitoring overhead: the JMX Exporter added <5MB heap overhead and <0.1% CPU. The alternative (Datadog APM agent) would have added 150MB heap overhead, 7.5% of the 2GB container. Lightweight monitoring is essential in memory-constrained containers.
🎯 Key Takeaway
Three layers of monitoring: JVM (heap, GC, threads), container (total memory, OOM kills), and application (caches, pools). Alert on post-GC old gen trends, not absolute values. Keep diagnostic data for 7+ days; leaks take time to manifest.
🗂 JVM Memory Issues Compared
Side-by-side comparison of common JVM memory problems with root causes and fixes.
| Situation | Common Cause | Best Fix |
| --- | --- | --- |
| OOM: Java heap space | Memory leak or undersized heap | Analyze heap dump with MAT. Find leak via dominator tree. Fix leak, then right-size heap. |
| OOM: Metaspace | ClassLoader leak in hot-redeploy environment | Restart JVM on redeploy. Avoid static references to dynamic classloaders. Use WeakHashMap. |
| OOM: Direct buffer memory | Netty/NIO buffer leak or insufficient MaxDirectMemorySize | Enable ResourceLeakDetector. Set MaxDirectMemorySize explicitly. Monitor with NMT. |
| GC overhead limit exceeded | Memory leak; GC cannot free enough memory | Analyze heap dump. Fix the leak. Increasing heap only delays the crash. |
| StackOverflowError | Infinite recursion or deep call stack | Convert recursion to iteration. Increase -Xss if deep recursion is intentional. |
| Container OOM kill (exit 137) | Total memory (heap + off-heap) exceeds container limit | Set container limit to 1.43x heap. Add NativeMemoryTracking. Monitor container memory. |
| GC pauses >1 second | Full GC on large heap with G1GC | Switch to ZGC (sub-ms pauses) or tune G1GC MaxGCPauseMillis and IHOP. |
| Memory grows but no single leak object | Distributed leak (ThreadLocal, unbounded cache) | Compare heap histograms over time. Check ThreadLocal.remove() and cache eviction. |
| OOM only at high traffic | Allocation rate exceeds GC throughput | Reduce allocation rate (object pooling, caching). Switch to higher-throughput GC. |
| OOM after code deployment | New code introduced leak or removed cleanup | Diff deployed code. Look for new caches, new ThreadLocal, removed eviction logic. |
| Heap at 80% but stable (no leak) | Working set is legitimately large | Right-size heap. Working set × 2 is a good starting point. Not every high usage is a leak. |
| Humongous allocations in GC logs | Objects >50% of G1 region size | Increase G1HeapRegionSize or refactor large byte[]/StringBuilder allocations. |
| SoftReference cache consuming all heap | JVM only collects SoftReferences under heap pressure | Switch to size-bounded cache (Caffeine) with explicit eviction. |
| Netty buffer pool growing unbounded | PooledByteBufAllocator retains buffers under load | Set maxOrder limit. Monitor pool metrics. Use -XX:MaxDirectMemorySize. |

🎯 Key Takeaways

  • Five OOM types, five different causes, five different diagnostics. Match the error message to the correct tool before debugging.
  • Two heap dumps 30-60 minutes apart reveal leaks that a single dump cannot. Compare histograms to find the fastest-growing object type.
  • GC tuning is about trade-offs: throughput vs latency vs memory. Batch = Parallel GC. Real-time = ZGC. Everything else = G1GC.
  • Monitor post-GC old gen baseline, not peak heap usage. A rising baseline confirms a leak. Peak usage is irrelevant for leak detection.
  • Set container memory to 1.43x your heap size. Off-heap memory (thread stacks, metaspace, direct buffers) is invisible to heap monitoring but visible to the container OOM killer.
  • ThreadLocal and unbounded caches are the most common production leak sources. Always call ThreadLocal.remove() in a finally block. Always set maximumSize on caches.
  • Three non-negotiable production flags: -XX:+HeapDumpOnOutOfMemoryError, GC logging, and -XX:+ExitOnOutOfMemoryError. Without them, you are flying blind.
  • Enable NativeMemoryTracking to profile off-heap memory. Container OOM kills with normal heap usage indicate off-heap pressure.
  • Netty's PooledByteBufAllocator retains direct buffers even after release. Monitor pool metrics and set explicit MaxDirectMemorySize.
  • Keep GC logs and heap dumps for 7+ days. Memory leaks take days to manifest. 24-hour retention loses the trend data needed for diagnosis.
  • Print the symptom-to-tool map and the five essential commands. When the alert fires at 2 AM, you need to triage in 60 seconds, not 45 minutes.
  • Run fast commands first (jmap -histo, jstat -gcutil). Run slow commands (jmap -dump) only when fast commands do not reveal the issue.

⚠ Common Mistakes to Avoid

  • ✕ Treating all OOM errors the same. Each type (heap, metaspace, direct, GC overhead, stack) has a different cause and fix.
  • ✕ Not setting -XX:+HeapDumpOnOutOfMemoryError. Without it, you have zero diagnostic data after an OOM crash.
  • ✕ Setting -Xmx equal to the container memory limit. The container OOM killer strikes before the JVM OOM handler, leaving no heap dump.
  • ✕ Doubling heap size without understanding the leak. This just delays the crash and makes the heap dump twice as large to analyze.
  • ✕ Monitoring peak heap usage instead of the post-GC old gen baseline. Peak usage depends on allocation rate and GC timing, not leak presence.
  • ✕ Using a plain HashMap for session storage with manual cleanup. Use a Caffeine or Guava cache with TTL and maximumSize.
  • ✕ Calling ThreadLocal.set() without ThreadLocal.remove() in thread pool environments. ThreadLocal values persist for the thread's lifetime.
  • ✕ Using String.intern() on user-generated input. This creates an unbounded string pool that grows with every unique string.
  • ✕ Not enabling GC logging in production. GC logs are essential for diagnosing pause time issues and allocation rate problems.
  • ✕ Using a heavy APM agent in memory-constrained containers. A 150MB agent in a 2GB container is 7.5% of total memory.
  • ✕ Switching GC collectors without understanding the workload. ZGC adds CPU overhead that is wasted on batch jobs that do not care about latency.
  • ✕ Not monitoring container memory alongside JVM heap. Off-heap memory (thread stacks, metaspace, direct buffers) can be 30-50% of total usage.
  • ✕ Keeping GC logs and heap dumps for only 24 hours. Memory leaks take days to manifest, requiring longer retention for trend analysis.
  • ✕ Ignoring humongous allocations in G1GC. Objects larger than 50% of region size cause premature full GC and performance degradation.
  • ✕ Setting -XX:MaxMetaspaceSize too low. Metaspace OOM is usually a classloader leak, not insufficient space. Fix the leak, not the limit.
  • ✕ Not accounting for the compressed oops boundary at 32GB. A 34GB heap can perform worse than a 31GB heap due to the pointer size increase.
  • ✕ Using SoftReference-based caches. The JVM only collects SoftReferences under heap pressure, allowing them to consume all available memory.
  • ✕ Running the wrong diagnostic tool for the symptom. jmap does not help with direct memory; GC logs do not help with stack overflow.
  • ✕ Running jmap -dump before jmap -histo. The histogram is fast and often reveals the problem without needing the slow full dump.
  • ✕ Not having a standardized diagnostic script. Ad-hoc debugging at 2 AM wastes 20+ minutes per incident.
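The ThreadLocal rule from the list above (always remove() in a finally block) looks like this in practice. A minimal sketch; the request handler and payload format are made up for illustration:

```java
public class ThreadLocalHygiene {

    // Hypothetical per-request scratch buffer; the leak pattern is the same
    // for any ThreadLocal used on pooled threads.
    private static final ThreadLocal<StringBuilder> BUFFER =
        ThreadLocal.withInitial(() -> new StringBuilder(1024));

    public static String handleRequest(String payload) {
        StringBuilder sb = BUFFER.get();
        try {
            sb.setLength(0);
            sb.append("processed:").append(payload);
            return sb.toString();
        } finally {
            // In a thread pool the thread outlives the request, so without
            // remove() the value stays reachable for the thread's lifetime.
            BUFFER.remove();
        }
    }

    public static void main(String[] args) {
        System.out.println(handleRequest("order-42")); // processed:order-42
    }
}
```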

Interview Questions on This Topic

  • Q: Walk me through how you would debug an OOM: Java heap space error in a production service. What tools would you use, what would you look for, and how would you confirm the fix?
  • Q: Your service is running in a Kubernetes pod with 4GB memory limit. The pod gets killed with exit code 137 every few hours, but your JVM heap monitoring shows usage never exceeds 60%. What is happening and how would you fix it?
  • Q: Explain the difference between the five OOM types in the JVM. For each, what is the typical root cause and what diagnostic tool would you use?
  • Q: You need to reduce GC pause times from 500ms to under 10ms for a latency-sensitive trading service. What GC collector would you choose, what parameters would you tune, and what trade-offs would you accept?
  • Q: A memory leak in production causes OOM every 20 hours. You take a heap dump but the dominator tree shows no single object dominating memory. How do you find the leak?
  • Q: Explain the sawtooth pattern in JVM heap usage. How do you distinguish a healthy sawtooth from a memory leak? What metric do you monitor?
  • Q: Your team wants to switch from G1GC to ZGC because they read it is 'faster.' What questions would you ask before approving the change, and what trade-offs would you explain?
  • Q: Design a memory monitoring stack for a fleet of 200 Java microservices running in Kubernetes. What metrics would you collect, what alerts would you set, and how would you keep overhead minimal?
  • Q: A service uses Netty for HTTP handling. Container memory usage grows to 95% of limit but heap usage is only 50%. Diagnose the issue and explain the fix.
  • Q: You inherit a codebase with 50 ThreadLocal usages across the application. How would you audit them for leaks, and what patterns would you enforce to prevent ThreadLocal leaks in thread pool environments?
  • Q: It is 3 AM and you just received an OOM alert. You have 60 seconds before the on-call escalation. Walk me through the exact commands you would run and in what order.
  • Q: Your JVM flags include -Xmx4g and the container memory limit is 4GB. Explain why this is wrong and how you would fix it.

Frequently Asked Questions

What is the difference between OOM: Java heap space and GC overhead limit exceeded?

OOM: Java heap space means the heap is full and GC cannot free enough space for the current allocation. GC overhead limit exceeded means GC is running continuously (>98% of time) and recovering almost nothing (<2% of heap). Both indicate memory pressure, but GC overhead is the JVM's way of saying 'I tried GC and it did not help; you have a leak.' Fix the leak, do not just increase heap.

How much memory should I allocate to the JVM in a container?

Set container memory to 1.43x your -Xmx value. For a 4GB heap, set container limit to 5.7GB. The extra 30-43% covers metaspace (~256MB), thread stacks (~200MB for 200 threads), direct memory (~256MB), GC overhead (~200MB), and OS overhead (~500MB). Use -XX:MaxRAMPercentage=70.0 to set heap as 70% of container memory.
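The arithmetic behind the 1.43x rule can be sanity-checked in code. A throwaway sketch; the ratio is this guide's rule of thumb, not a JVM constant:

```java
public class ContainerSizing {

    // Container limit recommendation: roughly 1.43 times the -Xmx heap,
    // leaving headroom for metaspace, thread stacks, direct memory, and GC.
    public static long recommendedContainerMB(long heapMB) {
        return Math.round(heapMB * 1.43);
    }

    public static void main(String[] args) {
        long heapMB = Runtime.getRuntime().maxMemory() / (1024 * 1024); // current -Xmx, approx
        System.out.println("-Xmx (approx): " + heapMB + " MB");
        System.out.println("suggested container limit: " + recommendedContainerMB(heapMB) + " MB");
    }
}
```

For a 4096MB heap this yields roughly 5.7GB, matching the example in the answer above.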

How do I find a memory leak in production?

Step 1: confirm it is a leak by monitoring the post-GC old gen baseline; if it rises over hours, it is a leak. Step 2: take two heap dumps 30-60 minutes apart. Step 3: compare histograms in Eclipse MAT to find the fastest-growing object type. Step 4: follow the reference chain to the GC root to find who holds the reference. Step 5: fix the reference (add eviction, call remove(), close the resource).
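Steps 2-3 can also be scripted when MAT is not at hand: diff instance counts between two jmap -histo snapshots. A rough sketch, assuming HotSpot's histogram column layout (rank, #instances, #bytes, class name); the class name in main is made up:

```java
import java.util.HashMap;
import java.util.Map;

public class HistogramDiff {

    // Parses jmap -histo rows like "   1:       1000      64000  com.example.Session"
    // into a map of class name -> instance count. Header/footer lines are skipped.
    public static Map<String, Long> parse(String histo) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : histo.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 4 && parts[0].endsWith(":")) {
                counts.put(parts[3], Long.parseLong(parts[1]));
            }
        }
        return counts;
    }

    // Returns classes whose instance count grew between the two snapshots.
    public static Map<String, Long> growth(Map<String, Long> before, Map<String, Long> after) {
        Map<String, Long> grew = new HashMap<>();
        after.forEach((cls, n) -> {
            long delta = n - before.getOrDefault(cls, 0L);
            if (delta > 0) grew.put(cls, delta);
        });
        return grew;
    }

    public static void main(String[] args) {
        String t0 = "   1:       1000      64000  com.example.Session\n";
        String t1 = "   1:       9000     576000  com.example.Session\n";
        System.out.println(growth(parse(t0), parse(t1))); // {com.example.Session=8000}
    }
}
```

The fastest-growing class is the starting point for the reference-chain hunt in step 4.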

Should I use G1GC, ZGC, or Parallel GC?

G1GC for most services (good balance). ZGC for latency-sensitive services requiring p99 < 10ms (trading, real-time). Parallel GC for batch jobs where throughput matters and pauses are acceptable. Do not switch to ZGC just because it is newer; it adds CPU overhead that is wasted if you do not need sub-millisecond pauses.

My container is killed with exit code 137 but no JVM OOM error β€” what happened?

The Linux OOM killer terminated your process because total memory (heap + off-heap) exceeded the container memory limit. The JVM did not OOM; the OS killed it. Check whether -Xmx equals the container memory limit (wrong). Increase the container limit to 1.43x heap size. Add NativeMemoryTracking to profile off-heap usage.

How do I diagnose Metaspace OOM?

Metaspace OOM is almost always a classloader leak, not insufficient space. Check if your service uses hot-redeploy without a JVM restart. Use jcmd <pid> VM.classloader_stats to see classloader counts. Look for classloaders with high class counts that should have been unloaded. The fix is usually avoiding static references to classes from dynamic classloaders.

What is the sawtooth pattern and how does it help detect leaks?

Healthy JVM: heap rises during allocation, drops after GC, returns to the same baseline each time. Leaking JVM: the post-GC baseline rises over time. Monitor old gen usage after each full GC. If the baseline increases monotonically over hours, you have a leak. The post-GC baseline is the only metric that reveals a leak; peak usage is irrelevant.
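The distinction reduces to a trend check over post-GC samples. A simplified sketch; the tolerance parameter absorbs normal jitter, and real monitoring would use many more samples:

```java
import java.util.List;

public class BaselineTrend {

    // Flags a leak when every post-GC old gen sample exceeds the previous one
    // by more than a noise tolerance, over at least three samples.
    public static boolean looksLikeLeak(List<Long> postGcOldGenMB, long toleranceMB) {
        for (int i = 1; i < postGcOldGenMB.size(); i++) {
            if (postGcOldGenMB.get(i) <= postGcOldGenMB.get(i - 1) + toleranceMB) {
                return false; // baseline returned toward a previous level: healthy sawtooth
            }
        }
        return postGcOldGenMB.size() >= 3;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeLeak(List.of(400L, 460L, 525L, 590L), 10)); // true: rising baseline
        System.out.println(looksLikeLeak(List.of(400L, 405L, 398L, 402L), 10)); // false: healthy sawtooth
    }
}
```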

How do I tune GC pause times?

First, determine whether pauses are actually a problem: measure p99 latency. If GC pauses exceed your SLA, options: (1) tune G1GC with -XX:MaxGCPauseMillis and -XX:InitiatingHeapOccupancyPercent, (2) switch to ZGC for sub-millisecond pauses, (3) reduce allocation rate to decrease GC frequency, (4) increase heap to give GC more room. Always enable GC logging to measure the impact of changes.

How do I handle memory in a high-throughput service that allocates a lot of short-lived objects?

Ensure young gen is large enough to hold the working set of short-lived objects. In G1GC, this is automatic. In Parallel GC, tune -XX:NewRatio. Consider object pooling for frequently allocated large objects (but benchmark first: pooling adds complexity and can cause leaks). The most effective optimization is reducing allocation rate: reuse StringBuilder, avoid autoboxing in loops, use primitive collections.
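The autoboxing point is easy to demonstrate: the two methods below compute the same sum, but the boxed version allocates a Long on every loop iteration. A minimal sketch:

```java
public class AllocationRate {

    // Allocates: sum is unboxed, added, and reboxed on every pass.
    static long boxedSum(int n) {
        Long sum = 0L;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    // Zero allocations in the loop body.
    static long primitiveSum(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) {
        // Same result, very different GC pressure at high call rates.
        System.out.println(boxedSum(1_000) == primitiveSum(1_000)); // true
    }
}
```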

What JVM flags are essential for production?

Non-negotiable: -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/log/jvm/, GC logging (-Xlog:gc*), -XX:+ExitOnOutOfMemoryError. Recommended: -XX:MaxRAMPercentage=70.0 (container), -XX:MaxMetaspaceSize=256m, JFR continuous recording. These flags turn production incidents from guesswork into diagnosis.

What are the five essential debug commands for a JVM memory incident?
  1. jcmd <pid> VM.native_memory summary: shows where all JVM memory is going (requires starting the JVM with -XX:NativeMemoryTracking=summary).
  2. jmap -histo:live <pid> | head -30: shows the top 30 object types by count and size.
  3. jstat -gcutil <pid> 1000: shows GC activity in real time, sampled every second.
  4. jcmd <pid> GC.heap_dump <file>: full heap dump for MAT analysis.
  5. jstack <pid>: thread dump for StackOverflowError and ThreadLocal leaks.
  Run in this order: fast commands first.
Naren · Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.

← Previous: JVM Memory Model | Next →: JVM GC Tuning Guide: G1, ZGC, Shenandoah Explained with Real Trade-offs
Forged with 🔥 at TheCodeForge.io · Where Developers Are Forged