JVM Memory Issues in Production: Debugging Guide (OOM, GC, Leaks)
- Five OOM types, five different causes, five different diagnostics. Match the error message to the correct tool before debugging.
- Two heap dumps 30-60 minutes apart reveal leaks that a single dump cannot. Compare histograms to find the fastest-growing object type.
- GC tuning is about trade-offs: throughput vs latency vs memory. Batch = Parallel GC. Real-time = ZGC. Everything else = G1GC.
- OOM errors come in 5 types: Heap, Metaspace, Direct Memory, Stack Overflow, and GC Overhead. Each has a different root cause and fix.
- Always capture heap dumps on OOM with -XX:+HeapDumpOnOutOfMemoryError. Without it, you are guessing.
- GC pauses above 200ms in latency-sensitive services indicate tuning problems. Switch collectors or adjust generation ratios.
- Memory leaks show as a sawtooth whose troughs keep climbing: usage never returns to baseline after GC. Analyze the dominator tree in a heap dump to find the root object.
- Metaspace OOM usually means classloader leaks in hot-redeploy environments, not insufficient Metaspace size.
- Production rule: set -Xmx to 70-75% of container memory. The remaining 25-30% covers off-heap, thread stacks, and OS page cache.
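The 70-75% rule is a one-line calculation. A minimal sketch (class and method names are illustrative, not from any standard API):

```java
public class HeapSizing {

    // Recommended max heap in MB for a given container limit in MB,
    // using the 75% ratio from the rule above; the remaining 25%
    // covers off-heap, thread stacks, and OS page cache.
    public static long recommendedHeapMb(long containerLimitMb) {
        return (long) (containerLimitMb * 0.75);
    }

    public static void main(String[] args) {
        long container = 4096; // 4 GiB container limit
        System.out.println("-Xmx" + recommendedHeapMb(container) + "m"); // -Xmx3072m
    }
}
```

Note the inverse relationship: the table later in this guide recommends sizing the container to ~1.43x the heap, which is the same ratio (1 / 0.70) read in the other direction.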
Quick Debug Cheat Sheet

Pod killed (exit code 137) - no JVM error in logs
  kubectl describe pod <pod> | grep -A5 "Last State"
  kubectl top pod <pod> --containers

java.lang.OutOfMemoryError: Java heap space
  jmap -histo:live <pid> | head -30
  jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>

Response times spiking - service is slow but not crashed
  jstat -gcutil <pid> 1000
  tail -100 /var/log/jvm/gc.log

CPU at 100% - service is thrashing
  jstat -gcutil <pid> 1000
  top -Hp <pid>

java.lang.OutOfMemoryError: Metaspace
  jcmd <pid> VM.classloader_stats
  jcmd <pid> GC.class_stats

java.lang.OutOfMemoryError: Direct buffer memory
  jcmd <pid> VM.native_memory summary
  Check the -XX:MaxDirectMemorySize setting

java.lang.StackOverflowError
  jstack <pid> | grep -A 50 "REPEATING_METHOD"
  Check stack trace for repeating method signatures

Service runs for hours then OOM - slow leak
  jstat -gcutil <pid> 60000   (monitor every minute for 30 min)
  jmap -histo:live <pid> > /tmp/histo1.txt   (repeat after 1 hour)

Production Debug Guide
Symptom-to-action guide for the memory issues you will actually encounter at 2 AM
JVM memory failures are the most common cause of unplanned downtime in Java-based production systems. An OOM kill at 2 AM takes down the service, triggers alerts, and forces on-call engineers to diagnose under pressure.
Most OOM errors are preventable. The JVM provides extensive diagnostics β heap dumps, GC logs, JFR recordings β but teams rarely configure them before the incident. By the time the OOM fires, the evidence is already gone unless you captured it proactively.
This guide covers the five OOM types, GC tuning trade-offs, memory leak detection patterns, and the production configurations that prevent most memory-related outages. Every pattern comes from systems running at scale β not textbook examples.
Start with the Quick Debug Cheat Sheet above if you are actively debugging an incident. Use the sections below for deep understanding and prevention.
Production Debugging Quick Map β Symptom to Tool
When a memory incident fires, you need to go from symptom to correct diagnostic tool in seconds. This map is designed to be printed and taped to your monitor.
The key insight: each symptom points to a specific memory region and a specific tool. Using the wrong tool wastes hours. A heap dump does not help with direct memory OOM. GC logs do not help with stack overflow. Match the symptom to the tool.
Decision flow:
1. Read the error message or symptom.
2. Find the matching row in the table below.
3. Run the diagnostic command.
4. Apply the fix.

Severity triage:
- Service crashed (OOM) → critical → capture diagnostics immediately
- Service degraded (slow) → high → capture diagnostics within 15 minutes
- Service trending toward OOM → medium → schedule diagnostics within 1 hour
- No symptoms, proactive check → low → run diagnostics during maintenance window
The table below covers the 12 most common production memory scenarios. Each row maps symptom β what to check β which tool β immediate action. This is the scan-first view β use it before reading any section in detail.
Production insight: the most time-consuming part of memory debugging is choosing the right tool. Engineers waste hours running jmap when they should be reading GC logs, or analyzing heap dumps when the issue is off-heap. This table eliminates that wasted time by mapping symptoms directly to tools.
SYMPTOM                          | WHAT TO CHECK                  | TOOL                           | IMMEDIATE ACTION
---------------------------------+--------------------------------+--------------------------------+-------------------------------------------
Exit code 137 (no JVM error)     | Container memory vs heap       | kubectl top + jcmd VM.native   | Increase container to 1.43x heap
OOM: Java heap space             | Heap contents (what is big?)   | jmap -histo + Eclipse MAT      | Find leak via dominator tree
OOM: Metaspace                   | Classloader count              | jcmd VM.classloader_stats      | Restart JVM, fix classloader leak
OOM: Direct buffer memory        | Direct buffer allocation       | jcmd VM.native_memory summary  | Fix buffer leak, set MaxDirectMemorySize
OOM: GC overhead limit           | GC frequency + old gen usage   | jstat -gcutil + GC logs        | Fix memory leak (GC cannot free enough)
StackOverflowError               | Call stack depth               | jstack <pid>                   | Convert recursion to iteration
Latency spikes (no OOM)          | GC pause times                 | GC logs (-Xlog:gc)             | Tune GC or switch collector
CPU high + slow response         | GC time percentage             | jstat -gcutil <pid> 1s         | If GC time > 5%, fix leak or increase heap
Memory grows over hours          | Old gen trend (post-GC)        | jstat -gcutil + jmap -histo    | Compare histograms, find growing types
OOM only at high traffic         | Allocation rate                | JFR (settings=profile)         | Reduce allocation rate or increase heap
OOM only in production           | Object count comparison        | jmap -histo (prod vs staging)  | Find data-dependent leak
OOM after code deploy            | Code diff (new caches/threads) | git diff + heap dump           | Check for removed eviction logic
- Error message contains 'heap space' → heap region → jmap + MAT → leak or undersized heap.
- Error message contains 'Metaspace' → class metadata → jcmd classloader_stats → classloader leak.
- Error message contains 'Direct buffer' → off-heap NIO → jcmd native_memory → buffer leak.
- Error message contains 'GC overhead' → GC cannot free → heap dump → memory leak confirmed.
- No error message, just exit code 137 → container limit → kubectl top → off-heap exceeded container limit.
- No crash, just slow → GC pauses → GC logs → collector tuning or leak.
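The routing rules above can be sketched as a simple dispatcher. This is an illustrative sketch (the class name and return strings are made up for this example, not a library API):

```java
public class OomRouter {

    // Maps an OOM error message (or its absence) to the first diagnostic
    // tool to reach for, following the routing rules in this guide.
    public static String route(String errorMessage) {
        if (errorMessage == null) {
            // No JVM error at all: likely the container OOM killer (exit 137)
            return "kubectl top (check container limit)";
        }
        if (errorMessage.contains("heap space"))   return "jmap + MAT";
        if (errorMessage.contains("Metaspace"))    return "jcmd VM.classloader_stats";
        if (errorMessage.contains("Direct buffer")) return "jcmd VM.native_memory summary";
        if (errorMessage.contains("GC overhead"))  return "heap dump (leak confirmed)";
        if (errorMessage.contains("StackOverflowError")) return "jstack";
        // No crash, just slow: start from the GC logs
        return "GC logs";
    }
}
```

The point is not the code itself but the discipline it encodes: the error message alone determines the first tool, before any speculation about the cause.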
Essential JVM Debug Commands β Complete Reference
Every production JVM memory incident requires specific commands. This section is the complete reference β categorized by tool, with exact syntax and what to look for in the output.
These commands assume JDK 11+ syntax. For JDK 8, some flags differ (noted where applicable).
Critical rule: always run diagnostic commands as the same user that owns the JVM process. In containers, exec into the container: kubectl exec -it <pod> -- /bin/bash.
jcmd β the Swiss Army knife. Replaces jinfo, jmap, jstack, and jstat for most operations. Available on all JDK 11+ installations. One tool, many functions.
jmap β heap dump and histogram. The primary tool for heap analysis. jmap -histo:live forces a full GC before counting, showing only live objects. jmap -dump:live creates a heap dump file for MAT analysis.
jstat β real-time GC monitoring. Shows GC activity in real-time without stopping the JVM. The -gcutil flag shows usage percentages for each generation. Run with 1-second interval for live debugging.
jstack β thread dump. Shows all threads and their stack traces. Essential for StackOverflowError and thread-related memory issues (ThreadLocal accumulation).
JFR β Java Flight Recorder. Low-overhead continuous profiling. Captures allocation patterns, GC events, and lock contention. Can run in production with <2% overhead.
Production insight: the most commonly confused commands are jmap -histo (object counts, fast) and jmap -dump (full heap dump, slow, pauses JVM). Use -histo first to get a quick overview. Only use -dump when you need the full object graph for MAT analysis. Dumping a 16GB heap pauses the JVM for 10-30 seconds.
Edge case: in Kubernetes, the JVM process PID is usually 1 (the container entrypoint). If your container runs a wrapper script, the JVM PID may be different. Use ps aux | grep java to find the actual PID. Some commands require JAVA_HOME to be set β verify with echo $JAVA_HOME before running.
#!/bin/bash
# ============================================================
# JVM Debug Commands - Production Reference
# Run from inside the container or on the host with JVM access
# ============================================================

PID=$(pgrep -f 'java.*-Xmx')   # Find JVM PID

# ============================================================
# JCMD - Swiss Army Knife (JDK 11+)
# ============================================================

# List all JVM processes
jcmd

# JVM summary (uptime, arguments, heap config)
jcmd $PID VM.info

# Native memory breakdown (heap, thread, class, GC, direct)
jcmd $PID VM.native_memory summary
jcmd $PID VM.native_memory summary.diff   # Since last baseline
jcmd $PID VM.native_memory baseline       # Set baseline for diff

# Classloader statistics (class count, classloader count)
jcmd $PID VM.classloader_stats

# GC class statistics (instance count and size by class)
jcmd $PID GC.class_stats | head -20

# Force full GC
jcmd $PID GC.run

# Print all VM flags
jcmd $PID VM.flags -all | grep -E '(HeapDump|GC|Metaspace|DirectMemory|ThreadStackSize)'

# Print command line flags (shows effective GC settings)
jcmd $PID VM.command_line

# Thread dump (replaces jstack)
jcmd $PID Thread.print

# Heap dump
jcmd $PID GC.heap_dump /tmp/heap.hprof

# Heap histogram (live objects only, forces GC)
jcmd $PID GC.class_histogram | head -30

# JFR: start recording
jcmd $PID JFR.start name=debug settings=profile maxsize=100M maxage=1h

# JFR: dump recording
jcmd $PID JFR.dump name=debug filename=/tmp/recording.jfr

# JFR: stop recording
jcmd $PID JFR.stop name=debug

# ============================================================
# JMAP - Heap Dump and Histogram
# ============================================================

# Histogram of live objects (top 30 by count)
jmap -histo:live $PID | head -30

# Histogram of all objects (including unreachable - faster, no GC)
jmap -histo $PID | head -30

# Full heap dump (live objects only - forces GC first)
jmap -dump:live,format=b,file=/tmp/heap.hprof $PID

# Full heap dump (all objects - faster but larger file)
jmap -dump:format=b,file=/tmp/heap_all.hprof $PID

# ============================================================
# JSTAT - Real-Time GC Monitoring
# ============================================================

# GC utilization every 1 second, 10 samples
jstat -gcutil $PID 1000 10

# Output columns:
#   S0   - Survivor 0 usage %
#   S1   - Survivor 1 usage %
#   E    - Eden usage %
#   O    - Old gen usage % - KEY METRIC for leak detection
#   M    - Metaspace usage %
#   CCS  - Compressed class space usage %
#   YGC  - Young GC count
#   YGCT - Young GC total time (seconds)
#   FGC  - Full GC count - SHOULD BE 0 in healthy service
#   FGCT - Full GC total time (seconds)
#   GCT  - Total GC time (seconds)

# Key diagnostics:
#   If O (old gen) keeps growing after GC -> memory leak
#   If FGC > 0 and increasing -> old gen pressure
#   If GCT/uptime > 5% -> GC overhead problem

# ============================================================
# JSTACK - Thread Dump
# ============================================================

# Full thread dump
jstack $PID > /tmp/threads.txt

# Thread dump with lock information
jstack -l $PID > /tmp/threads_locked.txt

# Count threads by state (useful for thread leak detection)
jstack $PID | grep "java.lang.Thread.State" | sort | uniq -c | sort -rn

# ============================================================
# KUBERNETES / CONTAINER COMMANDS
# ============================================================

# Pod memory usage
kubectl top pod <pod-name> --containers

# Pod memory limits and usage
kubectl describe pod <pod-name> | grep -A 10 "Limits\|Requests"

# Container OOM kill events
kubectl get events --field-selector reason=OOMKilling

# Exec into running container
kubectl exec -it <pod-name> -- /bin/bash

# Check container memory limit from inside the container
cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # cgroup v1
cat /sys/fs/cgroup/memory.max                     # cgroup v2

# Check container memory usage from inside the container
cat /sys/fs/cgroup/memory/memory.usage_in_bytes   # cgroup v1
cat /sys/fs/cgroup/memory.current                 # cgroup v2

# ============================================================
# QUICK DIAGNOSTIC SEQUENCE (run this for any OOM)
# ============================================================

echo "=== Quick JVM Memory Diagnostic ==="
echo "PID: $PID"
echo ""
echo "--- 1. JVM Flags ---"
jcmd $PID VM.flags -all | grep -E '(MaxHeap|MaxMetaspace|MaxDirect|ThreadStack|GC)'
echo ""
echo "--- 2. Native Memory Summary ---"
jcmd $PID VM.native_memory summary
echo ""
echo "--- 3. Heap Histogram (top 15) ---"
jmap -histo:live $PID | head -15
echo ""
echo "--- 4. GC Status ---"
jstat -gcutil $PID 1000 5
echo ""
echo "--- 5. Thread Count ---"
jcmd $PID Thread.print | grep "java.lang.Thread.State" | wc -l
echo ""
echo "=== Diagnostic Complete ==="
- jcmd $PID VM.native_memory summary - shows where all JVM memory is going (heap, threads, metaspace, direct).
- jmap -histo:live $PID | head -30 - shows top 30 object types by count and size. Fast, no heap dump needed.
- jstat -gcutil $PID 1000 - shows GC activity in real-time. Old gen growing = leak. Full GC count rising = pressure.
- jcmd $PID GC.heap_dump /tmp/heap.hprof - full heap dump for MAT analysis. Pauses the JVM - use only when needed.
- jstack $PID - thread dump for StackOverflowError and ThreadLocal leak detection.
Understanding the Five OOM Types
Most developers treat OOM as a single error. It is not. The JVM has five distinct OOM conditions, each with different causes, diagnostics, and fixes. Treating them interchangeably leads to misdiagnosis.
Java heap space - the most common. The heap (young gen + old gen) is full and GC cannot free enough space. Almost always a memory leak or undersized heap.

Metaspace - class metadata storage is full. Common in hot-redeploy environments where classloaders accumulate. Rarely a sizing issue - almost always a classloader leak.

Direct buffer memory - off-heap NIO buffer allocation failed. Common in Netty, gRPC, and NIO-based services. Usually a buffer leak or insufficient MaxDirectMemorySize.

GC overhead limit exceeded - GC is running continuously and recovering almost nothing. The JVM's way of saying 'I tried GC, it did not help, you have a leak.' This is a leak indicator, not a sizing issue.

Stack overflow - a thread's call stack exceeded -Xss. Not a memory leak - it is a recursion depth problem. But it manifests as an OOM in monitoring.
The critical insight: each OOM type requires a different diagnostic approach. A heap dump does not help with Metaspace OOM. Increasing -Xmx does not fix direct buffer memory OOM. Matching the OOM type to the correct diagnostic tool is the first step.
Production edge case: some OOM types are caught by the JVM (heap space, metaspace), while others kill the process externally. Container OOM killer (exit code 137) bypasses the JVM entirely β no heap dump, no error message, just a dead process. This is why container memory limits must account for off-heap usage.
Performance implication: each OOM type has different latency characteristics. Heap OOM causes gradual degradation (GC pauses increase). Metaspace OOM is sudden (class loading fails). Direct memory OOM is sudden (buffer allocation fails). Stack overflow is immediate (thread dies). Understanding the failure mode helps you detect it earlier.
package io.thecodeforge.monitoring;

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.HashMap;
import java.util.Map;

/**
 * OOM Type Detector - identifies which memory region is at risk
 * before an OOM occurs.
 */
public class OomTypeDetector {

    private static final double HEAP_WARNING_THRESHOLD = 0.80;
    private static final double HEAP_CRITICAL_THRESHOLD = 0.90;
    private static final double METASPACE_WARNING_THRESHOLD = 0.80;

    public enum RiskLevel { HEALTHY, WARNING, CRITICAL, IMMINENT }

    public enum OomType {
        HEAP_SPACE, METASPACE, DIRECT_BUFFER,
        GC_OVERHEAD, STACK_OVERFLOW, CONTAINER_LIMIT
    }

    public static class MemoryRiskReport {
        public RiskLevel heapRisk;
        public RiskLevel metaspaceRisk;
        public RiskLevel gcOverheadRisk;
        public Map<OomType, String> recommendations;
        public long heapUsedMB;
        public long heapMaxMB;
        public long metaspaceUsedMB;
        public long metaspaceMaxMB;
        public double gcTimePercent;

        public MemoryRiskReport() {
            recommendations = new HashMap<>();
        }
    }

    public static MemoryRiskReport analyze() {
        MemoryRiskReport report = new MemoryRiskReport();
        MemoryMXBean memBean = ManagementFactory.getMemoryMXBean();

        // Heap analysis
        MemoryUsage heapUsage = memBean.getHeapMemoryUsage();
        report.heapUsedMB = heapUsage.getUsed() / (1024 * 1024);
        report.heapMaxMB = heapUsage.getMax() / (1024 * 1024);
        double heapPercent = (double) heapUsage.getUsed() / heapUsage.getMax();

        if (heapPercent >= HEAP_CRITICAL_THRESHOLD) {
            report.heapRisk = RiskLevel.CRITICAL;
            report.recommendations.put(OomType.HEAP_SPACE,
                "Heap at " + (int) (heapPercent * 100)
                + "% - capture heap dump and analyze dominator tree.");
        } else if (heapPercent >= HEAP_WARNING_THRESHOLD) {
            report.heapRisk = RiskLevel.WARNING;
            report.recommendations.put(OomType.HEAP_SPACE,
                "Heap at " + (int) (heapPercent * 100) + "% - monitor growth rate.");
        } else {
            report.heapRisk = RiskLevel.HEALTHY;
        }

        // Metaspace analysis
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Metaspace")) {
                MemoryUsage usage = pool.getUsage();
                report.metaspaceUsedMB = usage.getUsed() / (1024 * 1024);
                report.metaspaceMaxMB = usage.getMax() > 0
                    ? usage.getMax() / (1024 * 1024) : -1;
                if (report.metaspaceMaxMB > 0) {
                    double metaPercent = (double) usage.getUsed() / usage.getMax();
                    if (metaPercent >= METASPACE_WARNING_THRESHOLD) {
                        report.metaspaceRisk = RiskLevel.WARNING;
                        report.recommendations.put(OomType.METASPACE,
                            "Metaspace at " + (int) (metaPercent * 100)
                            + "% - check for classloader leaks.");
                    } else {
                        report.metaspaceRisk = RiskLevel.HEALTHY;
                    }
                }
            }
        }

        // GC overhead analysis
        long totalGcTimeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            totalGcTimeMs += gc.getCollectionTime();
        }
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        report.gcTimePercent = (double) totalGcTimeMs / uptimeMs * 100;

        if (report.gcTimePercent > 5.0) {
            report.gcOverheadRisk = RiskLevel.CRITICAL;
            report.recommendations.put(OomType.GC_OVERHEAD,
                "GC consuming " + String.format("%.1f", report.gcTimePercent)
                + "% of uptime - likely memory leak. Capture heap dump.");
        } else if (report.gcTimePercent > 2.0) {
            report.gcOverheadRisk = RiskLevel.WARNING;
        } else {
            report.gcOverheadRisk = RiskLevel.HEALTHY;
        }

        return report;
    }
}
- Heap space: heap dump (jmap, -XX:+HeapDumpOnOutOfMemoryError). Look at dominator tree for leak suspects.
- Metaspace: classloader analysis (jcmd VM.classloader_stats). Look for classloaders with high class count that should have been unloaded.
- Direct buffer: NativeMemoryTracking (-XX:NativeMemoryTracking=detail, jcmd VM.native_memory). Look for buffer allocation without corresponding release.
- GC overhead: heap dump + GC log analysis. The leak is in old gen β look for objects that survive full GC.
- Stack overflow: thread dump (jstack). Look for repeating method signatures indicating infinite recursion.
Heap Dump Analysis: Finding the Leak
A heap dump is a snapshot of every object in the JVM heap at a point in time. It is the single most important diagnostic artifact for heap OOM. Without it, you are guessing. With it, you can identify the exact object, its reference chain to GC root, and its retained size.
The key concept is the dominator tree. In a heap dump, object A dominates object B if every path from GC roots to B goes through A. The dominator tree shows which objects retain the most memory. The top entries in the dominator tree are your leak suspects.
Eclipse MAT (Memory Analyzer Tool) is the standard tool for heap dump analysis. The three reports that matter most: Leak Suspects Report (automated analysis), Dominator Tree (manual exploration), and Histogram (object count by type).
The Leak Suspects Report is the starting point. It identifies objects with unusually high retained size and shows the reference chain from GC root. If the report identifies a single suspect consuming 60%+ of heap, you have found the leak.
But the automated report does not always find the leak. Some leaks are distributed β no single object dominates, but thousands of small objects accumulate. In this case, use the Histogram to find object types with unexpectedly high counts. Compare with a second heap dump taken 1 hour later. The type with the fastest-growing count is the leak source.
Production insight: always take at least two heap dumps, 30-60 minutes apart. A single dump shows the current state. Two dumps show the trend. The trend is what reveals leaks.
Heap dump caveat: taking a heap dump pauses the JVM (full stop-the-world) for the duration of the dump. For a 4GB heap, this can be 10-30 seconds. For a 32GB heap, it can be several minutes. Never take a heap dump on a production system during peak traffic without understanding the pause impact. Use jmap -dump:live,format=b,file=heap.hprof <pid> to force a full GC first and capture only live objects, reducing dump size.
Alternative for large heaps: use JFR allocation profiling (-XX:StartFlightRecording=settings=profile) to capture allocation patterns without a full heap dump. JFR adds less than 2% overhead and can run continuously in production. It does not show object graphs, but it shows which code is allocating the most memory.
Performance trade-off: heap dump pause time is proportional to live object count, not heap size. A 16GB heap with 2GB live objects dumps faster than an 8GB heap with 6GB live objects. Use -XX:+HeapDumpOnOutOfMemoryError (auto-dump on OOM) and -XX:HeapDumpPath=/var/log/jvm/ to ensure dumps are captured even during unattended failures.
package io.thecodeforge.diagnostics;

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Proactive heap monitor that captures dumps when memory
 * growth rate indicates a leak - before OOM occurs.
 */
public class ProactiveHeapMonitor {

    private final ScheduledExecutorService scheduler;
    private final List<Snapshot> history;
    private final long maxHeapMB;
    private final double growthRateThresholdMBPerHour;
    private final String dumpDirectory;

    public ProactiveHeapMonitor(long maxHeapMB,
                                double growthRateThresholdMBPerHour,
                                String dumpDirectory) {
        this.maxHeapMB = maxHeapMB;
        this.growthRateThresholdMBPerHour = growthRateThresholdMBPerHour;
        this.dumpDirectory = dumpDirectory;
        this.history = new ArrayList<>();
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "heap-monitor");
            t.setDaemon(true);
            return t;
        });
    }

    public void start(long intervalSeconds) {
        scheduler.scheduleAtFixedRate(
            this::checkMemory, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    private void checkMemory() {
        MemoryMXBean memBean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heapUsage = memBean.getHeapMemoryUsage();
        long usedMB = heapUsage.getUsed() / (1024 * 1024);
        Instant now = Instant.now();
        history.add(new Snapshot(now, usedMB));

        // Keep only the last 24 hours of snapshots
        Instant cutoff = now.minusSeconds(86400);
        history.removeIf(s -> s.timestamp.isBefore(cutoff));

        // Check absolute threshold
        double usagePercent = (double) usedMB / maxHeapMB;
        if (usagePercent > 0.85) {
            logWarning("Heap usage at " + (int) (usagePercent * 100) + "% ("
                + usedMB + "MB / " + maxHeapMB + "MB)");
            if (usagePercent > 0.90) {
                captureHeapDump("high-usage-" + now.getEpochSecond());
            }
        }

        // Check growth rate (leak detection)
        if (history.size() >= 2) {
            Snapshot oldest = history.get(0);
            Snapshot newest = history.get(history.size() - 1);
            double hoursElapsed =
                (newest.timestamp.toEpochMilli() - oldest.timestamp.toEpochMilli())
                / 3_600_000.0;
            if (hoursElapsed > 0.5) {
                double growthRateMBPerHour =
                    (newest.usedMB - oldest.usedMB) / hoursElapsed;
                if (growthRateMBPerHour > growthRateThresholdMBPerHour) {
                    logWarning("Heap growth rate: " + growthRateMBPerHour
                        + " MB/hour - possible leak");
                    captureHeapDump("leak-suspect-" + now.getEpochSecond());
                }
            }
        }
    }

    private void captureHeapDump(String label) {
        String filename = dumpDirectory + "/heap-" + label + ".hprof";
        try {
            String pid = ManagementFactory.getRuntimeMXBean().getName().split("@")[0];
            ProcessBuilder pb = new ProcessBuilder(
                "jmap", "-dump:live,format=b,file=" + filename, pid);
            pb.redirectErrorStream(true);
            Process p = pb.start();
            int exitCode = p.waitFor();
            if (exitCode == 0) {
                logWarning("Heap dump captured: " + filename);
            } else {
                logWarning("Heap dump failed with exit code: " + exitCode);
            }
        } catch (Exception e) {
            logWarning("Heap dump failed: " + e.getMessage());
        }
    }

    private void logWarning(String message) {
        System.err.println("[HeapMonitor] " + Instant.now() + " " + message);
    }

    private static class Snapshot {
        final Instant timestamp;
        final long usedMB;

        Snapshot(Instant timestamp, long usedMB) {
            this.timestamp = timestamp;
            this.usedMB = usedMB;
        }
    }
}
- Single dump: shows what is in the heap now. Useful for finding large objects. Cannot distinguish leak from legitimate usage.
- Two dumps: shows what is growing. The object type with the fastest-growing count is the leak source.
- Dominator tree: shows which objects retain the most memory. Top entries are leak suspects.
- Leak Suspects Report: automated MAT analysis. Good starting point. Fails on distributed leaks (many small objects).
- Histogram comparison: export histograms from both dumps, diff them. The type with the largest count increase is the leak.
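Histogram comparison is mechanical enough to script. A minimal sketch, assuming jmap -histo's row format (rank:, instance count, bytes, class name); the class and method names here are illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HistogramDiff {

    // Matches a jmap -histo data row, e.g. "   1:    1000    24000  java.lang.String"
    private static final Pattern ROW =
        Pattern.compile("\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)");

    // Parses histogram text into class name -> instance count.
    public static Map<String, Long> parse(String histo) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : histo.split("\n")) {
            Matcher m = ROW.matcher(line);
            if (m.matches()) {
                counts.put(m.group(3), Long.parseLong(m.group(1)));
            }
        }
        return counts;
    }

    // Returns classes ordered by instance-count growth between two histograms.
    // Classes that only shrank or disappeared are not the leak, so entries
    // present only in 'before' are ignored.
    public static List<Map.Entry<String, Long>> fastestGrowing(
            Map<String, Long> before, Map<String, Long> after) {
        Map<String, Long> growth = new HashMap<>();
        after.forEach((cls, n) -> growth.put(cls, n - before.getOrDefault(cls, 0L)));
        List<Map.Entry<String, Long>> result = new ArrayList<>(growth.entrySet());
        result.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return result;
    }
}
```

Feed it the two files from the cheat sheet (/tmp/histo1.txt and the one taken an hour later); the top entry of fastestGrowing is the first leak suspect to chase in MAT.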
GC Tuning: Collector Selection and Parameter Optimization
GC tuning is about trade-offs: throughput vs latency, pause time vs frequency, memory efficiency vs allocation speed. There is no universal best setting β the right configuration depends on your workload profile.
The four production GC collectors:
G1GC (default since JDK 9): balanced throughput and latency. Good default for most services. Tuning targets: -XX:MaxGCPauseMillis (default 200ms), -XX:G1HeapRegionSize, -XX:InitiatingHeapOccupancyPercent.
ZGC (JDK 15+): sub-millisecond pause times regardless of heap size. Best for latency-sensitive services (trading, real-time). Trade-off: slightly lower throughput, higher CPU usage for concurrent GC threads.
Shenandoah (JDK 12+): similar to ZGC β low pause times, concurrent compaction. Trade-off: same as ZGC. Choose based on JDK vendor support.
Parallel GC: highest throughput, longest pauses. Best for batch processing where latency does not matter. Not recommended for interactive services.
The most common GC tuning mistake: switching collectors without understanding the workload. A team switched from G1GC to ZGC because they read it was 'faster.' Their service was a batch ETL pipeline that did not care about pause times. ZGC's extra CPU overhead reduced throughput by 8% for zero benefit.
Rule of thumb: if your service is latency-sensitive (p99 < 100ms), use ZGC or Shenandoah. If throughput matters more than latency, use Parallel GC. For everything else, G1GC is the right default.
Humongous allocations are a G1GC-specific problem. Objects larger than 50% of a G1 region are classified as humongous (region size defaults to roughly heap size / 2048, clamped between 1MB and 32MB). They are allocated in contiguous regions and only reclaimed during full GC. If your service allocates many large byte arrays or StringBuilders, humongous allocations cause premature old gen promotion and full GC storms.
Fix: increase -XX:G1HeapRegionSize to reduce humongous threshold, or refactor code to avoid large contiguous allocations. Check GC logs for 'Humongous allocation' lines.
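The threshold is simple arithmetic. A minimal sketch (illustrative names; region size is passed in explicitly rather than read from the running JVM):

```java
public class HumongousCheck {

    // G1 classifies an allocation as humongous when it exceeds half a
    // region. Doubling -XX:G1HeapRegionSize therefore doubles the size
    // an object can reach before triggering the humongous path.
    public static boolean isHumongous(long allocationBytes, long regionBytes) {
        return allocationBytes > regionBytes / 2;
    }

    public static void main(String[] args) {
        long region = 2 * 1024 * 1024; // e.g. -XX:G1HeapRegionSize=2m
        // A 1.5MB byte[] exceeds the 1MB half-region threshold
        System.out.println(isHumongous(1_500_000, region));
    }
}
```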
GC log analysis is essential. Enable GC logging with -Xlog:gc*:file=gc.log:time,uptime,level,tags:filecount=5,filesize=100M (JDK 11+). Key metrics to monitor: GC pause duration (max, p99, p95), GC frequency (pauses per minute), allocation rate (MB/sec), promotion rate (young gen to old gen MB/sec), and old gen usage after GC.
Production insight: the most impactful GC parameter is often not the collector itself, but the heap size relative to live data. If your live data set is 2GB and your heap is 8GB, GC has plenty of room to work. If your live data set is 6GB and your heap is 8GB, GC is constantly under pressure. Right-sizing the heap matters more than collector selection.
Edge case: containerized JVMs with cgroup memory limits. Prior to JDK 10, the JVM did not respect cgroup limits and would set heap based on host memory. JDK 10+ respects cgroup limits. Always verify with java -XX:+PrintFlagsFinal -version | grep MaxHeapSize that the JVM sees the correct memory limit.
package io.thecodeforge.monitoring;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * GC Log Analyzer - parses GC logs and extracts key metrics
 * for production tuning decisions.
 */
public class GcLogAnalyzer {

    // JDK 11+ unified GC log format
    private static final Pattern GC_PAUSE_PATTERN = Pattern.compile(
        "\\[(?<timestamp>[\\d-T:.]+)\\]\\[(?<uptime>[\\d.]+)s\\]\\[(?<level>\\w+)\\]"
        + ".*GC\\((?<gcId>\\d+)\\) Pause (?<type>Young|Full|Mixed)"
        + ".*?(?<durationMs>[\\d.]+)ms");

    private static final Pattern HEAP_PATTERN = Pattern.compile(
        "(?<used>\\d+)K->(?<after>\\d+)K\\((?<total>\\d+)K\\)");

    public static class GcMetrics {
        public int totalGcPauses;
        public int youngGcCount;
        public int fullGcCount;
        public int mixedGcCount;
        public double maxPauseMs;
        public double p99PauseMs;
        public double p95PauseMs;
        public double avgPauseMs;
        public double totalPauseMs;
        public double gcTimePercent;
        public long maxHeapUsedKB;
        public long minHeapAfterGcKB;
        public List<Double> pauseTimes = new ArrayList<>();
    }

    public static GcMetrics analyze(String gcLogFile) throws IOException {
        GcMetrics metrics = new GcMetrics();
        try (BufferedReader reader = new BufferedReader(new FileReader(gcLogFile))) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher pauseMatcher = GC_PAUSE_PATTERN.matcher(line);
                if (pauseMatcher.find()) {
                    double duration =
                        Double.parseDouble(pauseMatcher.group("durationMs"));
                    String type = pauseMatcher.group("type");
                    metrics.totalGcPauses++;
                    metrics.pauseTimes.add(duration);
                    metrics.totalPauseMs += duration;
                    switch (type) {
                        case "Young": metrics.youngGcCount++; break;
                        case "Full":  metrics.fullGcCount++;  break;
                        case "Mixed": metrics.mixedGcCount++; break;
                    }
                    if (duration > metrics.maxPauseMs) {
                        metrics.maxPauseMs = duration;
                    }
                }
                Matcher heapMatcher = HEAP_PATTERN.matcher(line);
                if (heapMatcher.find()) {
                    long used = Long.parseLong(heapMatcher.group("used"));
                    long after = Long.parseLong(heapMatcher.group("after"));
                    if (used > metrics.maxHeapUsedKB) {
                        metrics.maxHeapUsedKB = used;
                    }
                    if (metrics.minHeapAfterGcKB == 0
                            || after < metrics.minHeapAfterGcKB) {
                        metrics.minHeapAfterGcKB = after;
                    }
                }
            }
        }

        // Calculate percentiles
        if (!metrics.pauseTimes.isEmpty()) {
            metrics.pauseTimes.sort(Double::compareTo);
            int size = metrics.pauseTimes.size();
            metrics.avgPauseMs = metrics.totalPauseMs / size;
            metrics.p95PauseMs = metrics.pauseTimes.get((int) (size * 0.95));
            metrics.p99PauseMs = metrics.pauseTimes.get((int) (size * 0.99));
        }
        return metrics;
    }

    public static String generateReport(GcMetrics m) {
        StringBuilder sb = new StringBuilder();
        sb.append("=== GC Analysis Report ===\n");
        sb.append("Total GC pauses: ").append(m.totalGcPauses).append("\n");
        sb.append("Young GC: ").append(m.youngGcCount).append("\n");
        sb.append("Full GC: ").append(m.fullGcCount).append("\n");
        sb.append("Mixed GC: ").append(m.mixedGcCount).append("\n");
        sb.append("Max pause: ").append(m.maxPauseMs).append(" ms\n");
        sb.append("P99 pause: ").append(m.p99PauseMs).append(" ms\n");
        sb.append("P95 pause: ").append(m.p95PauseMs).append(" ms\n");
        sb.append("Avg pause: ").append(String.format("%.2f", m.avgPauseMs)).append(" ms\n");
        sb.append("Max heap used: ").append(m.maxHeapUsedKB / 1024).append(" MB\n");
        sb.append("Min heap after GC: ").append(m.minHeapAfterGcKB / 1024).append(" MB\n");

        // Warnings
        if (m.fullGcCount > 0) {
            sb.append("WARNING: Full GC detected - investigate old gen pressure\n");
        }
        if (m.p99PauseMs > 200) {
            sb.append("WARNING: P99 pause > 200ms - consider ZGC or Shenandoah\n");
        }
        if (m.minHeapAfterGcKB > 0) {
            long liveDataMB = m.minHeapAfterGcKB / 1024;
            sb.append("INFO: Live data set ~").append(liveDataMB).append(" MB\n");
            sb.append("INFO: Recommended heap (2x live data): ")
              .append(liveDataMB * 2).append(" MB\n");
        }
        return sb.toString();
    }
}
- Throughput (Parallel GC): minimize time spent in GC relative to application work. Best for batch processing. Long pauses are acceptable.
- Latency (ZGC/Shenandoah): minimize individual GC pause times. Best for real-time services. Higher CPU overhead is acceptable.
- Memory efficiency (G1GC): balance between throughput and latency with moderate memory overhead. Best default for most services.
- Humongous allocations: objects larger than half the G1 region size are allocated directly in old gen and can trigger premature full GCs. Increase region size (-XX:G1HeapRegionSize) or refactor large allocations.
- Container awareness: JDK 10+ respects cgroup limits (backported to JDK 8u191); older JDK 8 builds ignore container memory limits entirely. Always verify with -XX:+PrintFlagsFinal.
Memory Leak Patterns and Detection
Memory leaks in Java are objects that are no longer needed but remain referenced, preventing garbage collection. Unlike C/C++ leaks (allocated memory that is never freed), Java leaks are reachable objects that should be unreachable.
The five most common leak patterns in production:
Unbounded collections: Maps, Lists, or Sets that grow without limit. The #1 cause of heap OOM. Fix: use bounded caches (Caffeine, Guava) with TTL and maximumSize.
Listener/callback registration without deregistration: registering event listeners that hold references to the subscriber object. When the subscriber should be GC'd, the listener reference keeps it alive. Fix: always deregister in close()/destroy() methods.
ThreadLocal without cleanup: ThreadLocal values persist for the lifetime of the thread. In thread pools, threads live effectively forever, so ThreadLocal values accumulate indefinitely. Fix: call threadLocal.remove() in a finally block after use.
ClassLoader leaks: in hot-redeploy environments, old classloaders remain referenced by static fields or thread-locals. The classloader cannot be GC'd, and neither can any class it loaded. Fix: avoid static references to classes from dynamic classloaders. Use WeakReference or ServiceLoader patterns.
String.intern() abuse: String.intern() stores strings in the string pool (PermGen before JDK 7, heap from JDK 7 onward). Interning user-generated strings creates an unbounded pool. Fix: never intern user input. Use a bounded cache with eviction instead.
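Two of the fixes above can be sketched with plain JDK types. This is a minimal illustration, not a library API: the class and method names (LeakFixes, boundedLruCache, withThreadLocal) are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LeakFixes {

    /** Fix for pattern 1 without a cache library: an LRU map that
     *  evicts its eldest entry once maxSize is exceeded. */
    public static <K, V> Map<K, V> boundedLruCache(int maxSize) {
        // accessOrder=true makes iteration order least-recently-accessed first
        return new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;
            }
        };
    }

    /** Fix for pattern 3: pair every ThreadLocal.set() with remove()
     *  in a finally block, so pooled threads do not accumulate values. */
    private static final ThreadLocal<StringBuilder> BUFFER = new ThreadLocal<>();

    public static String withThreadLocal(String input) {
        BUFFER.set(new StringBuilder());
        try {
            return BUFFER.get().append(input).reverse().toString();
        } finally {
            BUFFER.remove(); // prevents the leak in thread pool environments
        }
    }
}
```

The LinkedHashMap approach works for simple cases; for concurrent access or TTL-based eviction, a purpose-built cache (Caffeine, Guava) remains the better choice.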
Detection strategy: the sawtooth test. Monitor heap usage over time. A healthy JVM shows a sawtooth pattern: heap rises during allocation, drops after GC, and returns to the same baseline. A leak shows the same sawtooth, but the baseline after GC increases over time. The post-GC baseline is the key metric.
Production tool: Java Flight Recorder (JFR) with allocation profiling. JFR records every significant allocation with the call stack. Enable with -XX:StartFlightRecording=settings=profile,duration=60s,filename=alloc.jfr. Analyze with JDK Mission Control (JMC): the 'Allocation by Thread' and 'Allocation by Class' views show where memory is being allocated.
Edge case: soft reference accumulation. The JVM collects SoftReferences only when heap pressure is high. If your cache uses SoftReferences, it will consume all available heap before releasing entries. This is by design, but it makes heap appear full even when it is not leaking. Switch to WeakReference or use a proper cache library with size-based eviction.
Performance consideration: leak detection tools (JFR, MAT) add overhead. JFR adds <2% CPU overhead and can run continuously. MAT works on a heap dump, and capturing one pauses the JVM. Use JFR for continuous monitoring and MAT for post-mortem analysis.
```java
package io.thecodeforge.diagnostics;

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Memory Leak Detector - monitors old gen growth rate
 * to detect leaks before OOM occurs.
 *
 * Core insight: a leak shows as increasing old gen usage
 * after each full GC. The post-GC baseline is the key metric.
 */
public class MemoryLeakDetector {

    private final ScheduledExecutorService scheduler;
    private final List<OldGenSnapshot> snapshots;
    private final double alertThresholdMBPerHour;
    private final LeakAlertHandler alertHandler;

    public interface LeakAlertHandler {
        void onLeakDetected(double growthRateMBPerHour,
                            long currentOldGenMB,
                            String recommendation);
    }

    public MemoryLeakDetector(double alertThresholdMBPerHour,
                              LeakAlertHandler alertHandler) {
        this.alertThresholdMBPerHour = alertThresholdMBPerHour;
        this.alertHandler = alertHandler;
        this.snapshots = new ArrayList<>();
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "leak-detector");
            t.setDaemon(true);
            return t;
        });
    }

    public void start(long intervalSeconds) {
        scheduler.scheduleAtFixedRate(
            this::sampleOldGen, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    private void sampleOldGen() {
        long oldGenUsedMB = getOldGenUsedMB();
        Instant now = Instant.now();
        snapshots.add(new OldGenSnapshot(now, oldGenUsedMB));

        // Keep only the last 6 hours of samples
        Instant cutoff = now.minusSeconds(21600);
        snapshots.removeIf(s -> s.timestamp.isBefore(cutoff));

        // Need enough samples to establish a trend
        if (snapshots.size() < 6) return;

        // Calculate growth rate between oldest and newest samples
        OldGenSnapshot oldest = snapshots.get(0);
        OldGenSnapshot newest = snapshots.get(snapshots.size() - 1);
        double hoursElapsed = (newest.timestamp.toEpochMilli()
            - oldest.timestamp.toEpochMilli()) / 3_600_000.0;
        if (hoursElapsed < 0.5) return;

        double growthRateMBPerHour =
            (newest.usedMB - oldest.usedMB) / hoursElapsed;
        if (growthRateMBPerHour > alertThresholdMBPerHour) {
            String recommendation =
                buildRecommendation(growthRateMBPerHour, newest.usedMB);
            alertHandler.onLeakDetected(
                growthRateMBPerHour, newest.usedMB, recommendation);
        }
    }

    private long getOldGenUsedMB() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Old") || name.contains("Tenured")) {
                return pool.getUsage().getUsed() / (1024 * 1024);
            }
        }
        // Fallback for collectors without a distinct old gen pool
        MemoryMXBean memBean = ManagementFactory.getMemoryMXBean();
        return memBean.getHeapMemoryUsage().getUsed() / (1024 * 1024);
    }

    private String buildRecommendation(double growthRateMBPerHour,
                                       long currentOldGenMB) {
        StringBuilder sb = new StringBuilder();
        sb.append("Memory leak detected. ");
        sb.append("Growth rate: ")
          .append(String.format("%.1f", growthRateMBPerHour));
        sb.append(" MB/hour. ");
        sb.append("Current old gen: ").append(currentOldGenMB).append(" MB. ");
        sb.append("Actions: ");
        sb.append("1) Capture heap dump (jmap -dump:live,format=b,file=heap.hprof). ");
        sb.append("2) Analyze with MAT - check dominator tree and histogram. ");
        sb.append("3) Compare with previous histogram to find growing object types.");
        return sb.toString();
    }

    private static class OldGenSnapshot {
        final Instant timestamp;
        final long usedMB;

        OldGenSnapshot(Instant timestamp, long usedMB) {
            this.timestamp = timestamp;
            this.usedMB = usedMB;
        }
    }
}
```
- Healthy pattern: heap rises to 4GB, GC brings it back to 1.5GB. Next cycle: rises to 4GB, back to 1.5GB. Baseline is stable.
- Leak pattern: heap rises to 4GB, GC brings it to 1.5GB. Next cycle: rises to 4GB, back to 1.8GB. Next: back to 2.1GB. Baseline is rising.
- Key metric: old gen usage after full GC. Monitor this, not peak heap usage.
- Detection: take snapshots every 30 seconds. Calculate growth rate of post-GC baseline. Alert if >5% per hour.
- False positive: legitimate cache growth (new data being cached) looks like a leak. Distinguish by checking if the growth stabilizes.
Production JVM Configuration: Flags That Matter
JVM configuration is where most memory incidents are prevented, or caused. The wrong flags make debugging impossible. The right flags make it trivial.
Non-negotiable production flags:
-XX:+HeapDumpOnOutOfMemoryError: captures a heap dump when OOM occurs. Without this, you have no diagnostic data after the crash. Set -XX:HeapDumpPath to a persistent directory (not /tmp in containers, where /tmp is often tmpfs and too small).
-Xlog:gc*:file=gc.log:time,uptime,level,tags:filecount=5,filesize=100M: enables GC logging with rotation. Essential for diagnosing GC issues. JDK 11+ syntax. For JDK 8, use -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log.
-XX:+ExitOnOutOfMemoryError: kills the JVM immediately on OOM instead of leaving it in an undefined state. In containerized environments, this ensures the container restarts via the orchestrator. Without it, the JVM may continue running in a degraded state, accepting requests it cannot process.
-XX:MaxRAMPercentage=70.0: sets max heap as a percentage of container memory. An alternative to -Xmx for containerized deployments that adjusts automatically when container limits change. Use 70-75% to leave room for off-heap.
Container memory calculation: Container memory = heap (Xmx) + metaspace + thread stacks (Xss × thread count) + direct memory (MaxDirectMemorySize) + native memory (JNI) + OS overhead.
Rule of thumb: set the container memory limit to 1.3-1.5x your -Xmx value. For a 4GB heap, set the container limit to 5.2-6GB. This covers metaspace (~100-200MB), thread stacks (200 threads × 1MB = 200MB), direct memory (~256MB), and OS overhead (~500MB).
Thread stack sizing: -Xss sets stack size per thread. Default is 512KB-1MB depending on OS. For services with many threads, this matters: 500 threads × 1MB = 500MB of stack memory. If your call depth is shallow, reduce to -Xss256k. If you have deep recursion, increase to -Xss2m.
Metaspace sizing: -XX:MaxMetaspaceSize limits metaspace growth. Without this limit, metaspace can consume all available native memory. Set it to a reasonable value (256MB-512MB for most services). If you hit the limit, it indicates a classloader leak, not insufficient space.
JFR continuous recording: -XX:StartFlightRecording=settings=default,maxsize=100M,maxage=1h enables continuous JFR recording with a rolling buffer. When an incident occurs, dump the recording with jcmd <pid> JFR.dump. This gives you allocation, GC, and lock profiling data without restarting the service.
Edge case: -XX:+UseCompressedOops is enabled by default for heaps <32GB. It compresses object pointers from 8 bytes to 4 bytes, saving ~20% heap. Above 32GB, compressed oops are disabled and each object pointer costs 8 bytes. This means a 34GB heap may perform worse than a 31GB heap due to pointer size increase. Either stay under 32GB or go significantly above (40GB+).
```bash
#!/bin/bash
# Production JVM flags for containerized Java services
# Tested on JDK 17 with G1GC and ZGC configurations
#
# Flags are kept free of inline comments so the variables can be
# passed directly to the java command line.

# ============================================================
# BASELINE CONFIGURATION (G1GC - suitable for most services)
# ============================================================
# Groups: memory sizing, G1GC tuning, diagnostics (non-negotiable),
# GC logging (JDK 11+ syntax), JFR continuous recording,
# compressed oops (auto-enabled for heaps <32GB).
JVM_BASE_FLAGS="
-XX:MaxRAMPercentage=70.0
-XX:InitialRAMPercentage=50.0
-XX:MaxMetaspaceSize=256m
-Xss512k
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:G1HeapRegionSize=4m
-XX:InitiatingHeapOccupancyPercent=45
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/jvm/heapdump.hprof
-XX:+ExitOnOutOfMemoryError
-Xlog:gc*:file=/var/log/jvm/gc.log:time,uptime,level,tags:filecount=5,filesize=100m
-XX:StartFlightRecording=settings=default,maxsize=100M,maxage=1h,name=continuous,filename=/var/log/jvm/recording.jfr
-XX:+UseCompressedOops
-XX:+UseCompressedClassPointers
"

# ============================================================
# LOW-LATENCY CONFIGURATION (ZGC - for p99 < 10ms services)
# ============================================================
# Note: -XX:+ZGenerational requires JDK 21+ (generational ZGC).
JVM_ZGC_FLAGS="
-XX:MaxRAMPercentage=70.0
-XX:MaxMetaspaceSize=256m
-Xss512k
-XX:+UseZGC
-XX:+ZGenerational
-XX:ConcGCThreads=4
-XX:ParallelGCThreads=8
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/jvm/heapdump.hprof
-XX:+ExitOnOutOfMemoryError
-Xlog:gc*:file=/var/log/jvm/gc.log:time,uptime,level,tags:filecount=5,filesize=100m
-XX:StartFlightRecording=settings=default,maxsize=100M,maxage=1h,name=continuous
"

# ============================================================
# CONTAINER MEMORY CALCULATION
# ============================================================
#
# For a 4GB heap (-XX:MaxRAMPercentage=70.0 on a 5.7GB container):
#
# Heap:           4000 MB  (70% of 5700MB)
# Metaspace:       256 MB  (MaxMetaspaceSize)
# Thread stacks:   200 MB  (400 threads x 512KB)
# Direct memory:   256 MB  (default = Xmx)
# GC overhead:     200 MB  (G1GC bookkeeping)
# Native/JNI:      300 MB  (JNI libraries, socket buffers)
# OS overhead:     500 MB  (page cache, file descriptors)
# ----------------------------------------
# Total:          5712 MB  (container limit: 5.7GB)
#
# Formula: Container = Xmx x 1.43 (round up to nearest 256MB)
# ============================================================

echo "JVM flags configured for production deployment"
echo "Container memory recommendation: Xmx x 1.43"
```
- Heap (Xmx): 70% of container memory. This is your working memory for objects.
- Thread stacks: 512KB-1MB per thread. 400 threads = 200-400MB. Reduce with -Xss256k if call depth is shallow.
- Metaspace: 100-256MB for most services. Set MaxMetaspaceSize to prevent runaway growth.
- Direct memory: default equals Xmx. Set MaxDirectMemorySize explicitly if using NIO/Netty.
- OS overhead: 300-500MB for page cache, file descriptors, socket buffers. Never allocate 100% of container memory to JVM.
Off-Heap Memory: Direct Buffers, Native Memory, and Thread Stacks
Most JVM memory guides focus exclusively on heap. In production, off-heap memory causes at least 30% of OOM incidents. The container OOM killer does not care whether the memory is heap or off-heap: it kills when total usage exceeds the limit.
Direct ByteBuffer: allocated via ByteBuffer.allocateDirect(). Lives outside the heap in native memory. Used by NIO channels, Netty, gRPC, and file I/O. The JVM tracks direct buffer usage against -XX:MaxDirectMemorySize (default = -Xmx). If direct buffer allocation exceeds this limit, you get OOM: Direct buffer memory.
The insidious part: direct buffers are freed by a ReferenceQueue-based cleaner, not immediately when the buffer is GC'd. If the application allocates direct buffers faster than the GC and cleaner can reclaim them, you get OOM even though the buffers are technically unreachable. This is a rate problem, not a leak problem.
Thread stacks: each thread has a stack of size -Xss. Default is 512KB-1MB. 500 threads × 1MB = 500MB. This memory is allocated at thread creation and never shrinks. In services with dynamic thread pools, thread count can grow under load, consuming more stack memory.
Metaspace: class metadata storage. Replaced PermGen in JDK 8. Grows as classes are loaded. Bounded by -XX:MaxMetaspaceSize, but unbounded by default: it can consume all native memory if not limited.
JNI native memory: memory allocated by native libraries via JNI. The JVM does not track this. Common sources: database drivers (OCI, native JDBC), compression libraries (zlib, snappy), and cryptographic providers. Use NativeMemoryTracking to estimate.
MappedByteBuffer: file-backed memory mapping via FileChannel.map(). Maps file contents directly into the process address space. Not counted against heap or MaxDirectMemorySize. Large memory-mapped files can trigger container OOM.
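Two of the categories above, direct and mapped buffers, are observable from inside the JVM without NMT, via the standard BufferPoolMXBean. A minimal sketch (the class name BufferPoolStats is illustrative):

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

public class BufferPoolStats {

    /** Returns bytes held by the named pool ("direct" or "mapped"),
     *  or -1 if this JVM does not expose that pool. */
    public static long poolBytes(String poolName) {
        List<BufferPoolMXBean> pools =
            ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if (pool.getName().equals(poolName)) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = poolBytes("direct");
        // Allocating 4 MB of direct memory raises the pool counter
        // immediately; it only drops after GC runs the buffer's cleaner.
        ByteBuffer buf = ByteBuffer.allocateDirect(4 * 1024 * 1024);
        System.out.println("direct pool: " + before + " -> "
            + poolBytes("direct") + " bytes (capacity " + buf.capacity() + ")");
        System.out.println("mapped pool: " + poolBytes("mapped") + " bytes");
    }
}
```

Sampling poolBytes("direct") periodically gives the allocation-rate view that the "rate problem" above calls for, without enabling NMT.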
Diagnosis tool: NativeMemoryTracking (NMT). Enable with -XX:NativeMemoryTracking=detail. Query with jcmd <pid> VM.native_memory summary. NMT shows memory breakdown by category: Java Heap, Class (metaspace), Thread, Code, GC, Internal, Symbol, Malloc, and Mapped.
Performance caveat: NMT adds 5-10% overhead in detail mode. Use -XX:NativeMemoryTracking=summary for production (1-2% overhead). Switch to detail mode only during active debugging.
Edge case: Netty's PooledByteBufAllocator recycles direct buffers to avoid allocation overhead. If the pool grows under load, it retains memory even after the buffers are released. Monitor Netty's pool metrics (PooledByteBufAllocator.metric()) to detect pool bloat.
```java
package io.thecodeforge.monitoring;

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

/**
 * Off-Heap Memory Monitor - tracks memory usage outside
 * the JVM heap that contributes to container OOM kills.
 */
public class OffHeapMonitor {

    public static class OffHeapReport {
        public long metaspaceUsedMB;
        public long metaspaceMaxMB;
        public long threadStackMB;
        public int threadCount;
        public long directMemoryUsedMB;
        public long compressedClassSpaceMB;
        public long codeCacheMB;
        public Map<String, String> recommendations = new HashMap<>();

        public long totalOffHeapMB() {
            return metaspaceUsedMB + threadStackMB
                + compressedClassSpaceMB + codeCacheMB + directMemoryUsedMB;
        }
    }

    public static OffHeapReport analyze() {
        OffHeapReport report = new OffHeapReport();

        // Metaspace, compressed class space, code cache
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            MemoryUsage usage = pool.getUsage();
            if (name.contains("Compressed Class Space")) {
                report.compressedClassSpaceMB = usage.getUsed() / (1024 * 1024);
            } else if (name.contains("Metaspace")) {
                report.metaspaceUsedMB = usage.getUsed() / (1024 * 1024);
                report.metaspaceMaxMB =
                    usage.getMax() > 0 ? usage.getMax() / (1024 * 1024) : -1;
            } else if (name.contains("CodeHeap") || name.contains("Code Cache")) {
                // Segmented code cache (JDK 9+) or the legacy single pool
                report.codeCacheMB += usage.getUsed() / (1024 * 1024);
            }
        }

        // Thread stacks
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        report.threadCount = threadBean.getThreadCount();
        // Rough estimate: ~1MB per thread. More accurate: check
        // -XX:ThreadStackSize via -XX:+PrintFlagsFinal.
        report.threadStackMB = report.threadCount;

        // Direct buffer usage via the platform "direct" buffer pool.
        // (sun.misc.VM is not accessible on JDK 9+; this is the supported API.)
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                report.directMemoryUsedMB = pool.getMemoryUsed() / (1024 * 1024);
            }
        }

        // Recommendations
        if (report.metaspaceUsedMB > 200) {
            report.recommendations.put("metaspace",
                "Metaspace using " + report.metaspaceUsedMB
                + "MB - check for classloader leaks");
        }
        if (report.threadCount > 300) {
            report.recommendations.put("threads",
                report.threadCount + " threads active - "
                + report.threadStackMB + "MB in stacks. "
                + "Consider reducing thread pool size or -Xss.");
        }
        long totalOffHeap = report.totalOffHeapMB();
        if (totalOffHeap > 1024) {
            report.recommendations.put("total",
                "Total off-heap: " + totalOffHeap + "MB. "
                + "Ensure container memory limit accounts for this.");
        }
        return report;
    }
}
```
- Thread stacks: 512KB-1MB per thread. 400 threads = 200-400MB. This grows if thread pool scales up under load.
- Metaspace: 100-256MB typical. Unbounded by default. Set MaxMetaspaceSize to prevent runaway growth.
- Direct buffers: tracked by MaxDirectMemorySize. Default equals Xmx. Netty pools can retain memory even after release.
- Native memory: JNI libraries, socket buffers, file descriptors. Not tracked by JVM. Use NMT for estimates.
- MappedByteBuffer: file-backed mapping. Not counted against heap or direct memory. Large files can trigger container OOM.
Building a Production Memory Monitoring Stack
Memory incidents are preventable with the right monitoring. The goal is to detect problems hours before they cause OOM, not after.
Layer 1 - JVM metrics (Prometheus/JMX): Expose heap usage, GC pause times, GC count, thread count, and metaspace usage via JMX. Use Micrometer or JMX Exporter for Prometheus integration. Key alerts:
- Old gen usage after GC > 70% for 10 minutes: warning
- Old gen usage after GC > 85% for 5 minutes: critical
- GC pause p99 > 500ms: warning
- GC pause p99 > 2s: critical
- Thread count > 80% of max pool size: warning
- Full GC count > 0 in last hour: investigate
Layer 2 - Container metrics (cAdvisor/Kubernetes): Monitor container memory usage (not just JVM heap). Key alerts:
- Container memory > 85% of limit: warning
- Container memory > 95% of limit: critical (OOM imminent)
- Container restart count > 0 in last hour: investigate
Layer 3 - Application-level metrics: Track object counts for known leak-prone structures: session cache size, connection pool size, thread-local count. These are domain-specific and catch leaks that JVM metrics miss.
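A Layer 3 metric can be as simple as a gauge over a domain collection. A minimal sketch, with invented names (SessionRegistry, gaugeValue are not from any library):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Tracks live session count so a scraper can read it as a gauge. */
public class SessionRegistry {
    private final Map<String, Object> sessions = new ConcurrentHashMap<>();
    private final AtomicLong peakSize = new AtomicLong();

    public void put(String id, Object session) {
        sessions.put(id, session);
        peakSize.accumulateAndGet(sessions.size(), Math::max);
    }

    public void remove(String id) {
        sessions.remove(id);
    }

    /** Current size is the gauge a scraper reads; a rising floor here
     *  flags a domain-level leak that heap metrics alone would miss. */
    public long gaugeValue() {
        return sessions.size();
    }

    public long peak() {
        return peakSize.get();
    }
}
```

Exposing gaugeValue() through Micrometer or a plain HTTP endpoint makes the "floor never drops" pattern directly visible in dashboards.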
Alerting philosophy: Alert on trends, not thresholds. A heap at 80% is fine if it returns to 40% after GC. A heap at 60% is a problem if it never drops below 55% after GC. The post-GC baseline trend is the most important metric.
Automated remediation: For containerized services, configure liveness probes that check heap usage. If heap exceeds 90%, the probe fails and Kubernetes restarts the pod. This is a safety net, not a fix, but it prevents the service from running in a degraded state while you investigate.
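The heap check behind such a probe can be sketched in a few lines with the standard MemoryMXBean; the HTTP wiring and the exact threshold policy are deployment details left out here, and the class name HeapLivenessCheck is illustrative:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapLivenessCheck {

    /** Returns true while heap usage stays below the given fraction of max. */
    public static boolean isHealthy(double maxUsageFraction) {
        MemoryUsage heap =
            ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        if (heap.getMax() <= 0) {
            return true; // max heap not reported; do not fail the probe
        }
        return (double) heap.getUsed() / heap.getMax() < maxUsageFraction;
    }

    public static void main(String[] args) {
        // A probe endpoint would map this boolean to HTTP 200 / 503
        System.out.println(isHealthy(0.90) ? "LIVE" : "FAIL");
    }
}
```

Checking usage right after a GC cycle (or using the post-GC pool values) avoids flapping the probe on a heap that is merely between collections.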
Retention and analysis: Keep GC logs and heap dumps for at least 7 days. Memory leaks can take days to manifest. If you only keep 24 hours of logs, you lose the trend data needed for diagnosis. Store dumps in object storage (S3, GCS) with lifecycle policies.
Production insight: the monitoring stack itself must not consume significant memory. A common mistake is running a heavy APM agent (100-200MB overhead) alongside the JVM. In a 2GB heap container, the agent consumes 5-10% of total memory. Use lightweight agents (JMX Exporter <20MB) or expose metrics via an HTTP endpoint without an agent.
```java
package io.thecodeforge.monitoring;

import com.sun.management.OperatingSystemMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

/**
 * Memory Metrics Exporter - exposes JVM memory metrics
 * for Prometheus/monitoring integration.
 *
 * Lightweight alternative to heavy APM agents.
 * Estimated overhead: <5MB heap, <0.1% CPU.
 */
public class MemoryMetricsExporter {

    public static class MemoryMetrics {
        // Heap
        public long heapUsedMB;
        public long heapMaxMB;
        public long heapCommittedMB;
        public double heapUsagePercent;
        // Young gen
        public long youngGenUsedMB;
        public long youngGenMaxMB;
        // Old gen
        public long oldGenUsedMB;
        public long oldGenMaxMB;
        public double oldGenUsagePercent;
        // Off-heap
        public long metaspaceUsedMB;
        public long metaspaceMaxMB;
        public long threadCount;
        public long threadStackEstimateMB;
        // GC
        public long youngGcCount;
        public long youngGcTimeMs;
        public long fullGcCount;
        public long fullGcTimeMs;
        public double gcTimePercent;
        // Container
        public long containerMemoryLimitMB;
        public long processPhysicalMemoryMB;
        public double containerUsagePercent;
    }

    public static MemoryMetrics collect() {
        MemoryMetrics m = new MemoryMetrics();
        MemoryMXBean memBean = ManagementFactory.getMemoryMXBean();

        // Heap
        MemoryUsage heap = memBean.getHeapMemoryUsage();
        m.heapUsedMB = heap.getUsed() / (1024 * 1024);
        m.heapMaxMB = heap.getMax() / (1024 * 1024);
        m.heapCommittedMB = heap.getCommitted() / (1024 * 1024);
        m.heapUsagePercent = (double) heap.getUsed() / heap.getMax() * 100;

        // Memory pools
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            MemoryUsage usage = pool.getUsage();
            if (name.contains("Eden") || name.contains("Survivor")) {
                m.youngGenUsedMB += usage.getUsed() / (1024 * 1024);
                if (usage.getMax() > 0) {
                    m.youngGenMaxMB += usage.getMax() / (1024 * 1024);
                }
            } else if (name.contains("Old") || name.contains("Tenured")) {
                m.oldGenUsedMB = usage.getUsed() / (1024 * 1024);
                m.oldGenMaxMB =
                    usage.getMax() > 0 ? usage.getMax() / (1024 * 1024) : 0;
                m.oldGenUsagePercent = m.oldGenMaxMB > 0
                    ? (double) m.oldGenUsedMB / m.oldGenMaxMB * 100 : 0;
            } else if (name.contains("Metaspace")) {
                m.metaspaceUsedMB = usage.getUsed() / (1024 * 1024);
                m.metaspaceMaxMB =
                    usage.getMax() > 0 ? usage.getMax() / (1024 * 1024) : -1;
            }
        }

        // GC stats ("Young"/"Old" also match G1's collector names)
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        for (GarbageCollectorMXBean gc :
                ManagementFactory.getGarbageCollectorMXBeans()) {
            String name = gc.getName();
            if (name.contains("Young") || name.contains("Scavenge")) {
                m.youngGcCount = gc.getCollectionCount();
                m.youngGcTimeMs = gc.getCollectionTime();
            } else if (name.contains("Old") || name.contains("MarkSweep")) {
                m.fullGcCount = gc.getCollectionCount();
                m.fullGcTimeMs = gc.getCollectionTime();
            }
        }
        m.gcTimePercent =
            (double) (m.youngGcTimeMs + m.fullGcTimeMs) / uptimeMs * 100;

        // Threads
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        m.threadCount = threadBean.getThreadCount();
        m.threadStackEstimateMB = m.threadCount; // ~1MB per thread estimate

        // Container / OS memory. With container support enabled, the OS
        // bean reports the cgroup limit as total memory.
        try {
            OperatingSystemMXBean osBean = (OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
            long totalPhysical = osBean.getTotalMemorySize(); // JDK 14+
            long freePhysical = osBean.getFreeMemorySize();
            m.processPhysicalMemoryMB =
                (totalPhysical - freePhysical) / (1024 * 1024);
            m.containerMemoryLimitMB = totalPhysical / (1024 * 1024);
            m.containerUsagePercent = (double) m.processPhysicalMemoryMB
                / m.containerMemoryLimitMB * 100;
        } catch (Exception e) {
            // Not available on all JVMs
        }
        return m;
    }

    public static String toPrometheusFormat(MemoryMetrics m) {
        StringBuilder sb = new StringBuilder();
        sb.append("# HELP jvm_memory_heap_used_bytes JVM heap used\n");
        sb.append("# TYPE jvm_memory_heap_used_bytes gauge\n");
        sb.append("jvm_memory_heap_used_bytes ")
          .append(m.heapUsedMB * 1024 * 1024).append("\n\n");
        sb.append("# HELP jvm_memory_old_gen_usage_percent Old gen usage\n");
        sb.append("# TYPE jvm_memory_old_gen_usage_percent gauge\n");
        sb.append("jvm_memory_old_gen_usage_percent ")
          .append(String.format("%.2f", m.oldGenUsagePercent)).append("\n\n");
        sb.append("# HELP jvm_gc_full_count Full GC count\n");
        sb.append("# TYPE jvm_gc_full_count counter\n");
        sb.append("jvm_gc_full_count ").append(m.fullGcCount).append("\n\n");
        sb.append("# HELP jvm_memory_container_usage_percent Container memory usage\n");
        sb.append("# TYPE jvm_memory_container_usage_percent gauge\n");
        sb.append("jvm_memory_container_usage_percent ")
          .append(String.format("%.2f", m.containerUsagePercent)).append("\n");
        return sb.toString();
    }
}
```
- Layer 1 (JVM): heap usage, GC pauses, GC count, metaspace, thread count. Catches heap leaks and GC problems.
- Layer 2 (Container): total memory usage, restart count, OOM kill count. Catches off-heap issues that JVM metrics miss.
- Layer 3 (Application): session cache size, connection pool size, custom object counts. Catches domain-specific leaks.
- Alert on trends: post-GC old gen baseline rising = leak. Post-GC old gen stable = right-sizing issue.
- Keep GC logs and heap dumps for 7+ days. Memory leaks take days to manifest. 24-hour retention loses trend data.
| Situation | Common Cause | Best Fix |
|---|---|---|
| OOM: Java heap space | Memory leak or undersized heap | Analyze heap dump with MAT. Find leak via dominator tree. Fix leak, then right-size heap. |
| OOM: Metaspace | ClassLoader leak in hot-redeploy environment | Restart JVM on redeploy. Avoid static references to dynamic classloaders. Use WeakHashMap. |
| OOM: Direct buffer memory | Netty/NIO buffer leak or insufficient MaxDirectMemorySize | Enable ResourceLeakDetector. Set MaxDirectMemorySize explicitly. Monitor with NMT. |
| GC overhead limit exceeded | Memory leak: GC cannot free enough memory | Analyze heap dump. Fix the leak. Increasing heap only delays the crash. |
| StackOverflowError | Infinite recursion or deep call stack | Convert recursion to iteration. Increase -Xss if deep recursion is intentional. |
| Container OOM kill (exit 137) | Total memory (heap + off-heap) exceeds container limit | Set container limit to 1.43x heap. Add NativeMemoryTracking. Monitor container memory. |
| GC pauses >1 second | Full GC on large heap with G1GC | Switch to ZGC (sub-ms pauses) or tune G1GC MaxGCPauseMillis and IHOP. |
| Memory grows but no single leak object | Distributed leak (ThreadLocal, unbounded cache) | Compare heap histograms over time. Check ThreadLocal.remove() and cache eviction. |
| OOM only at high traffic | Allocation rate exceeds GC throughput | Reduce allocation rate (object pooling, caching). Switch to higher-throughput GC. |
| OOM after code deployment | New code introduced leak or removed cleanup | Diff deployed code. Look for new caches, new ThreadLocal, removed eviction logic. |
| Heap at 80% but stable (no leak) | Working set is legitimately large | Right-size heap. Working set × 2 is a good starting point. Not all high usage is a leak. |
| Humongous allocations in GC logs | Objects >50% of G1 region size | Increase G1HeapRegionSize or refactor large byte[]/StringBuilder allocations. |
| SoftReference cache consuming all heap | JVM only collects SoftReferences under heap pressure | Switch to size-bounded cache (Caffeine) with explicit eviction. |
| Netty buffer pool growing unbounded | PooledByteBufAllocator retains buffers under load | Set maxOrder limit. Monitor pool metrics. Use -XX:MaxDirectMemorySize. |
🎯 Key Takeaways
- Five OOM types, five different causes, five different diagnostics. Match the error message to the correct tool before debugging.
- Two heap dumps 30-60 minutes apart reveal leaks that a single dump cannot. Compare histograms to find the fastest-growing object type.
- GC tuning is about trade-offs: throughput vs latency vs memory. Batch = Parallel GC. Real-time = ZGC. Everything else = G1GC.
- Monitor post-GC old gen baseline, not peak heap usage. A rising baseline confirms a leak. Peak usage is irrelevant for leak detection.
- Set container memory to 1.43x your heap size. Off-heap memory (thread stacks, metaspace, direct buffers) is invisible to heap monitoring but visible to the container OOM killer.
- ThreadLocal and unbounded caches are the most common production leak sources. Always call ThreadLocal.remove() in a finally block. Always set maximumSize on caches.
- Three non-negotiable production flags: -XX:+HeapDumpOnOutOfMemoryError, GC logging, and -XX:+ExitOnOutOfMemoryError. Without them, you are flying blind.
- Enable NativeMemoryTracking to profile off-heap memory. Container OOM kills with normal heap usage indicate off-heap pressure.
- Netty's PooledByteBufAllocator retains direct buffers even after release. Monitor pool metrics and set explicit MaxDirectMemorySize.
- Keep GC logs and heap dumps for 7+ days. Memory leaks take days to manifest. 24-hour retention loses the trend data needed for diagnosis.
- Print the symptom-to-tool map and the five essential commands. When the alert fires at 2 AM, you need to triage in 60 seconds, not 45 minutes.
- Run fast commands first (jmap -histo, jstat -gcutil). Run slow commands (jmap -dump) only when fast commands do not reveal the issue.
β Common Mistakes to Avoid
- Treating all OOM errors the same – each type (heap, metaspace, direct, GC overhead, stack) has a different cause and fix.
- Not setting -XX:+HeapDumpOnOutOfMemoryError – without it, you have zero diagnostic data after an OOM crash.
- Setting -Xmx equal to the container memory limit – the container OOM killer strikes before the JVM OOM handler, leaving no heap dump.
- Doubling heap size without understanding the leak – this just delays the crash and makes the heap dump twice as large to analyze.
- Monitoring peak heap usage instead of the post-GC old gen baseline – peak usage depends on allocation rate and GC timing, not leak presence.
- Using a plain HashMap for session storage with manual cleanup – use a Caffeine or Guava cache with TTL and maximumSize.
- Calling ThreadLocal.set() without ThreadLocal.remove() in thread pool environments – ThreadLocal values persist for the thread's lifetime.
- Using String.intern() on user-generated input – this creates an unbounded string pool that grows with every unique string.
- Not enabling GC logging in production – GC logs are essential for diagnosing pause time and allocation rate problems.
- Using a heavy APM agent in memory-constrained containers – 150MB of agent overhead in a 2GB container is over 7% of total memory.
- Switching GC collectors without understanding the workload – ZGC adds CPU overhead that is wasted on batch jobs that do not care about latency.
- Not monitoring container memory alongside JVM heap – off-heap memory (thread stacks, metaspace, direct buffers) can be 30-50% of total usage.
- Keeping GC logs and heap dumps for only 24 hours – memory leaks take days to manifest, so trend analysis needs longer retention.
- Ignoring humongous allocations in G1GC – objects at 50% or more of region size cause premature full GCs and performance degradation.
- Setting -XX:MaxMetaspaceSize too low – Metaspace OOM is usually a classloader leak, not insufficient space. Fix the leak, not the limit.
- Not accounting for the compressed oops boundary at 32GB – a 34GB heap can perform worse than a 31GB one because object pointers double in size.
- Using SoftReference-based caches – the JVM only collects SoftReferences under heap pressure, so they can consume all available memory.
- Running the wrong diagnostic tool for the symptom – jmap does not help with direct memory, and GC logs do not help with stack overflow.
- Running jmap -dump before jmap -histo – the histogram is fast and often reveals the problem without the slow full dump.
- Not having a standardized diagnostic script – ad-hoc debugging at 2 AM wastes 20+ minutes per incident.
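The bounded-cache mistake above can be illustrated with a stdlib-only sketch. A real service would use Caffeine with `maximumSize` and `expireAfterWrite`; this hypothetical example uses `LinkedHashMap` in access order with `removeEldestEntry` to show the same eviction idea without any dependency:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedSessionCache {
    public static void main(String[] args) {
        final int MAX = 3; // illustrative cap; production caches also need a TTL
        // accessOrder=true makes iteration order least-recently-used first
        Map<String, String> sessions = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > MAX; // evict the LRU entry once past the cap
            }
        };
        for (int i = 1; i <= 5; i++) {
            sessions.put("s" + i, "user" + i); // s1 and s2 get evicted
        }
        System.out.println(sessions.keySet()); // only the 3 most recent survive
    }
}
```

An unbounded `HashMap` in the same role would simply keep all five entries forever, which is exactly the slow-leak pattern described above.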
Interview Questions on This Topic
- Q: Walk me through how you would debug an OOM: Java heap space error in a production service. What tools would you use, what would you look for, and how would you confirm the fix?
- Q: Your service is running in a Kubernetes pod with a 4GB memory limit. The pod gets killed with exit code 137 every few hours, but your JVM heap monitoring shows usage never exceeds 60%. What is happening and how would you fix it?
- Q: Explain the difference between the five OOM types in the JVM. For each, what is the typical root cause and what diagnostic tool would you use?
- Q: You need to reduce GC pause times from 500ms to under 10ms for a latency-sensitive trading service. What GC collector would you choose, what parameters would you tune, and what trade-offs would you accept?
- Q: A memory leak in production causes OOM every 20 hours. You take a heap dump but the dominator tree shows no single object dominating memory. How do you find the leak?
- Q: Explain the sawtooth pattern in JVM heap usage. How do you distinguish a healthy sawtooth from a memory leak? What metric do you monitor?
- Q: Your team wants to switch from G1GC to ZGC because they read it is 'faster.' What questions would you ask before approving the change, and what trade-offs would you explain?
- Q: Design a memory monitoring stack for a fleet of 200 Java microservices running in Kubernetes. What metrics would you collect, what alerts would you set, and how would you keep overhead minimal?
- Q: A service uses Netty for HTTP handling. Container memory usage grows to 95% of the limit but heap usage is only 50%. Diagnose the issue and explain the fix.
- Q: You inherit a codebase with 50 ThreadLocal usages across the application. How would you audit them for leaks, and what patterns would you enforce to prevent ThreadLocal leaks in thread pool environments?
- Q: It is 3 AM and you just received an OOM alert. You have 60 seconds before the on-call escalation. Walk me through the exact commands you would run and in what order.
- Q: Your JVM flags include -Xmx4g and the container memory limit is 4GB. Explain why this is wrong and how you would fix it.
Frequently Asked Questions
What is the difference between OOM: Java heap space and GC overhead limit exceeded?
OOM: Java heap space means the heap is full and GC cannot free enough space for the current allocation. GC overhead limit exceeded means GC is running continuously (>98% of time) and recovering almost nothing (<2% of heap). Both indicate memory pressure, but GC overhead is the JVM's way of saying 'I tried GC and it did not help – you have a leak.' Fix the leak, do not just increase heap.
How much memory should I allocate to the JVM in a container?
Set container memory to 1.43x your -Xmx value. For a 4GB heap, set the container limit to 5.7GB. The extra ~1.7GB (roughly 30% of the container, 43% of the heap) covers metaspace (~256MB), thread stacks (~200MB for 200 threads), direct memory (~256MB), GC overhead (~200MB), and OS overhead (~500MB). Use -XX:MaxRAMPercentage=70.0 to set heap as 70% of container memory.
How do I find a memory leak in production?
Step 1: confirm it is a leak by monitoring post-GC old gen baseline β if it rises over hours, it is a leak. Step 2: take two heap dumps 30-60 minutes apart. Step 3: compare histograms in Eclipse MAT to find the fastest-growing object type. Step 4: follow the reference chain to GC root to find who holds the reference. Step 5: fix the reference (add eviction, call remove(), close the resource).
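Step 3 (comparing histograms) can also be scripted when MAT is not at hand. The sketch below parses two `jmap -histo`-style snapshots and ranks classes by byte growth; the sample data and the `com.example.Session` class are fabricated for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Diffs two class histograms (as produced by `jmap -histo`) and prints the
// fastest-growing types. Sample snapshots below are made up for illustration.
public class HistoDiff {
    static Map<String, Long> parse(String histo) {
        Map<String, Long> bytesByClass = new HashMap<>();
        for (String line : histo.strip().split("\n")) {
            // jmap -histo columns: rank, #instances, #bytes, class name
            String[] cols = line.trim().split("\\s+");
            bytesByClass.put(cols[3], Long.parseLong(cols[2]));
        }
        return bytesByClass;
    }

    public static void main(String[] args) {
        String dump1 = """
                1: 120000 9600000 java.util.HashMap$Node
                2:  50000 4000000 com.example.Session
                3:  80000 2400000 java.lang.String
                """;
        String dump2 = """
                1: 300000 24000000 java.util.HashMap$Node
                2: 150000 12000000 com.example.Session
                3:  81000  2430000 java.lang.String
                """;
        Map<String, Long> before = parse(dump1);
        Map<String, Long> after = parse(dump2);
        after.entrySet().stream()
             .sorted((a, b) -> Long.compare(
                     b.getValue() - before.getOrDefault(b.getKey(), 0L),
                     a.getValue() - before.getOrDefault(a.getKey(), 0L)))
             .limit(2)
             .forEach(e -> System.out.printf("%s grew by %d bytes%n",
                     e.getKey(), e.getValue() - before.getOrDefault(e.getKey(), 0L)));
    }
}
```

In this fabricated data, `HashMap$Node` and `Session` dominate the growth while `String` is nearly flat, which is the pattern that points you at the leaking structure.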
Should I use G1GC, ZGC, or Parallel GC?
G1GC for most services (good balance). ZGC for latency-sensitive services requiring p99 < 10ms (trading, real-time). Parallel GC for batch jobs where throughput matters and pauses are acceptable. Do not switch to ZGC just because it is newer β it adds CPU overhead that is wasted if you do not need sub-millisecond pauses.
My container is killed with exit code 137 but no JVM OOM error β what happened?
The Linux OOM killer terminated your process because total memory (heap + off-heap) exceeded the container memory limit. The JVM did not OOM – the OS killed it. Check if -Xmx equals the container memory limit (wrong). Increase the container limit to 1.43x heap size. Add NativeMemoryTracking to profile off-heap usage.
How do I diagnose Metaspace OOM?
Metaspace OOM is almost always a classloader leak, not insufficient space. Check if your service uses hot-redeploy without JVM restart. Use jcmd VM.classloader_stats to see classloader counts. Look for classloaders with high class counts that should have been unloaded. The fix is usually avoiding static references to classes from dynamic classloaders.
What is the sawtooth pattern and how does it help detect leaks?
Healthy JVM: heap rises during allocation, drops after GC, returns to the same baseline each time. Leaking JVM: the post-GC baseline rises over time. Monitor old gen usage after each full GC. If the baseline increases monotonically over hours, you have a leak. The post-GC baseline is the only metric that reveals a leak β peak usage is irrelevant.
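The "rising baseline" test is easy to automate: fit a line through the post-full-GC old gen readings and alert on a positive slope. A minimal sketch, with fabricated readings in MB (the thresholds and data are illustrative, not from a real service):

```java
// Flags a leak by checking whether the post-full-GC old gen baseline rises
// over time. A healthy service jitters around a flat baseline; a leaking one
// climbs every cycle.
public class BaselineCheck {
    // Least-squares slope of readings vs. sample index (MB per GC cycle)
    static double slope(double[] y) {
        int n = y.length;
        double sx = 0, sy = 0, sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sx += i; sy += y[i]; sxy += i * y[i]; sxx += (double) i * i;
        }
        return (n * sxy - sx * sy) / (n * sxx - sx * sx);
    }

    public static void main(String[] args) {
        double[] healthy = {412, 405, 418, 409, 414, 407}; // flat baseline
        double[] leaking = {410, 455, 498, 540, 587, 631}; // rising baseline
        System.out.println("healthy slope MB/cycle: " + Math.round(slope(healthy)));
        System.out.println("leaking slope MB/cycle: " + Math.round(slope(leaking)));
    }
}
```

A slope near zero is noise; a sustained positive slope (here ~44MB per cycle) is the leak signature, and it also lets you estimate the time left before OOM.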
How do I tune GC pause times?
First, determine if pauses are actually a problem β measure p99 latency. If GC pauses exceed your SLA, options: (1) tune G1GC with -XX:MaxGCPauseMillis and -XX:InitiatingHeapOccupancyPercent, (2) switch to ZGC for sub-millisecond pauses, (3) reduce allocation rate to decrease GC frequency, (4) increase heap to give GC more room. Always enable GC logging to measure the impact of changes.
How do I handle memory in a high-throughput service that allocates a lot of short-lived objects?
Ensure young gen is large enough to hold the working set of short-lived objects. In G1GC, this is automatic. In Parallel GC, tune -XX:NewRatio. Consider object pooling for frequently allocated large objects (but benchmark first β pooling adds complexity and can cause leaks). The most effective optimization is reducing allocation rate: reuse StringBuilder, avoid autoboxing in loops, use primitive collections.
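Two of the allocation-rate fixes above side by side; a small illustrative sketch, not a benchmark:

```java
// Keep loop math in primitives (autoboxing would allocate a Long per add),
// and reuse one StringBuilder instead of concatenating strings in a loop.
public class AllocationRate {
    public static void main(String[] args) {
        // Boxed version would be: Long total = 0L; total += i; -> one Long per iteration
        long total = 0L; // primitive accumulator: zero allocations
        for (int i = 1; i <= 1000; i++) {
            total += i;
        }
        System.out.println("sum=" + total);

        // One StringBuilder, reset with setLength(0), instead of `s += ...`
        // which allocates a new String and backing array on every pass.
        StringBuilder sb = new StringBuilder();
        String[] ids = {"a", "b", "c"};
        for (String id : ids) {
            sb.setLength(0); // reuse the same backing array
            sb.append("user-").append(id);
            System.out.println(sb);
        }
    }
}
```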
What JVM flags are essential for production?
Non-negotiable: -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/log/jvm/, GC logging (-Xlog:gc*), -XX:+ExitOnOutOfMemoryError. Recommended: -XX:MaxRAMPercentage=70.0 (container), -XX:MaxMetaspaceSize=256m, JFR continuous recording. These flags turn production incidents from guesswork into diagnosis.
What are the five essential debug commands for a JVM memory incident?
1) jcmd <pid> VM.native_memory summary – shows where all JVM memory is going. 2) jmap -histo:live <pid> | head -30 – shows the top 30 object types by count and size. 3) jstat -gcutil <pid> 1000 – shows GC activity in real time. 4) jcmd <pid> GC.heap_dump – full heap dump for MAT analysis. 5) jstack <pid> – thread dump for StackOverflowError and ThreadLocal leaks. Run in this order – fast commands first.
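When you cannot shell into the container to run these tools, much of the same heap and pool data is available in-process through the standard `java.lang.management` API. A minimal sketch (pool names vary by collector):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Prints a heap/non-heap breakdown similar to what jcmd and jstat report,
// from inside the running process (e.g. exposed via a health endpoint).
public class MemorySnapshot {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        System.out.printf("heap used=%dMB committed=%dMB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20);
        System.out.printf("non-heap used=%dMB%n",
                mem.getNonHeapMemoryUsage().getUsed() >> 20);
        // Per-pool usage: eden/survivor/old gen, metaspace, code cache, ...
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            System.out.printf("%-28s %dMB%n",
                    pool.getName(), pool.getUsage().getUsed() >> 20);
        }
    }
}
```

Polling this from a metrics exporter is how you get the post-GC old gen baseline onto a dashboard without attaching any external tool.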
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.