OOM types: 5 distinct types: Heap Space, Metaspace, Direct Memory, Stack Overflow, and GC Overhead Limit. Each has a different root cause and fix — treat them separately.
Always capture heap dumps: Set -XX:+HeapDumpOnOutOfMemoryError at startup. Without it you are guessing. The dump is the only reliable way to find what was in memory at crash time.
GC pause threshold: Pauses above 200ms in latency-sensitive services indicate a tuning problem. Switch collectors or adjust generation ratios before increasing heap size.
Memory leak signal: A sawtooth heap pattern that never returns to baseline after GC. Analyze the dominator tree in your heap dump to find the root retaining object.
Metaspace OOM: Usually a classloader leak in hot-redeploy environments — not insufficient Metaspace size. Increasing MaxMetaspaceSize just delays the same failure.
Container sizing rule: Set -Xmx to 70–75% of container memory. The remaining 25–30% covers off-heap, thread stacks, metaspace, and OS page cache.
Plain-English First
JVM memory issues are like a warehouse that keeps filling up. The garbage collector is the cleanup crew — if they cannot keep up, the warehouse overflows (OOM). Memory leaks are boxes that nobody ever throws away because someone keeps holding a reference to them. GC tuning is about hiring the right cleanup crew and giving them the right schedule. The key is knowing which type of overflow you have — is it the main warehouse (heap), the filing cabinet (metaspace), the loading dock (direct memory), or the office desks (thread stacks)?
JVM memory failures are among the most common causes of unplanned downtime in Java-based production systems. An OOM kill at 2 AM takes down the service, triggers alerts, and forces on-call engineers to diagnose under pressure.
Most OOM errors are preventable. The JVM provides extensive diagnostics — heap dumps, GC logs, JFR recordings — but teams rarely configure them before the incident. By the time the OOM fires, the evidence is already gone unless you captured it proactively.
This guide covers the five OOM types, GC tuning trade-offs, memory leak detection patterns, and the production configurations that prevent most memory-related outages. Every pattern comes from systems running at scale — not textbook examples.
Start with the Quick Debug Cheat Sheet above if you are actively debugging an incident. Use the sections below for deep understanding and prevention.
Why JVM Memory Debugging Is Not Optional
JVM memory debugging is the systematic process of identifying why a Java application consumes more heap than expected, often leading to OutOfMemoryError (OOM). The core mechanic involves capturing heap dumps, analyzing object retention paths, and measuring allocation rates to pinpoint the exact data structures or code paths responsible. Without this discipline, a single ConcurrentHashMap with 14.2 million entries can silently exhaust a 16 GB heap.
In practice, memory debugging relies on two key properties: object reachability from GC roots and allocation frequency. Tools like Eclipse MAT or JProfiler compute retained heap — the memory that would be freed if an object were garbage collected. This reveals that a seemingly small map entry (key+value+overhead ~200 bytes) multiplied by millions becomes gigabytes. The real insight often lies in unexpected retention chains, not just raw object counts.
Use memory debugging when your application shows gradual heap growth, frequent Full GCs, or crashes with OOM. It matters most in production systems with high concurrency or caching layers, where a single unbounded data structure can bring down a service. Teams that skip this step often mistake memory leaks for normal load spikes, leading to costly autoscaling instead of a 10-line fix.
Retained Heap vs. Shallow Heap
Shallow heap is the object's own size; retained heap includes everything it keeps alive. A ConcurrentHashMap may have small shallow size but huge retained heap due to millions of entries.
Production Insight
A payment processing service cached transaction metadata in a ConcurrentHashMap without eviction — 14.2 million entries after 3 days.
Symptom: JVM crashed with 'Java heap space' OOM during peak hours, GC logs showed 95% time spent in Full GC.
Rule: Always bound in-memory caches with size limits (e.g., Guava Cache) or use weak references for ephemeral data.
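As a minimal sketch of the rule above, the following uses only the JDK: a LinkedHashMap in access order with removeEldestEntry, wrapped in Collections.synchronizedMap. It is a stand-in for a library cache such as Guava or Caffeine, not a replacement for one. The class name, the key format, and the 1,000-entry limit are illustrative assumptions.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedCacheSketch {

    /** Returns a thread-safe LRU map that never holds more than maxEntries entries. */
    static <K, V> Map<K, V> boundedLruCache(int maxEntries) {
        // accessOrder=true keeps the least-recently-used entry at the head of the map.
        Map<K, V> lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Called after every put; returning true evicts the eldest entry.
                return size() > maxEntries;
            }
        };
        // One coarse lock: fine for a sketch, a bottleneck under heavy contention.
        return Collections.synchronizedMap(lru);
    }

    public static void main(String[] args) {
        Map<String, String> cache = boundedLruCache(1_000);
        for (int i = 0; i < 1_000_000; i++) {
            cache.put("txn-" + i, "metadata-" + i); // hypothetical transaction metadata
        }
        System.out.println("entries retained: " + cache.size());
    }
}
```

However many entries are written, the map stays at the configured limit, which is the property the unbounded ConcurrentHashMap in the incident lacked. A production cache would usually add time-based expiry and finer-grained locking.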
Key Takeaway
Memory debugging is about finding what holds references, not just what uses memory.
A single unbounded collection is the most common cause of production OOMs.
Heap dump analysis is the only reliable way to distinguish a leak from a legitimate memory spike.
Production Debugging Quick Map — Symptom to Tool
When a memory incident fires, you need to go from symptom to correct diagnostic tool in seconds. This map is designed to be printed and taped to your monitor.
The key insight: each symptom points to a specific memory region and a specific tool. Using the wrong tool wastes hours. A heap dump does not help with direct memory OOM. GC logs do not help with stack overflow. Match the symptom to the tool.
Decision flow:
1. Read the error message or symptom.
2. Find the matching row in the table below.
3. Run the diagnostic command.
4. Apply the fix.
Severity triage:
- Service crashed (OOM) → critical: capture diagnostics immediately
- Service degraded (slow) → high: capture diagnostics within 15 minutes
- Service trending toward OOM → medium: schedule diagnostics within 1 hour
- No symptoms, proactive check → low: run diagnostics during maintenance window
The table below covers the 12 most common production memory scenarios. Each row maps symptom → what to check → which tool → immediate action. This is the scan-first view — use it before reading any section in detail.
Production insight: the most time-consuming part of memory debugging is choosing the right tool. Engineers waste hours running jmap when they should be reading GC logs, or analyzing heap dumps when the issue is off-heap. This table eliminates that wasted time by mapping symptoms directly to tools.
symptom_tool_map.txt
SYMPTOM                      | WHAT TO CHECK                  | TOOL                                | IMMEDIATE ACTION
-----------------------------+--------------------------------+-------------------------------------+------------------------------------------
Exit code 137 (no JVM error) | Container memory vs heap       | kubectl top + jcmd VM.native_memory | Increase container to 1.43x heap
OOM: Java heap space         | Heap contents (what is big?)   | jmap -histo + Eclipse MAT           | Find leak via dominator tree
OOM: Metaspace               | Classloader count              | jcmd VM.classloader_stats           | Restart JVM, fix classloader leak
OOM: Direct buffer memory    | Direct buffer allocation       | jcmd VM.native_memory summary       | Fix buffer leak, set MaxDirectMemorySize
OOM: GC overhead limit       | GC frequency + old gen usage   | jstat -gcutil + GC logs             | Fix memory leak (GC cannot free enough)
StackOverflowError           | Call stack depth               | jstack <pid>                        | Convert recursion to iteration
Latency spikes (no OOM)      | GC pause times                 | GC logs (-Xlog:gc)                  | Tune GC or switch collector
CPU high + slow response     | GC time percentage             | jstat -gcutil <pid> 1s              | If GC time > 5%, fix leak or increase heap
Memory grows over hours      | Old gen trend (post-GC)        | jstat -gcutil + jmap -histo         | Compare histograms, find growing types
OOM only at high traffic     | Allocation rate                | JFR (settings=profile)              | Reduce allocation rate or increase heap
OOM only in production       | Object count comparison        | jmap -histo (prod vs staging)       | Find data-dependent leak
OOM after code deploy        | Code diff (new caches/threads) | git diff + heap dump                | Check for removed eviction logic
The 60-Second Triage Rule
Error message contains 'heap space' → heap region → jmap + MAT → leak or undersized heap.
No error message, just exit code 137 → container limit → kubectl top → off-heap exceeded container limit.
No crash, just slow → GC pauses → GC logs → collector tuning or leak.
Production Insight
An on-call engineer received an OOM alert at 3 AM. The error was 'java.lang.OutOfMemoryError: Java heap space.' The engineer ran jcmd VM.native_memory summary (wrong tool — that is for off-heap). The output showed nothing unusual. Then they ran kubectl top pod (wrong tool — that is for container-level). Still nothing. Then they ran jstat -gcutil (useful but not sufficient). After 45 minutes of wrong tools, they finally ran jmap -histo:live and found a HashMap with 8 million entries in 30 seconds.
Cause: mismatched symptom-to-tool mapping. Effect: 45 minutes of wasted debugging time during a 3 AM incident. Impact: extended outage, delayed root cause identification. Action: printed the symptom-to-tool map and taped it to every engineer's monitor. Result: subsequent incidents triaged in under 60 seconds.
The lesson: having the right tools is not enough. You need the right tool for the right symptom. A cheat sheet that maps symptoms to tools eliminates the most common source of debugging delays.
Key Takeaway
Symptom determines the memory region. Memory region determines the tool. Tool determines the root cause. Print the symptom-to-tool map and eliminate 45 minutes of wrong-tool debugging during incidents.
Which Tool to Use for Each Memory Region
If: Suspected heap leak (heap space OOM, growing old gen)
→
Use: jmap -histo:live for object counts, jmap -dump for heap dump analysis in MAT. Use jstat -gcutil to confirm old gen growth trend.
If: Suspected off-heap issue (container OOM, direct buffer OOM)
→
Use: jcmd VM.native_memory summary for breakdown. kubectl top pod for total container usage. Check -XX:MaxDirectMemorySize.
If: Suspected GC problem (latency spikes, GC overhead OOM)
→
Use: GC logs (-Xlog:gc*) for pause times and frequency. jstat -gcutil for real-time GC activity. Check collector type with -XX:+PrintCommandLineFlags.
If: Suspected classloader leak (Metaspace OOM)
→
Use: jcmd VM.classloader_stats for classloader counts. Check for hot-redeploy without JVM restart.
If: Suspected thread issue (StackOverflowError, high thread count)
→
Use: jstack for thread dump. ThreadMXBean.getThreadCount() for thread count. Check -Xss setting.
If: Suspected allocation rate issue (OOM only at high traffic)
→
Use: JFR with settings=profile for allocation hotspots. jstat -gcutil for allocation rate estimation. Check GC logs for promotion rate.
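For the thread-issue branch, ThreadMXBean can be read from inside the application. The sketch below prints the live and peak thread counts and groups threads by state, roughly what counting Thread.State lines in a jstack dump gives you. The class and method names are illustrative, not part of any existing tooling.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

public class ThreadCountSketch {

    /** Counts live threads by state, similar to grepping Thread.State lines in a thread dump. */
    static Map<Thread.State, Integer> threadsByState() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info == null) continue; // thread exited between the two calls
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("live threads: " + threads.getThreadCount());
        System.out.println("peak threads: " + threads.getPeakThreadCount());
        System.out.println("by state:     " + threadsByState());
    }
}
```

Exporting these numbers as metrics lets you spot a steadily rising thread count before it shows up as memory pressure from thread stacks.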
Essential JVM Debug Commands — Complete Reference
Every production JVM memory incident requires specific commands. This section is the complete reference — categorized by tool, with exact syntax and what to look for in the output.
These commands assume JDK 11+ syntax. For JDK 8, some flags differ (noted where applicable).
Critical rule: always run diagnostic commands as the same user that owns the JVM process. In containers, exec into the container: kubectl exec -it <pod> -- /bin/bash.
jcmd — the Swiss Army knife. Replaces jinfo, jmap, jstack, and jstat for most operations. Available on all JDK 11+ installations. One tool, many functions.
jmap — heap dump and histogram. The primary tool for heap analysis. jmap -histo:live forces a full GC before counting, showing only live objects. jmap -dump:live creates a heap dump file for MAT analysis.
jstat — real-time GC monitoring. Shows GC activity in real-time without stopping the JVM. The -gcutil flag shows usage percentages for each generation. Run with 1-second interval for live debugging.
jstack — thread dump. Shows all threads and their stack traces. Essential for StackOverflowError and thread-related memory issues (ThreadLocal accumulation).
JFR — Java Flight Recorder. Low-overhead continuous profiling. Captures allocation patterns, GC events, and lock contention. Can run in production with <2% overhead.
Production insight: the most commonly confused commands are jmap -histo (object counts, fast) and jmap -dump (full heap dump, slow, pauses JVM). Use -histo first to get a quick overview; note that the :live variant triggers a full GC before counting. Only use -dump when you need the full object graph for MAT analysis. Dumping a 16GB heap can pause the JVM for tens of seconds to several minutes, depending on the amount of live data and disk speed.
Edge case: in Kubernetes, the JVM process PID is usually 1 (the container entrypoint). If your container runs a wrapper script, the JVM PID may be different. Use ps aux | grep java to find the actual PID. Some commands require JAVA_HOME to be set — verify with echo $JAVA_HOME before running.
jvm_debug_commands.sh
#!/bin/bash
# ============================================================
# JVM Debug Commands: Production Reference
# Run from inside the container or on the host with JVM access
# ============================================================
PID=$(pgrep -f 'java.*-Xmx' | head -n 1) # Find JVM PID (first match; assumes -Xmx is on the command line)
# ============================================================
# JCMD: Swiss Army Knife (JDK 11+)
# ============================================================
# List all JVM processes
jcmd
# JVM summary (uptime, arguments, heap config)
jcmd $PID VM.info
# Native memory breakdown (heap, thread, class, GC, direct)
# Requires the JVM to be started with -XX:NativeMemoryTracking=summary (or =detail)
jcmd $PID VM.native_memory summary
jcmd $PID VM.native_memory baseline # Set baseline for diff
jcmd $PID VM.native_memory summary.diff # Change since last baseline
# Classloader statistics (class count, classloader count)
jcmd $PID VM.classloader_stats
# GC class statistics (instance count and size by class)
# JDK 8-14 only, needs -XX:+UnlockDiagnosticVMOptions; removed in JDK 15
jcmd $PID GC.class_stats | head -20
# Force full GC
jcmd $PID GC.run
# Print all VM flags
jcmd $PID VM.flags -all | grep -E '(HeapDump|GC|Metaspace|DirectMemory|ThreadStackSize)'
# Print the JVM command line (shows the flags the process was started with)
jcmd $PID VM.command_line
# Thread dump (replaces jstack)
jcmd $PID Thread.print
# Heap dump
jcmd $PID GC.heap_dump /tmp/heap.hprof
# Heap histogram (live objects only, forces GC)
jcmd $PID GC.class_histogram | head -30
# JFR: start recording
jcmd $PID JFR.start name=debug settings=profile maxsize=100M maxage=1h
# JFR: dump recording
jcmd $PID JFR.dump name=debug filename=/tmp/recording.jfr
# JFR: stop recording
jcmd $PID JFR.stop name=debug
# ============================================================
# JMAP: Heap Dump and Histogram
# ============================================================
# Histogram of live objects (first 30 lines, largest classes first; forces GC)
jmap -histo:live $PID | head -30
# Histogram of all objects (including unreachable; faster, no GC)
jmap -histo $PID | head -30
# Full heap dump (live objects only; forces GC first)
jmap -dump:live,format=b,file=/tmp/heap.hprof $PID
# Full heap dump (all objects; faster but larger file)
jmap -dump:format=b,file=/tmp/heap_all.hprof $PID
# ============================================================
# JSTAT: Real-Time GC Monitoring
# ============================================================
# GC utilization every 1 second, 10 samples
jstat -gcutil $PID 1000 10
# Output columns:
# S0 : Survivor 0 usage %
# S1 : Survivor 1 usage %
# E : Eden usage %
# O : Old gen usage % ← KEY METRIC for leak detection
# M : Metaspace usage %
# CCS : Compressed class space usage %
# YGC : Young GC count
# YGCT : Young GC total time (seconds)
# FGC : Full GC count ← SHOULD BE 0 in a healthy service
# FGCT : Full GC total time (seconds)
# GCT : Total GC time (seconds)
# Key diagnostic:
# If O (old gen) keeps growing after GC → memory leak
# If FGC > 0 and increasing → old gen pressure
# If GCT/uptime > 5% → GC overhead problem
# ============================================================
# JSTACK: Thread Dump
# ============================================================
# Full thread dump
jstack $PID > /tmp/threads.txt
# Thread dump with lock information
jstack -l $PID > /tmp/threads_locked.txt
# Count threads by state (useful for thread leak detection)
jstack $PID | grep "java.lang.Thread.State" | sort | uniq -c | sort -rn
# ============================================================
# KUBERNETES / CONTAINER COMMANDS
# ============================================================
# Pod memory usage
kubectl top pod <pod-name> --containers
# Pod memory limits and requests
kubectl describe pod <pod-name> | grep -A 10 "Limits\|Requests"
# Container OOM kill events
kubectl get events --field-selector reason=OOMKilling
# Exec into running container
kubectl exec -it <pod-name> -- /bin/bash
# Check container memory limit from inside container
cat /sys/fs/cgroup/memory/memory.limit_in_bytes # cgroup v1
cat /sys/fs/cgroup/memory.max # cgroup v2
# Check container memory usage from inside container
cat /sys/fs/cgroup/memory/memory.usage_in_bytes # cgroup v1
cat /sys/fs/cgroup/memory.current # cgroup v2
# ============================================================
# QUICK DIAGNOSTIC SEQUENCE (run this for any OOM)
# ============================================================
echo "=== Quick JVM Memory Diagnostic ==="
echo "PID: $PID"
echo ""
echo "--- 1. JVM Flags ---"
jcmd $PID VM.flags -all | grep -E '(MaxHeap|MaxMetaspace|MaxDirect|ThreadStack|GC)'
echo ""
echo "--- 2. Native Memory Summary ---"
jcmd $PID VM.native_memory summary
echo ""
echo "--- 3. Heap Histogram (top 15) ---"
jmap -histo:live $PID | head -15
echo ""
echo "--- 4. GC Status ---"
jstat -gcutil $PID 1000 5
echo ""
echo "--- 5. Thread Count ---"
jcmd $PID Thread.print | grep "java.lang.Thread.State" | wc -l
echo ""
echo "=== Diagnostic Complete ==="
The Five Commands You Need at 2 AM
jcmd $PID VM.native_memory summary - shows where all JVM memory is going (heap, threads, metaspace, direct). Requires -XX:NativeMemoryTracking=summary at JVM startup.
jmap -histo:live $PID | head -30 - shows the largest object types by count and size. Fast, no heap dump needed, but :live triggers a full GC first.
jstat -gcutil $PID 1000 - shows GC activity in real-time. Old gen growing = leak. Full GC count rising = pressure.
jcmd $PID GC.heap_dump /tmp/heap.hprof - full heap dump for MAT analysis. Pauses JVM, so use only when needed.
jstack $PID - thread dump for StackOverflowError and ThreadLocal leak detection.
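The GCT/uptime rule of thumb used with jstat (GC time above 5% of uptime signals a problem) can also be approximated from inside the JVM. The sketch below sums collection time across GarbageCollectorMXBeans and divides by uptime. It assumes the reported collection time is a reasonable proxy for jstat's GCT; for concurrent collectors the figure may include time that did not pause the application. The class and method names are illustrative.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverheadSketch {

    /** Percentage of uptime spent in GC; 0 if uptime is not yet measurable. */
    static double gcOverheadPercent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) return 0.0;
        return 100.0 * gcMillis / uptimeMillis;
    }

    /** Sums accumulated collection time across all collectors in this JVM. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) total += t;
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        double pct = gcOverheadPercent(totalGcMillis(), uptime);
        String flag = pct > 5.0 ? " (above the 5% rule of thumb)" : "";
        System.out.printf("GC overhead: %.2f%% of uptime%s%n", pct, flag);
    }
}
```

Because the ratio is cumulative since startup, a long-running service can hide a recent spike; sampling the counters periodically and comparing deltas gives a more current view.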
Production Insight
A team had no standardized debugging process for memory incidents. Each engineer used different commands in a different order. One engineer spent 20 minutes trying to find the JVM PID. Another ran jmap -dump (slow, pauses JVM) before running jmap -histo (fast, at most a short GC pause with :live); the dump took 3 minutes on a 16GB heap and the service became unresponsive.
The team created a standardized diagnostic script that runs the five essential commands in the correct order: flags (5 seconds), native memory (5 seconds), histogram (10 seconds), GC status (5 seconds), thread count (5 seconds). Total time: 30 seconds. The script runs automatically when an OOM alert fires.
Cause: no standardized diagnostic process. Effect: 20+ minutes of ad-hoc debugging per incident, wrong command order causing service disruption. Impact: extended outages, on-call burnout. Action: created automated diagnostic script, printed command cheat sheet. Result: 30-second diagnostic baseline, consistent debugging across all engineers.
Key insight: the order matters. Run fast commands first (flags, histogram, GC status). Run slow commands only if fast commands do not reveal the issue. Never run jmap -dump before jmap -histo — the histogram often reveals the problem without needing the full dump.
Key Takeaway
Five commands cover 95% of memory incidents: native_memory summary, jmap -histo, jstat -gcutil, GC.heap_dump, and Thread.print. Run fast commands first. Never dump before histogram. Print the cheat sheet.
Which Command to Run First
If: Just received OOM alert, need quick triage
→
Use: Run jcmd VM.native_memory summary (5 sec) + jmap -histo:live | head -30 (10 sec). Total 15 seconds. This covers 80% of incidents.
If: Histogram shows no dominant object, need full analysis
→
Use: Run jcmd GC.heap_dump /tmp/heap.hprof. Analyze in Eclipse MAT. Check dominator tree and histogram comparison.
If: Service is slow but not crashed, suspect GC
→
Use: Run jstat -gcutil $PID 1000 for 30 seconds. If old gen is full and Full GC count is rising, you have old gen pressure.
If: StackOverflowError or thread-related issue
→
Use: Run jstack $PID. Look for repeating method signatures in the stack trace. Count threads by state.
If: Container OOM kill (exit 137), no JVM error
→
Use: Run kubectl describe pod + kubectl top pod. Then run jcmd VM.native_memory summary inside the container to profile off-heap.
If: Need continuous profiling without stopping the service
Most developers treat OOM as a single error. It is not. The JVM has five distinct OOM conditions, each with different causes, diagnostics, and fixes. Treating them interchangeably leads to misdiagnosis.
Java heap space — the most common. The heap (young gen + old gen) is full and GC cannot free enough space. Almost always a memory leak or undersized heap.
Metaspace — class metadata storage is full. Common in hot-redeploy environments where classloaders accumulate. Rarely a sizing issue — almost always a classloader leak.
Direct buffer memory — off-heap NIO buffer allocation failed. Common in Netty, gRPC, and NIO-based services. Usually a buffer leak or insufficient MaxDirectMemorySize.
GC overhead limit exceeded — GC is running continuously and recovering almost nothing. The JVM's way of saying 'I tried GC, it did not help, you have a leak.' This is a leak indicator, not a sizing issue.
Stack overflow — thread call stack exceeded -Xss. Not a memory leak — it is a recursion depth problem. But it manifests as an OOM in monitoring.
The critical insight: each OOM type requires a different diagnostic approach. A heap dump does not help with Metaspace OOM. Increasing -Xmx does not fix direct buffer memory OOM. Matching the OOM type to the correct diagnostic tool is the first step.
Production edge case: some OOM types are caught by the JVM (heap space, metaspace), while others kill the process externally. Container OOM killer (exit code 137) bypasses the JVM entirely — no heap dump, no error message, just a dead process. This is why container memory limits must account for off-heap usage.
Performance implication: each OOM type has different latency characteristics. Heap OOM causes gradual degradation (GC pauses increase). Metaspace OOM is sudden (class loading fails). Direct memory OOM is sudden (buffer allocation fails). Stack overflow is immediate (thread dies). Understanding the failure mode helps you detect it earlier.
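The container-sizing arithmetic behind the exit-code-137 case (heap at roughly 70% of the container limit, or equivalently a container of about 1.43x the heap) is easy to get wrong under pressure. This is a small worked sketch; the 70% figure comes from the sizing rule in this guide, and the method names and sample sizes are illustrative.

```java
public class ContainerSizingSketch {

    /** Heap size (MB) for a container limit, using heapPercent of the limit (e.g. 70). */
    static long heapMbForContainer(long containerMb, int heapPercent) {
        return containerMb * heapPercent / 100; // rounds down
    }

    /** Smallest container limit (MB) that keeps heapMb at or below heapPercent of the limit. */
    static long containerMbForHeap(long heapMb, int heapPercent) {
        return (heapMb * 100 + heapPercent - 1) / heapPercent; // rounds up
    }

    public static void main(String[] args) {
        System.out.println("-Xmx for a 4096 MB container at 70%: " + heapMbForContainer(4096, 70) + "m");
        System.out.println("container for a 2048 MB heap at 70%: " + containerMbForHeap(2048, 70) + " MB");
        // What this JVM actually resolved as its maximum heap:
        System.out.println("this JVM max heap: " + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MB");
    }
}
```

A 4096 MB container gives a 2867 MB heap at 70%, and a 2048 MB heap needs a container of about 2926 MB, which is the 1.43x factor from the symptom table. The remaining headroom is what off-heap memory, thread stacks, and metaspace have to fit into.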
The Five OOM Types — Each Needs a Different Diagnostic
Heap space: heap dump (jmap, -XX:+HeapDumpOnOutOfMemoryError). Look at dominator tree for leak suspects.
Metaspace: classloader analysis (jcmd VM.classloader_stats). Look for classloaders with high class count that should have been unloaded.
Direct buffer: Native Memory Tracking (-XX:NativeMemoryTracking=detail, jcmd VM.native_memory). Look for buffer allocation without corresponding release.
GC overhead: heap dump + GC log analysis. The leak is in old gen, so look for objects that survive full GC.
Stack overflow: thread dump (jstack). Look for a repeating sequence of methods in the stack trace.
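For the direct buffer case, the JVM also exposes buffer pool usage through BufferPoolMXBean, which can complement Native Memory Tracking when you want a number from inside the process. The sketch below reads the "direct" pool before and after allocating an 8 MB direct buffer. The class name and buffer size are illustrative assumptions.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class DirectBufferUsageSketch {

    /** Bytes currently held by the JVM's "direct" buffer pool, or -1 if the pool is not found. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(8 * 1024 * 1024); // 8 MB off-heap
        long after = directBytesUsed();
        System.out.println("direct pool before: " + before + " bytes");
        System.out.println("direct pool after:  " + after + " bytes");
        System.out.println("buffer capacity:    " + buffer.capacity() + " bytes");
    }
}
```

If this value climbs steadily while heap usage stays flat, the leak is off-heap and a heap dump will not show it directly. Note that libraries which manage native memory themselves may not be fully reflected in this pool.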
A microservices team spent 3 days debugging a Metaspace OOM by increasing MaxMetaspaceSize from 256MB to 1GB. The OOM returned after 2 days. The real issue was a classloader leak caused by a reflection-based plugin system that cached Class objects in a static HashMap. Each redeployment loaded new classes but the old Class references were never released. The static HashMap grew indefinitely.
Cause: static HashMap caching Class objects from dynamically loaded classloaders. Effect: old classloaders could not be GC'd because the static map held references. Metaspace grew by 50MB per redeployment. Impact: service crashed every 2-3 days. Action: replaced static HashMap with WeakHashMap, added classloader leak detection using -verbose:class. Result: Metaspace stabilized at 80MB, no further OOMs.
Trade-off: WeakHashMap entries can be GC'd at any time, which means cached Class lookups may return null. Added a fallback path that reloads the class if the WeakHashMap entry was collected. Performance impact: ~0.1ms per cache miss, acceptable for a plugin system.
Key Takeaway
Five OOM types, five different causes, five different diagnostics. Match the error message to the correct tool before debugging. Heap dump for heap space, classloader stats for metaspace, NMT for direct memory, GC logs for overhead, thread dump for stack overflow.
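The error-message-to-tool matching in this takeaway can be encoded as a small lookup, for example in an alert handler that attaches the right first command to the page. This is a hedged sketch: the message fragments are the standard JVM error strings, while the class name, the suggested-tool wording, and the fallback text are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OomTriageSketch {

    // Error message fragment -> first diagnostic to reach for.
    private static final Map<String, String> FIRST_TOOL = new LinkedHashMap<>();
    static {
        FIRST_TOOL.put("Java heap space", "heap dump (jmap / GC.heap_dump) + MAT dominator tree");
        FIRST_TOOL.put("Metaspace", "jcmd <pid> VM.classloader_stats");
        FIRST_TOOL.put("Direct buffer memory", "jcmd <pid> VM.native_memory summary (NMT)");
        FIRST_TOOL.put("GC overhead limit exceeded", "GC logs + heap dump");
        FIRST_TOOL.put("StackOverflowError", "thread dump (jstack / Thread.print)");
    }

    /** Returns the suggested first tool for a log line, or a container hint if nothing matches. */
    static String firstTool(String errorLine) {
        for (Map.Entry<String, String> e : FIRST_TOOL.entrySet()) {
            if (errorLine.contains(e.getKey())) return e.getValue();
        }
        return "no JVM error matched: check exit code 137 and container limits";
    }

    public static void main(String[] args) {
        System.out.println(firstTool("java.lang.OutOfMemoryError: Java heap space"));
        System.out.println(firstTool("java.lang.OutOfMemoryError: Metaspace"));
        System.out.println(firstTool("Killed"));
    }
}
```

The fallback branch matters as much as the matches: when no JVM error string is present at all, the process was most likely killed from outside, and the JVM-level tools are the wrong place to start.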
Which OOM Type Are You Dealing With
Heap Dump Analysis: Finding the Leak
A heap dump is a snapshot of every object in the JVM heap at a point in time. It is the single most important diagnostic artifact for heap OOM. Without it, you are guessing. With it, you can identify the exact object, its reference chain to GC root, and its retained size.
The key concept is the dominator tree. In a heap dump, object A dominates object B if every path from GC roots to B goes through A. The dominator tree shows which objects retain the most memory. The top entries in the dominator tree are your leak suspects.
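To make the dominator idea concrete, here is a toy sketch that computes retained size by brute force: an object's retained size is the total shallow size of everything that becomes unreachable from the GC root once that object is removed. Real tools such as MAT use a proper dominator-tree algorithm; the tiny object graph, names, and byte sizes below are invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RetainedSizeSketch {

    // Hypothetical object graph: entry2 is shared by cacheMap and auditLog.
    static final Map<String, List<String>> REFS = Map.of(
            "gcRoot", List.of("cacheMap", "auditLog"),
            "cacheMap", List.of("entry1", "entry2"),
            "auditLog", List.of("entry2"));
    static final Map<String, Long> SHALLOW = Map.of(
            "gcRoot", 0L, "cacheMap", 48L, "auditLog", 16L, "entry1", 200L, "entry2", 200L);

    /** Objects reachable from root without passing through 'removed' (null removes nothing). */
    static Set<String> reachable(Map<String, List<String>> refs, String root, String removed) {
        Set<String> seen = new HashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        if (!root.equals(removed)) todo.push(root);
        while (!todo.isEmpty()) {
            String obj = todo.pop();
            if (!seen.add(obj)) continue;
            for (String next : refs.getOrDefault(obj, List.of())) {
                if (!next.equals(removed)) todo.push(next);
            }
        }
        return seen;
    }

    /** Retained size of obj: bytes that become unreachable if obj is removed from the graph. */
    static long retainedSize(Map<String, List<String>> refs, Map<String, Long> shallow,
                             String root, String obj) {
        Set<String> freed = reachable(refs, root, null);
        freed.removeAll(reachable(refs, root, obj));
        return freed.stream().mapToLong(shallow::get).sum();
    }

    public static void main(String[] args) {
        System.out.println("cacheMap shallow:  " + SHALLOW.get("cacheMap"));
        System.out.println("cacheMap retained: " + retainedSize(REFS, SHALLOW, "gcRoot", "cacheMap"));
    }
}
```

In this graph cacheMap retains itself and entry1 but not entry2, because entry2 is still reachable through auditLog. That is why a shared object does not count toward either holder's retained size, and why retained size, not shallow size, is the number to sort by when hunting a leak.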
Eclipse MAT (Memory Analyzer Tool) is the standard tool for heap dump analysis. The three reports that matter most: Leak Suspects Report (automated analysis), Dominator Tree (manual exploration), and Histogram (object count by type).
The Leak Suspects Report is the starting point. It identifies objects with unusually high retained size and shows the reference chain from GC root. If the report identifies a single suspect consuming 60%+ of heap, you have found the leak.
But the automated report does not always find the leak. Some leaks are distributed — no single object dominates, but thousands of small objects accumulate. In this case, use the Histogram to find object types with unexpectedly high counts. Compare with a second heap dump taken 1 hour later. The type with the fastest-growing count is the leak source.
Production insight: always take at least two heap dumps, 30-60 minutes apart. A single dump shows the current state. Two dumps show the trend. The trend is what reveals leaks.
Heap dump caveat: taking a heap dump pauses the JVM (full stop-the-world) for the duration of the dump. For a 4GB heap, this can be 10-30 seconds. For a 32GB heap, it can be several minutes. Never take a heap dump on a production system during peak traffic without understanding the pause impact. Use jmap -dump:live,format=b,file=heap.hprof <pid> to force a full GC first and capture only live objects, reducing dump size.
Alternative for large heaps: use JFR allocation profiling (-XX:StartFlightRecording=settings=profile) to capture allocation patterns without a full heap dump. JFR adds less than 2% overhead and can run continuously in production. It does not show object graphs, but it shows which code is allocating the most memory.
Performance trade-off: heap dump pause time is proportional to live object count, not heap size. A 16GB heap with 2GB live objects dumps faster than an 8GB heap with 6GB live objects. Use -XX:+HeapDumpOnOutOfMemoryError (auto-dump on OOM) and -XX:HeapDumpPath=/var/log/jvm/ to ensure dumps are captured even during unattended failures.
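Since the auto-dump flag only helps if it is actually set, a startup check can confirm it. The sketch below reads the option through HotSpotDiagnosticMXBean, which is available on HotSpot-based JVMs (an assumption about your runtime); the class name and the warning text are illustrative.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumpFlagCheckSketch {

    /** Reads a HotSpot VM option by name, e.g. "HeapDumpOnOutOfMemoryError". */
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return hotspot.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        String enabled = vmOption("HeapDumpOnOutOfMemoryError");
        String path = vmOption("HeapDumpPath");
        System.out.println("HeapDumpOnOutOfMemoryError = " + enabled);
        // An unset path means the dump goes to the JVM's working directory.
        System.out.println("HeapDumpPath = " + (path == null || path.isEmpty() ? "(not set)" : path));
        if (!Boolean.parseBoolean(enabled)) {
            System.out.println("WARNING: no heap dump will be written on OutOfMemoryError");
        }
    }
}
```

Running a check like this at service startup, and failing a readiness probe or logging loudly when the flag is off, is one way to avoid discovering the missing dump during the incident.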
Single dump: shows what is in the heap now. Useful for finding large objects. Cannot distinguish leak from legitimate usage.
Two dumps: shows what is growing. The object type with the fastest-growing count is the leak source.
Dominator tree: shows which objects retain the most memory. Top entries are leak suspects.
Leak Suspects Report: automated MAT analysis. Good starting point. Fails on distributed leaks (many small objects).
Histogram comparison: export histograms from both dumps, diff them. The type with the largest count increase is the leak.
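The histogram comparison can be done by hand, but a few lines of code make it repeatable. The sketch below parses two saved jmap -histo outputs and reports the class whose instance count grew the most. It assumes the usual row layout (rank, instance count, bytes, class name); the sample rows are invented and only mirror the kind of numbers discussed in this section.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HistogramDiffSketch {

    // Matches data rows such as "   1:   8200000   9840000000  com.example.Foo (module)".
    private static final Pattern ROW =
            Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)");

    /** Parses histogram text into class name -> instance count; header and total lines are skipped. */
    static Map<String, Long> instanceCounts(List<String> histogramLines) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : histogramLines) {
            Matcher m = ROW.matcher(line);
            if (m.find()) counts.put(m.group(3), Long.parseLong(m.group(1)));
        }
        return counts;
    }

    /** Returns the class whose instance count grew the most, or null when nothing grew. */
    static String fastestGrowing(Map<String, Long> earlier, Map<String, Long> later) {
        String worst = null;
        long worstDelta = 0;
        for (Map.Entry<String, Long> e : later.entrySet()) {
            long delta = e.getValue() - earlier.getOrDefault(e.getKey(), 0L);
            if (delta > worstDelta) {
                worstDelta = delta;
                worst = e.getKey();
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        List<String> first = List.of(
                "   1:       8200000     9840000000  io.thecodeforge.model.CachedRecommendation",
                "   2:       1500000       36000000  java.lang.String (java.base)");
        List<String> second = List.of(
                "   1:       8700000    10440000000  io.thecodeforge.model.CachedRecommendation",
                "   2:       1510000       36240000  java.lang.String (java.base)");
        System.out.println("fastest growing: "
                + fastestGrowing(instanceCounts(first), instanceCounts(second)));
    }
}
```

Ranking by count delta rather than absolute count is the point: large but stable types (strings, byte arrays) drop out, and the type that is actually accumulating rises to the top.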
Production Insight
A recommendation engine service used 12GB of its 16GB heap. The team took a single heap dump and found no single object dominating memory — the largest retained object was 200MB. They concluded the heap was simply too small and requested 32GB from infrastructure.
A senior engineer took two dumps 45 minutes apart and compared histograms. The count of io.thecodeforge.model.CachedRecommendation objects grew from 8.2 million to 8.7 million in 45 minutes: roughly 667,000 new objects/hour, each ~1.2KB. The leak was distributed across millions of small objects, invisible in a single dump's dominator tree.
Cause: recommendation cache had no eviction policy. Each unique user+product combination created a CachedRecommendation that was never removed. Effect: ~667K new objects/hour, ~800MB/hour growth. Impact: OOM every 20 hours. Action: added Caffeine cache with expireAfterWrite(1, TimeUnit.HOURS) and maximumSize(5_000_000). Result: steady-state heap dropped to 4GB, no OOM.
Key insight: single dump analysis missed this leak entirely because no single object dominated. Two-dump histogram comparison revealed it in minutes.
Key Takeaway
Two heap dumps 30-60 minutes apart reveal leaks that a single dump cannot. Compare histograms to find the fastest-growing object type. The dominator tree finds large objects; histogram comparison finds distributed leaks.
Heap Dump Analysis Strategy
GC Tuning: Collector Selection and Parameter Optimization
GC tuning is about trade-offs: throughput vs latency, pause time vs frequency, memory efficiency vs allocation speed. There is no universal best setting — the right configuration depends on your workload profile.
The four production GC collectors:
G1GC (default since JDK 9): balanced throughput and latency. Good default for most services. Tuning targets: -XX:MaxGCPauseMillis (default 200ms), -XX:G1HeapRegionSize, -XX:InitiatingHeapOccupancyPercent.
ZGC (production-ready since JDK 15): pause times that stay very short regardless of heap size, typically sub-millisecond on recent JDKs. Best for latency-sensitive services (trading, real-time). Trade-off: slightly lower throughput, higher CPU usage for concurrent GC threads.
Shenandoah (JDK 12+, production-ready since JDK 15): similar to ZGC, with low pause times and concurrent compaction. Trade-off: same as ZGC. Choose based on JDK vendor support.
Parallel GC: highest throughput, longest pauses. Best for batch processing where latency does not matter. Not recommended for interactive services.
The most common GC tuning mistake: switching collectors without understanding the workload. A team switched from G1GC to ZGC because they read it was 'faster.' Their service was a batch ETL pipeline that did not care about pause times. ZGC's extra CPU overhead reduced throughput by 8% for zero benefit.
Rule of thumb: if your service is latency-sensitive (p99 < 100ms), use ZGC or Shenandoah. If throughput matters more than latency, use Parallel GC. For everything else, G1GC is the right default.
Humongous allocations are a G1GC-specific problem. Objects larger than 50% of a G1 region are classified as humongous (the region size is chosen at startup, between 1MB and 32MB depending on heap size). They are allocated directly in old gen in contiguous regions and are reclaimed at the end of a marking cycle, during full GC, or, for some unreferenced objects on recent JDKs, during young collections. If your service allocates many large byte arrays or StringBuilders, humongous allocations fill old gen early, fragment the heap, and can trigger full GC storms.
Fix: increase -XX:G1HeapRegionSize to raise the humongous threshold so that fewer objects qualify, or refactor code to avoid large contiguous allocations. Check GC logs for 'Humongous Allocation' lines.
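To judge whether an allocation will be humongous, it helps to estimate the region size for a given heap. The sketch below approximates G1's default choice (heap divided by 2048, rounded down to a power of two, clamped to 1-32 MB) and halves it for the threshold. Treat it as an approximation under the assumption that initial and maximum heap are equal; the JVM's own value, visible via -XX:+PrintFlagsFinal as G1HeapRegionSize, is authoritative.

```java
public class HumongousThresholdSketch {

    static final long MB = 1024L * 1024L;

    /** Approximates G1's default region size: heap/2048, power of two, clamped to 1-32 MB. */
    static long defaultRegionBytes(long heapBytes) {
        long target = Math.max(heapBytes / 2048, MB);
        long powerOfTwo = Long.highestOneBit(target); // round down to a power of two
        return Math.min(powerOfTwo, 32 * MB);
    }

    /** Allocations larger than half a region are treated as humongous. */
    static long humongousThresholdBytes(long regionBytes) {
        return regionBytes / 2;
    }

    public static void main(String[] args) {
        for (long heapGb : new long[] {1, 8, 16, 64}) {
            long region = defaultRegionBytes(heapGb * 1024 * MB);
            System.out.println(heapGb + " GB heap: region ~" + region / MB
                    + " MB, humongous above ~" + humongousThresholdBytes(region) / 1024 + " KB");
        }
    }
}
```

On an 8 GB heap this estimate gives 4 MB regions, so a byte array above roughly 2 MB is humongous; setting -XX:G1HeapRegionSize=16m would move that line to about 8 MB.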
GC log analysis is essential. Enable GC logging with -Xlog:gc*:file=gc.log:time,uptime,level,tags:filecount=5,filesize=100M (JDK 11+). Key metrics to monitor: GC pause duration (max, p99, p95), GC frequency (pauses per minute), allocation rate (MB/sec), promotion rate (young gen to old gen MB/sec), and old gen usage after GC.
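The pause metrics listed here can be pulled out of a unified GC log with a short parser. The sketch below extracts the trailing duration from "Pause" lines and computes a nearest-rank percentile. It assumes the common -Xlog:gc line shape ending in a millisecond value; the sample lines are invented, and real logs vary by collector and JDK version.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcPauseStatsSketch {

    // Matches the trailing duration of stop-the-world lines such as
    // "[12.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(1024M) 18.204ms"
    private static final Pattern PAUSE =
            Pattern.compile("\\bPause\\b.*?\\s(\\d+(?:\\.\\d+)?)ms\\s*$");

    /** Extracts pause durations in milliseconds; lines without "Pause" or a duration are ignored. */
    static List<Double> pauseMillis(List<String> logLines) {
        List<Double> pauses = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) pauses.add(Double.parseDouble(m.group(1)));
        }
        return pauses;
    }

    /** Nearest-rank percentile (p in (0, 100]) of a non-empty list. */
    static double percentile(List<Double> values, double p) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }

    public static void main(String[] args) {
        List<String> log = List.of(
                "[1.000s][info][gc] GC(0) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(1024M) 18.204ms",
                "[2.000s][info][gc] GC(1) Concurrent Mark Cycle 45.120ms",
                "[2.500s][info][gc] GC(1) Pause Remark 480M->480M(1024M) 3.100ms",
                "[9.000s][info][gc] GC(2) Pause Full (Allocation Failure) 1020M->900M(1024M) 412.500ms");
        List<Double> pauses = pauseMillis(log);
        System.out.println("pauses parsed: " + pauses.size());
        System.out.println("max pause ms:  " + percentile(pauses, 100));
        System.out.println("over 200 ms:   " + pauses.stream().filter(ms -> ms > 200.0).count());
    }
}
```

Concurrent phases are deliberately skipped because they do not stop application threads; only the "Pause" lines count toward the latency budget.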
Production insight: the most impactful GC parameter is often not the collector itself, but the heap size relative to live data. If your live data set is 2GB and your heap is 8GB, GC has plenty of room to work. If your live data set is 6GB and your heap is 8GB, GC is constantly under pressure. Right-sizing the heap matters more than collector selection.
Edge case: containerized JVMs with cgroup memory limits. Before JDK 10 (and before 8u191 on the JDK 8 line), the JVM did not respect cgroup limits and would size the heap from host memory. JDK 10+ respects cgroup limits by default. Always verify that the JVM sees the correct memory limit, for example with java -XX:+PrintFlagsFinal -version | grep MaxHeapSize run inside the container.
A trading platform used G1GC with 32GB heap. During market open, GC pauses reached 400ms — causing order processing delays and regulatory violations. The team tuned G1GC parameters for 3 weeks, reducing pauses to 250ms. Still not good enough.
Switching to ZGC reduced pauses to 0.8ms consistently. The trade-off: ZGC used 15% more CPU for concurrent GC threads. The platform had spare CPU capacity, so this was acceptable.
Cause: G1GC stop-the-world pauses during concurrent marking. Effect: 400ms pauses during peak allocation rate. Impact: order processing delays, regulatory SLA violations. Action: switched to ZGC with -XX:+UseZGC -Xmx32g -XX:ConcGCThreads=4. Result: 0.8ms p99 pauses, 15% CPU increase, zero SLA violations.
Trade-off: if the platform had been CPU-bound, ZGC's overhead would have been unacceptable. The fix worked because CPU was the cheaper resource to trade for latency. Always profile CPU usage before switching collectors.
Key Takeaway
GC tuning is about trade-offs: throughput vs latency vs memory. Batch = Parallel GC. Real-time = ZGC. Everything else = G1GC. Full GC is always a problem — find the root cause.
GC Collector Selection
If: Service requires p99 latency < 50ms → Use: ZGC (JDK 15+) or Shenandoah (JDK 12+). Sub-millisecond pauses. Accept higher CPU overhead.
If: Service is a batch job or ETL pipeline → Use: Parallel GC. Highest throughput. Long pauses are acceptable since there is no user waiting.
If: General-purpose web service or API → Use: G1GC (JDK 9+ default). Tune MaxGCPauseMillis to your SLA. Good balance of throughput and latency.
Memory Leak Patterns and Detection
Memory leaks in Java are objects that are no longer needed but remain referenced, preventing garbage collection. Unlike C/C++ leaks (allocated memory that is never freed and can no longer be reached), Java leaks are reachable objects that should be unreachable.
The five most common leak patterns in production:
Unbounded collections — Maps, Lists, or Sets that grow without limit. The #1 cause of heap OOM. Fix: use bounded caches (Caffeine, Guava) with TTL and maximumSize.
Listener/callback registration without deregistration — registering event listeners that hold references to the subscriber object. When the subscriber should be GC'd, the listener reference keeps it alive. Fix: always deregister in close()/destroy() methods.
ThreadLocal without cleanup — ThreadLocal values persist for the lifetime of the thread. In thread pools, threads live forever. ThreadLocal values accumulate indefinitely. Fix: call threadLocal.remove() in a finally block after use.
ClassLoader leaks — in hot-redeploy environments, old classloaders remain referenced by static fields or thread-locals. The classloader cannot be GC'd, and neither can all classes it loaded. Fix: avoid static references to classes from dynamic classloaders. Use WeakReference or ServiceLoader patterns.
String.intern() abuse: String.intern() stores strings in the string pool (PermGen before JDK 7, the heap from JDK 7 onward). Interning user-generated strings makes the pool grow without bound. Fix: never intern user input. Use a bounded cache with eviction instead.
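To make the "bounded" part of the first fix concrete, here is a minimal sketch of a size-capped LRU map built only on the JDK's LinkedHashMap. It is an illustration of the hard cap, not a replacement for Caffeine or Guava: the class name BoundedLruCache is hypothetical, it has no TTL, and it is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical size-bounded LRU map built on LinkedHashMap.
 * Not thread-safe and has no TTL; it only illustrates a hard entry cap.
 */
final class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    BoundedLruCache(int maxEntries) {
        // accessOrder=true keeps entries ordered from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Invoked after each insertion; returning true evicts the least recently used entry.
        return size() > maxEntries;
    }
}
```

With a cap of 2, inserting a third key evicts whichever of the first two was touched least recently, so the map can never grow past the configured size regardless of input volume.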
Detection strategy: the sawtooth test. Monitor heap usage over time. A healthy JVM shows a sawtooth pattern — heap rises during allocation, drops after GC, returns to the same baseline. A leak shows the same sawtooth, but the baseline after GC increases over time. The post-GC baseline is the key metric.
Production tool: Java Flight Recorder (JFR) with allocation profiling. JFR samples allocations and records the call stack for each sample. Enable with -XX:StartFlightRecording=settings=profile,duration=60s,filename=alloc.jfr. Analyze with JDK Mission Control (JMC); the 'Allocation by Thread' and 'Allocation by Class' views show where memory is being allocated.
Edge case: soft reference accumulation. The JVM collects SoftReferences only when heap pressure is high. If your cache uses SoftReferences, it will consume all available heap before releasing entries. This is by design, but it makes heap appear full even when it is not leaking. Switch to WeakReference or use a proper cache library with size-based eviction.
Performance consideration: leak detection tools (JFR, MAT) add overhead. JFR adds <2% CPU overhead and can run continuously. MAT analysis requires a heap dump, which pauses the JVM. Use JFR for continuous monitoring and MAT for post-mortem analysis.
leak_detector.java (Java)
package io.thecodeforge.diagnostics;

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * MemoryLeakDetector: monitors old gen growth rate
 * to detect leaks before OOM occurs.
 *
 * Core insight: a leak shows as increasing old gen usage
 * after each full GC. The post-GC baseline is the key metric.
 */
public class MemoryLeakDetector {

    private final ScheduledExecutorService scheduler;
    // Only touched from the single scheduler thread, so no synchronization is needed.
    private final List<OldGenSnapshot> snapshots;
    private final double alertThresholdMBPerHour;
    private final LeakAlertHandler alertHandler;

    public interface LeakAlertHandler {
        void onLeakDetected(double growthRateMBPerHour,
                            long currentOldGenMB,
                            String recommendation);
    }

    public MemoryLeakDetector(
            double alertThresholdMBPerHour,
            LeakAlertHandler alertHandler
    ) {
        this.alertThresholdMBPerHour = alertThresholdMBPerHour;
        this.alertHandler = alertHandler;
        this.snapshots = new ArrayList<>();
        this.scheduler = Executors.newSingleThreadScheduledExecutor(
            r -> {
                Thread t = new Thread(r, "leak-detector");
                t.setDaemon(true);
                return t;
            }
        );
    }

    public void start(long intervalSeconds) {
        scheduler.scheduleAtFixedRate(
            this::sampleSafely,
            intervalSeconds,
            intervalSeconds,
            TimeUnit.SECONDS
        );
    }

    private void sampleSafely() {
        try {
            sampleOldGen();
        } catch (RuntimeException e) {
            // An uncaught exception would cancel all future runs of the scheduled task.
        }
    }

    private void sampleOldGen() {
        long oldGenUsedMB = getOldGenUsedMB();
        Instant now = Instant.now();
        snapshots.add(new OldGenSnapshot(now, oldGenUsedMB));

        // Keep only the last 6 hours
        Instant cutoff = now.minusSeconds(21600);
        snapshots.removeIf(s -> s.timestamp.isBefore(cutoff));

        // Need at least 6 samples before estimating a trend
        if (snapshots.size() < 6) return;

        // Calculate growth rate
        OldGenSnapshot oldest = snapshots.get(0);
        OldGenSnapshot newest = snapshots.get(snapshots.size() - 1);
        double hoursElapsed = (newest.timestamp.toEpochMilli()
            - oldest.timestamp.toEpochMilli()) / 3_600_000.0;

        // Need at least 30 minutes of data
        if (hoursElapsed < 0.5) return;

        double growthRateMBPerHour = (newest.usedMB - oldest.usedMB)
            / hoursElapsed;
        if (growthRateMBPerHour > alertThresholdMBPerHour) {
            String recommendation = buildRecommendation(
                growthRateMBPerHour, newest.usedMB);
            alertHandler.onLeakDetected(
                growthRateMBPerHour, newest.usedMB, recommendation);
        }
    }

    private long getOldGenUsedMB() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Old") || name.contains("Tenured")) {
                // Prefer usage measured after the most recent collection of this pool.
                // getCollectionUsage() is null if unsupported and reports 0 before the first collection.
                MemoryUsage afterGc = pool.getCollectionUsage();
                MemoryUsage usage = (afterGc != null && afterGc.getUsed() > 0)
                    ? afterGc : pool.getUsage();
                return usage.getUsed() / (1024 * 1024);
            }
        }
        // Fallback: use total heap usage
        MemoryMXBean memBean = ManagementFactory.getMemoryMXBean();
        return memBean.getHeapMemoryUsage().getUsed() / (1024 * 1024);
    }

    private String buildRecommendation(
            double growthRateMBPerHour, long currentOldGenMB
    ) {
        StringBuilder sb = new StringBuilder();
        sb.append("Memory leak detected. ");
        sb.append("Growth rate: ").append(String.format("%.1f", growthRateMBPerHour));
        sb.append(" MB/hour. ");
        sb.append("Current old gen: ").append(currentOldGenMB).append(" MB. ");
        sb.append("Actions: ");
        sb.append("1) Capture heap dump (jmap -dump:live,format=b,file=heap.hprof <pid>). ");
        sb.append("2) Analyze with MAT: check dominator tree and histogram. ");
        sb.append("3) Compare with previous histogram to find growing object types.");
        return sb.toString();
    }

    private static class OldGenSnapshot {
        final Instant timestamp;
        final long usedMB;

        OldGenSnapshot(Instant timestamp, long usedMB) {
            this.timestamp = timestamp;
            this.usedMB = usedMB;
        }
    }
}
The Sawtooth Test — Is It a Leak or Just Load?
Healthy pattern: heap rises to 4GB, GC brings it back to 1.5GB. Next cycle: rises to 4GB, back to 1.5GB. Baseline is stable.
Leak pattern: heap rises to 4GB, GC brings it to 1.5GB. Next cycle: rises to 4GB, back to 1.8GB. Next: back to 2.1GB. Baseline is rising.
Key metric: old gen usage after full GC. Monitor this, not peak heap usage.
Detection: take snapshots every 30 seconds. Calculate growth rate of post-GC baseline. Alert if >5% per hour.
False positive: legitimate cache growth (new data being cached) looks like a leak. Distinguish by checking if the growth stabilizes.
Production Insight
A session management service showed stable memory usage for 6 months. After a feature release, the team noticed heap usage after GC growing at 100MB/hour. They suspected a leak but could not find it in the heap dump — no single object dominated.
The leak was a ThreadLocal in a request filter that stored user context. The filter was called on every request, and the ThreadLocal was set but never removed. In a thread pool, threads live forever, so ThreadLocal values accumulated indefinitely. Each user context was ~2KB. At 50,000 unique users per hour, that was 100MB/hour.
Cause: ThreadLocal.set() without ThreadLocal.remove() in a request filter. Effect: each thread accumulated user contexts for every user it served. Impact: 100MB/hour growth, OOM every 10 hours. Action: added threadLocal.remove() in a finally block after request processing. Result: memory growth dropped to zero.
Why the heap dump did not help: ThreadLocal values are stored in the Thread object's threadLocals map, not in a global collection. The dominator tree showed many Thread objects, each holding a small map. Without knowing to look at ThreadLocal, the dump appeared healthy.
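The set-then-remove discipline from this incident can be sketched as follows. This is an illustrative filter, not the service's actual code: the class name UserContextFilter, the String context type, and the leftover() helper are assumptions made for the example.

```java
/** Hypothetical request filter showing ThreadLocal cleanup on a pooled thread. */
final class UserContextFilter {
    private static final ThreadLocal<String> USER_CONTEXT = new ThreadLocal<>();

    /** Sets the context for one request and always clears it afterwards. */
    static String handle(String userId) {
        USER_CONTEXT.set(userId);
        try {
            // Request processing would read the context here.
            return "handled:" + USER_CONTEXT.get();
        } finally {
            // Without this call the value stays reachable for as long as the pooled thread lives.
            USER_CONTEXT.remove();
        }
    }

    /** Returns whatever the current thread still holds; null after a clean request. */
    static String leftover() {
        return USER_CONTEXT.get();
    }
}
```

Placing remove() in a finally block means the cleanup runs even when request processing throws, which is the case that manual cleanup at the end of the method tends to miss.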
Key Takeaway
Monitor post-GC old gen baseline, not peak heap usage. A rising baseline confirms a leak. Take two dumps 30-60 minutes apart and compare histograms. ThreadLocal and unbounded caches are the most common production leak sources.
Production JVM Configuration: Flags That Matter
JVM configuration is where most memory incidents are prevented — or caused. The wrong flags make debugging impossible. The right flags make it trivial.
Non-negotiable production flags:
-XX:+HeapDumpOnOutOfMemoryError — captures a heap dump when OOM occurs. Without this, you have no diagnostic data after the crash. Set -XX:HeapDumpPath to a persistent directory (not /tmp in containers — /tmp is often tmpfs and too small).
-Xlog:gc*:file=gc.log:time,uptime,level,tags:filecount=5,filesize=100M enables GC logging with rotation. Essential for diagnosing GC issues. This is the unified logging syntax (JDK 9+). For JDK 8, use -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log.
-XX:+ExitOnOutOfMemoryError — kills the JVM immediately on OOM instead of leaving it in an undefined state. In containerized environments, this ensures the container restarts via the orchestrator. Without this, the JVM may continue running in a degraded state, accepting requests it cannot process.
-XX:MaxRAMPercentage=70.0 — sets max heap as a percentage of container memory. Alternative to -Xmx for containerized deployments. Automatically adjusts when container limits change. Use 70-75% to leave room for off-heap.
Rule of thumb: set container memory limit to 1.3-1.5x your -Xmx value. For a 4GB heap, set container limit to 5.2-6GB. This covers metaspace (~100-200MB), thread stacks (200 threads × 1MB = 200MB), direct memory (~256MB), and OS overhead (~500MB).
Thread stack sizing: -Xss sets stack size per thread. Default is 512KB-1MB depending on OS. For services with many threads, this matters. 500 threads × 1MB = 500MB of stack memory. If your call depth is shallow, reduce to -Xss256k. If you have deep recursion, increase to -Xss2m.
Metaspace sizing: -XX:MaxMetaspaceSize limits metaspace growth. Without this limit, metaspace can consume all available native memory. Set it to a reasonable value (256MB-512MB for most services). If you hit the limit, it indicates a classloader leak, not insufficient space.
JFR continuous recording: -XX:StartFlightRecording=settings=default,maxsize=100M,maxage=1h — enables continuous JFR recording with rolling buffer. When an incident occurs, dump the recording with jcmd <pid> JFR.dump. This gives you allocation, GC, and lock profiling data without restarting the service.
Edge case: -XX:+UseCompressedOops is enabled by default for heaps <32GB. It compresses object pointers from 8 bytes to 4 bytes, saving ~20% heap. Above 32GB, compressed oops are disabled and each object pointer costs 8 bytes. This means a 34GB heap may perform worse than a 31GB heap due to pointer size increase. Either stay under 32GB or go significantly above (40GB+).
Heap (Xmx): 70% of container memory. This is your working memory for objects.
Thread stacks: 512KB-1MB per thread. 400 threads = 200-400MB. Reduce with -Xss256k if call depth is shallow.
Metaspace: 100-256MB for most services. Set MaxMetaspaceSize to prevent runaway growth.
Direct memory: default equals Xmx. Set MaxDirectMemorySize explicitly if using NIO/Netty.
OS overhead: 300-500MB for page cache, file descriptors, socket buffers. Never allocate 100% of container memory to JVM.
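The budget above is simple addition, which makes it easy to encode as a sanity check in deployment tooling. The sketch below is an assumption-laden illustration: the class name ContainerMemoryBudget and its parameters are hypothetical, and the inputs are the rough per-component estimates from the text, not measured values.

```java
/** Hypothetical calculator for the container memory budget described above. */
final class ContainerMemoryBudget {

    /**
     * Sums heap and off-heap components. All sizes are in MB except the
     * per-thread stack size, which is in KB to match -Xss conventions.
     */
    static long requiredContainerMB(long heapMB, int threads, long stackKBPerThread,
                                    long metaspaceMB, long directMB, long osOverheadMB) {
        long stacksMB = threads * stackKBPerThread / 1024;
        return heapMB + stacksMB + metaspaceMB + directMB + osOverheadMB;
    }

    /** Heap size in MB for a given container limit and heap fraction, e.g. 0.70. */
    static long heapForContainerMB(long containerMB, double heapFraction) {
        return (long) (containerMB * heapFraction);
    }
}
```

For a 4GB heap with 200 threads at 1MB each, 200MB of metaspace, 256MB of direct memory, and 500MB of OS overhead, requiredContainerMB returns 5252, which is about 1.28x the heap and sits at the low end of the 1.3-1.5x rule of thumb.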
Production Insight
A Kubernetes deployment set container memory limit to 4GB and -Xmx to 4GB. The service ran fine during normal traffic. During a traffic spike, the container was OOM-killed (exit code 137) every 2-3 hours. No JVM OOM error was logged — the OS killed the process before the JVM could detect the issue.
The team added NativeMemoryTracking and discovered the JVM was using 4.8GB total: 4GB heap + 300MB metaspace + 200MB thread stacks + 300MB direct memory. The container limit was 4GB, so the OS killed the process when total usage exceeded the limit.
Cause: -Xmx set equal to container memory limit with no room for off-heap. Effect: container OOM killer terminated the process. Impact: 3-5 restarts per day during peak traffic. Action: increased container limit to 6GB (4GB × 1.5), kept -Xmx at 4GB. Result: zero OOM kills.
Lesson: container memory limit must be 1.3-1.5x the heap size. The extra 30-50% covers off-heap usage that the JVM does not track against -Xmx.
Key Takeaway
Size the container at roughly 1.3-1.5x your heap (1.43x when the heap is 70% of the limit). Always enable heap dump on OOM, GC logging, and JFR. These three settings turn production memory incidents from guesswork into diagnosis. Without them, you are flying blind.
Off-Heap Memory: Direct Buffers, Native Memory, and Thread Stacks
Most JVM memory guides focus exclusively on heap. In production, off-heap memory causes at least 30% of OOM incidents. The container OOM killer does not care whether the memory is heap or off-heap — it kills when total usage exceeds the limit.
Direct ByteBuffer — allocated via ByteBuffer.allocateDirect(). Lives outside the heap in native memory. Used by NIO channels, Netty, gRPC, and file I/O. The JVM tracks direct buffer usage against -XX:MaxDirectMemorySize (default = -Xmx). If direct buffer allocation exceeds this limit, you get OOM: Direct buffer memory.
The insidious part: direct buffers are freed by a ReferenceQueue-based cleaner, not immediately when the buffer is GC'd. If the application allocates direct buffers faster than the GC and cleaner can reclaim them, you get OOM even though the buffers are technically unreachable. This is a rate problem, not a leak problem.
Thread stacks: each thread has a stack of up to -Xss. Default is 512KB-1MB. 500 threads × 1MB = 500MB in the worst case. The full size is reserved as address space at thread creation, and pages that have been touched stay committed until the thread exits. In services with dynamic thread pools, thread count can grow under load, consuming more stack memory.
Metaspace: class metadata storage. Replaced PermGen in JDK 8. Grows as classes are loaded. Bounded by -XX:MaxMetaspaceSize. Unbounded by default, so it can consume all native memory if not limited.
JNI native memory: memory allocated by native libraries via JNI. The JVM does not track this, and NativeMemoryTracking does not report it either. Common sources: database drivers (OCI, native JDBC), compression libraries (zlib, snappy), and cryptographic providers. Estimate it as the gap between the process RSS and the NMT committed total.
MappedByteBuffer — file-backed memory mapping via FileChannel.map(). Maps file contents directly into process address space. Not counted against heap or MaxDirectMemorySize. Large memory-mapped files can trigger container OOM.
Diagnosis tool: NativeMemoryTracking (NMT). Enable with -XX:NativeMemoryTracking=detail. Query with jcmd <pid> VM.native_memory summary. NMT shows the memory breakdown by category, including Java Heap, Class (metaspace), Thread, Code, GC, Compiler, Internal, and Symbol, with reserved and committed sizes for each.
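For dashboards or alerts, the NMT summary text can be reduced to per-category committed sizes. The parser below is a sketch under stated assumptions: it expects category lines of the form "-   Java Heap (reserved=...KB, committed=...KB)" as printed with the default KB scale, the class name NmtSummaryParser is hypothetical, and the exact layout may differ between JDK versions, so check it against real output from your JVM first.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the text printed by "jcmd <pid> VM.native_memory summary". */
final class NmtSummaryParser {

    // Matches category lines such as "-   Java Heap (reserved=4194304KB, committed=262144KB)".
    private static final Pattern CATEGORY = Pattern.compile(
            "^-\\s+(.+?) \\(reserved=(\\d+)KB, committed=(\\d+)KB\\)", Pattern.MULTILINE);

    /** Returns committed KB per category, in the order the categories appear. */
    static Map<String, Long> committedKBByCategory(String summary) {
        Map<String, Long> result = new LinkedHashMap<>();
        Matcher m = CATEGORY.matcher(summary);
        while (m.find()) {
            result.put(m.group(1).trim(), Long.parseLong(m.group(3)));
        }
        return result;
    }

    /** Sums committed KB for every category except the Java heap. */
    static long committedOutsideHeapKB(Map<String, Long> committed) {
        long total = 0;
        for (Map.Entry<String, Long> e : committed.entrySet()) {
            if (!e.getKey().equals("Java Heap")) {
                total += e.getValue();
            }
        }
        return total;
    }
}
```

Tracking committedOutsideHeapKB over time gives a single off-heap number to compare against the container limit minus -Xmx.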
Performance caveat: NMT adds 5-10% overhead in detail mode. Use -XX:NativeMemoryTracking=summary for production (1-2% overhead). Switch to detail mode only during active debugging.
Edge case: Netty's PooledByteBufAllocator recycles direct buffers to avoid allocation overhead. If the pool grows under load, it retains memory even after the buffers are released. Monitor Netty's pool metrics (PooledByteBufAllocator.metric()) to detect pool bloat.
off_heap_monitor.java (Java)
package io.thecodeforge.monitoring;

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

/**
 * OffHeapMonitor: tracks memory usage outside
 * the JVM heap that contributes to container OOM kills.
 */
public class OffHeapMonitor {

    public static class OffHeapReport {
        public long metaspaceUsedMB;
        public long metaspaceMaxMB;
        public long threadStackMB;
        public int threadCount;
        public long directMemoryMaxMB;
        public long compressedClassSpaceMB;
        public long codeCacheMB;
        public Map<String, String> recommendations = new HashMap<>();

        public long totalOffHeapMB() {
            // On HotSpot the Metaspace pool already includes the compressed class space,
            // so compressedClassSpaceMB is reported separately but not added again here.
            return metaspaceUsedMB + threadStackMB + codeCacheMB;
        }
    }

    public static OffHeapReport analyze() {
        OffHeapReport report = new OffHeapReport();

        // Metaspace, compressed class space, and code cache
        long codeCacheBytes = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            MemoryUsage usage = pool.getUsage();
            if (name.contains("Metaspace")) {
                report.metaspaceUsedMB = usage.getUsed() / (1024 * 1024);
                report.metaspaceMaxMB = usage.getMax() > 0
                    ? usage.getMax() / (1024 * 1024) : -1;
            } else if (name.contains("Compressed Class Space")) {
                report.compressedClassSpaceMB = usage.getUsed() / (1024 * 1024);
            } else if (name.contains("Code Cache") || name.startsWith("CodeHeap")) {
                // JDK 9+ splits the code cache into several "CodeHeap ..." pools
                codeCacheBytes += usage.getUsed();
            }
        }
        report.codeCacheMB = codeCacheBytes / (1024 * 1024);

        // Thread stacks
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        report.threadCount = threadBean.getThreadCount();
        // Rough estimate: 1MB per thread (the common -Xss default).
        // More accurate: read -XX:ThreadStackSize from the VM flags.
        report.threadStackMB = report.threadCount;

        // Direct memory limit. sun.misc.VM is not accessible on JDK 9+, so read the
        // HotSpot flag instead; a value of 0 means "default", which equals max heap.
        try {
            HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            long configured = Long.parseLong(
                hotspot.getVMOption("MaxDirectMemorySize").getValue());
            long maxDirectMemory = configured > 0
                ? configured : Runtime.getRuntime().maxMemory();
            report.directMemoryMaxMB = maxDirectMemory / (1024 * 1024);
        } catch (RuntimeException e) {
            report.directMemoryMaxMB = -1; // not a HotSpot JVM or flag unavailable
        }

        // Recommendations
        if (report.metaspaceUsedMB > 200) {
            report.recommendations.put("metaspace",
                "Metaspace using " + report.metaspaceUsedMB
                + "MB: check for classloader leaks");
        }
        if (report.threadCount > 300) {
            report.recommendations.put("threads",
                report.threadCount + " threads active, about "
                + report.threadStackMB + "MB in stacks. "
                + "Consider reducing thread pool size or -Xss.");
        }
        long totalOffHeap = report.totalOffHeapMB();
        if (totalOffHeap > 1024) {
            report.recommendations.put("total",
                "Total off-heap: " + totalOffHeap + "MB. "
                + "Ensure container memory limit accounts for this.");
        }
        return report;
    }

    /**
     * Returns bytes currently held by direct ByteBuffers, or -1 if unavailable.
     * Call this periodically to detect direct memory pressure.
     * NMT is more complete; buffers allocated without a cleaner (as Netty may do)
     * might not be counted in this pool.
     */
    public static long getDirectMemoryUsedEstimate() {
        for (BufferPoolMXBean pool
                : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }
}
The Hidden 30% — Off-Heap Memory Budget
Thread stacks: 512KB-1MB per thread. 400 threads = 200-400MB. This grows if thread pool scales up under load.
Metaspace: 100-256MB typical. Unbounded by default. Set MaxMetaspaceSize to prevent runaway growth.
Direct buffers: tracked by MaxDirectMemorySize. Default equals Xmx. Netty pools can retain memory even after release.
Native memory: JNI libraries, socket buffers, file descriptors. Not tracked by the JVM or by NMT. Estimate it from the gap between process RSS and the NMT total.
MappedByteBuffer: file-backed mapping. Not counted against heap or direct memory. Large files can trigger container OOM.
Production Insight
A gRPC service using Netty experienced container OOM kills despite heap usage never exceeding 60%. The team was baffled — heap monitoring showed no pressure.
NativeMemoryTracking revealed the issue: Netty's PooledByteBufAllocator had grown to 1.8GB of direct buffers during a traffic spike. The pool retained these buffers even after the gRPC calls completed, waiting for reuse. The container had 4GB limit, 2.4GB heap, 1.8GB Netty pool, 400MB other off-heap = 4.6GB total. Container OOM killer struck.
Cause: Netty pooled allocator retained 1.8GB of direct buffers. Effect: total memory exceeded 4GB container limit. Impact: 4-6 container OOM kills per day during peak traffic. Action: set Netty's PooledByteBufAllocator maxOrder=8 (reduced pool size) and added -XX:MaxDirectMemorySize=512m. Result: direct buffer usage stabilized at 400MB, no OOM kills.
Lesson: Netty's buffer pool is off-heap and invisible to heap monitoring. Always monitor total JVM memory (heap + off-heap), not just heap.
Key Takeaway
Off-heap memory is invisible to heap monitoring but visible to the container OOM killer. Enable NativeMemoryTracking, monitor thread count, and set explicit limits for direct memory and metaspace. The container measures total memory, not just heap.
Building a Production Memory Monitoring Stack
Memory incidents are preventable with the right monitoring. The goal is to detect problems hours before they cause OOM — not after.
Layer 1: JVM metrics (Prometheus/JMX). Expose heap usage, GC pause times, GC count, thread count, and metaspace usage via JMX. Use Micrometer or JMX Exporter for Prometheus integration. Key alerts:
- Old gen usage after GC > 70% for 10 minutes → warning
- Old gen usage after GC > 85% for 5 minutes → critical
- GC pause p99 > 500ms → warning
- GC pause p99 > 2s → critical
- Thread count > 80% of max pool size → warning
- Full GC count > 0 in last hour → investigate
Layer 2: Container metrics (cAdvisor/Kubernetes). Monitor container memory usage (not just JVM heap). Key alerts:
- Container memory > 85% of limit → warning
- Container memory > 95% of limit → critical (OOM imminent)
- Container restart count > 0 in last hour → investigate
Layer 3: Application-level metrics. Track object counts for known leak-prone structures: session cache size, connection pool size, thread-local count. These are domain-specific and catch leaks that JVM metrics miss.
Alerting philosophy: Alert on trends, not thresholds. A heap at 80% is fine if it returns to 40% after GC. A heap at 60% is a problem if it never drops below 55% after GC. The post-GC baseline trend is the most important metric.
Automated remediation: For containerized services, configure liveness probes that check heap usage. If heap exceeds 90%, the probe fails and Kubernetes restarts the pod. This is a safety net, not a fix — but it prevents the service from running in a degraded state while you investigate.
Retention and analysis: Keep GC logs and heap dumps for at least 7 days. Memory leaks can take days to manifest. If you only keep 24 hours of logs, you lose the trend data needed for diagnosis. Store dumps in object storage (S3, GCS) with lifecycle policies.
Production insight: the monitoring stack itself must not consume significant memory. A common mistake is running a heavy APM agent (100-200MB overhead) alongside the JVM. In a 2GB heap container, the agent consumes 5-10% of total memory. Use lightweight agents (JMX Exporter <20MB) or expose metrics via an HTTP endpoint without an agent.
Alert on trends: post-GC old gen baseline rising = leak. Post-GC old gen stable = right-sizing issue.
Keep GC logs and heap dumps for 7+ days. Memory leaks take days to manifest. 24-hour retention loses trend data.
Production Insight
A team had comprehensive JVM monitoring (heap, GC, threads) but no container-level monitoring. They experienced intermittent OOM kills that their JVM metrics did not predict. The issue was off-heap growth from a native compression library that consumed 800MB during peak traffic.
Adding container memory monitoring (cAdvisor + Prometheus) immediately revealed the pattern: container memory grew to 95% of limit while heap stayed at 60%. The team added a container memory alert at 85% and got 30 minutes of warning before each OOM kill.
Cause: native compression library allocated 800MB outside JVM heap. Effect: container OOM kills with no JVM-level warning. Impact: 2-3 unexpected restarts per week. Action: added container memory alerting, reduced compression buffer size, increased container limit. Result: zero OOM kills, 30+ minute early warning on memory pressure.
Monitoring overhead: the JMX Exporter added <5MB heap overhead and <0.1% CPU. The alternative (Datadog APM agent) would have added 150MB heap overhead — 7.5% of the 2GB container. Lightweight monitoring is essential in memory-constrained containers.
Key Takeaway
Three layers of monitoring: JVM (heap, GC, threads), container (total memory, OOM kills), and application (caches, pools). Alert on post-GC old gen trends, not absolute values. Keep diagnostic data for 7+ days — leaks take time to manifest.
Production incident · Post-mortem · Severity: high
The Slow Leak That Killed Black Friday: HashMap Growth Under Concurrent Load
Symptom
Checkout service OOM crashed at 8:47 PM on Black Friday. The heap dump showed 7.8GB of the 8GB heap consumed by a single ConcurrentHashMap inside io.thecodeforge.service.CheckoutSessionManager. The map had 14.2 million entries. Normal baseline was 50,000 entries.
Assumption
The team initially assumed the heap was simply too small for Black Friday traffic. They doubled -Xmx from 4GB to 8GB and redeployed. The service ran for 6 hours before crashing again. The second heap dump showed the same pattern — CheckoutSessionManager holding 14+ million entries.
Root cause
CheckoutSessionManager stored session objects in a ConcurrentHashMap with a user_id key. The session cleanup thread was supposed to evict expired sessions every 60 seconds. Under high load, the cleanup thread was starved — it ran on a shared thread pool with the request handlers. During peak traffic, the request threads consumed all CPU, and the cleanup thread never got scheduled. Sessions accumulated indefinitely. Each session object held references to the full cart, user profile, and payment token — approximately 500 bytes per entry. At 14.2 million entries, that was 7.1GB.
Fix
Replaced the cleanup-thread pattern with Caffeine cache using expireAfterAccess(30, TimeUnit.MINUTES). Caffeine handles eviction internally without a separate thread. Set maximumSize(500_000) as a hard cap. Added a monitoring alert when session count exceeds 100,000. The fix reduced steady-state memory from 4GB to 800MB and eliminated the leak entirely.
Key lesson
Doubling heap without understanding the leak just delays the crash and makes the heap dump twice as large to analyze. Find the leak first, then right-size the heap.
Never use a plain Map for session storage with manual cleanup. Use a cache library (Caffeine, Guava) with built-in TTL eviction.
Background cleanup threads on shared thread pools get starved under load. If eviction is critical, give the cleanup thread a dedicated pool or use a library that does not need one.
Monitor object counts, not just heap usage. A service using 60% heap with 14 million Map entries is in worse shape than one using 80% heap with 50,000 entries.
Set a hard maximumSize on any unbounded collection that receives data from external sources. Unbounded growth is the root cause of most production OOMs.
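The third and fifth lessons can be combined in a small sketch: eviction runs on its own single-thread executor instead of the shared request pool, and inserts are refused once a cap is reached. This uses only JDK classes; the class name SessionStore and its methods are hypothetical, the store tracks only last-access timestamps rather than full sessions, and the cap is approximate under concurrent inserts. The production fix described above used Caffeine, which remains the better choice.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical session store with a dedicated eviction thread and a size cap. */
final class SessionStore {
    private final Map<String, Long> lastAccessMillis = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final int maxSessions;

    // Dedicated executor: eviction never queues behind request-handling tasks.
    private final ScheduledExecutorService cleaner =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "session-cleaner");
                t.setDaemon(true);
                return t;
            });

    SessionStore(long ttlMillis, int maxSessions) {
        this.ttlMillis = ttlMillis;
        this.maxSessions = maxSessions;
    }

    /** Starts periodic eviction on the dedicated thread. */
    void start(long periodSeconds) {
        cleaner.scheduleWithFixedDelay(
                () -> evictExpired(System.currentTimeMillis()),
                periodSeconds, periodSeconds, TimeUnit.SECONDS);
    }

    /** Records access; refuses new sessions once the cap is reached (approximate under races). */
    boolean touch(String sessionId, long nowMillis) {
        if (!lastAccessMillis.containsKey(sessionId) && lastAccessMillis.size() >= maxSessions) {
            return false;
        }
        lastAccessMillis.put(sessionId, nowMillis);
        return true;
    }

    /** Removes sessions idle for longer than the TTL and returns how many were dropped. */
    int evictExpired(long nowMillis) {
        int before = lastAccessMillis.size();
        lastAccessMillis.values().removeIf(last -> nowMillis - last > ttlMillis);
        return before - lastAccessMillis.size();
    }

    int size() {
        return lastAccessMillis.size();
    }

    void stop() {
        cleaner.shutdownNow();
    }
}
```

Exposing size() as a metric covers the fourth lesson as well: the entry count can be alerted on directly instead of being inferred from heap usage.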
Production debug guide
Symptom-to-action guide for the memory issues you will actually encounter at 2 AM (12 entries)
Symptom · 01
java.lang.OutOfMemoryError: Java heap space — service crashes with OOM
→
Fix
Check if -XX:+HeapDumpOnOutOfMemoryError was set. If yes, analyze the heap dump with Eclipse MAT (jhat was removed in JDK 9). Look at the dominator tree; the top object consuming memory is usually the leak source. If no heap dump was captured, add the flag immediately and wait for the next occurrence. In the short term, check jstat -gcutil <pid> to see if old gen is at 100% and not collecting.
Symptom · 02
java.lang.OutOfMemoryError: Metaspace — service crashes after multiple redeployments
→
Fix
Metaspace stores class metadata. A leak here means classloaders are not being garbage collected. Common in application servers (Tomcat, JBoss) with hot-redeploy. Check if your deployment pipeline redeploys without restarting the JVM. Fix: restart the JVM on redeploy, or investigate why old classloaders are still referenced. Increase -XX:MaxMetaspaceSize only as a temporary mitigation.
Symptom · 03
java.lang.OutOfMemoryError: Direct buffer memory — NIO or Netty service crashes
→
Fix
Direct memory is allocated outside the heap via ByteBuffer.allocateDirect(). The JVM tracks this separately. Check -XX:MaxDirectMemorySize (default is -Xmx value). Common cause: Netty ByteBuf not released, or NIO channels not closed. Use NativeMemoryTracking (NMT) with -XX:NativeMemoryTracking=detail to profile direct memory allocation. In Netty, enable ResourceLeakDetector at PARANOID level temporarily.
Symptom · 04
java.lang.StackOverflowError — thread crashes with deep recursion
→
Fix
Each thread has a fixed stack size set by -Xss (default 512KB-1MB depending on OS). The error means the call stack exceeded this size. Common cause: infinite recursion, or very deep recursive algorithms. Check the stack trace for repeating method signatures. Fix: convert recursion to iteration, or increase -Xss (costs more memory per thread: 1000 threads × 2MB = up to 2GB of stack).
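As an illustration of the recursion-to-iteration fix, the sketch below computes tree depth both ways. The class TreeDepth and its Node type are invented for the example. The iterative version keeps its pending work in heap-allocated deques, so its depth limit is set by heap size, not by -Xss.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical example: tree depth computed recursively and iteratively. */
final class TreeDepth {

    static final class Node {
        final List<Node> children = new ArrayList<>();
    }

    /** Recursive version: one stack frame per level, so very deep trees can overflow the stack. */
    static int depthRecursive(Node n) {
        int max = 0;
        for (Node c : n.children) {
            max = Math.max(max, depthRecursive(c));
        }
        return 1 + max;
    }

    /** Iterative version: an explicit stack on the heap replaces the call stack. */
    static int depthIterative(Node root) {
        Deque<Node> nodes = new ArrayDeque<>();
        Deque<Integer> depths = new ArrayDeque<>();
        nodes.push(root);
        depths.push(1);
        int max = 0;
        while (!nodes.isEmpty()) {
            Node n = nodes.pop();
            int d = depths.pop();
            max = Math.max(max, d);
            for (Node c : n.children) {
                nodes.push(c);
                depths.push(d + 1);
            }
        }
        return max;
    }

    /** Builds a degenerate tree that is a single chain of the given length. */
    static Node chain(int length) {
        Node root = new Node();
        Node cur = root;
        for (int i = 1; i < length; i++) {
            Node next = new Node();
            cur.children.add(next);
            cur = next;
        }
        return root;
    }
}
```

Both methods agree on small inputs; on a chain hundreds of thousands of nodes long, only the iterative one is independent of the thread's stack size.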
Symptom · 05
GC overhead limit exceeded — service becomes unresponsive, eventually OOM
→
Fix
The JVM spent more than 98% of the last few seconds doing GC and recovered less than 2% of heap. This means GC cannot free enough memory. Root cause is almost always a memory leak — objects are referenced and cannot be collected. Analyze heap dump for leak suspects. Temporary mitigation: increase -Xmx, but this only buys time. The fix is finding and eliminating the leak.
Symptom · 06
Service response time degrades gradually over hours, eventually OOM — no single leak object visible
→
Fix
This is a generational leak pattern. Objects promoted to old gen are never collected, but they are not a single large object — they are thousands of small objects from different code paths. Use jmap -histo:live periodically to track object count growth. Compare histograms over time. The object type with the fastest-growing count is the leak source. Check for unbounded caches, connection pools without limits, or thread-local variables not cleaned up.
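Once two histograms have been captured, the comparison itself is a per-class subtraction. The sketch below assumes the jmap -histo:live output has already been parsed into maps of class name to instance count; that parsing step, the class name HistogramDiff, and the sample class names in the usage note are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: compares two class histograms (class name to instance count). */
final class HistogramDiff {

    /** Returns the instance-count increase per class, omitting classes that did not grow. */
    static Map<String, Long> growth(Map<String, Long> earlier, Map<String, Long> later) {
        Map<String, Long> delta = new HashMap<>();
        for (Map.Entry<String, Long> e : later.entrySet()) {
            long before = earlier.getOrDefault(e.getKey(), 0L);
            long diff = e.getValue() - before;
            if (diff > 0) {
                delta.put(e.getKey(), diff);
            }
        }
        return delta;
    }

    /** Returns the class with the largest increase, or null if nothing grew. */
    static String fastestGrowing(Map<String, Long> earlier, Map<String, Long> later) {
        String best = null;
        long bestDelta = 0;
        for (Map.Entry<String, Long> e : growth(earlier, later).entrySet()) {
            if (e.getValue() > bestDelta) {
                bestDelta = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

If a class such as com.example.Session grows from 50,000 to 900,000 instances between two snapshots while everything else stays flat, fastestGrowing returns that class name, which is the starting point for the dominator-tree search.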
Symptom · 07
Container killed by OOM killer (exit code 137) — no JVM OOM error logged
→
Fix
The OS killed the process because total memory (heap + off-heap + native) exceeded the container memory limit. The JVM did not OOM — the container did. Check if -Xmx is set to more than 75% of container memory. The remaining 25% covers thread stacks, metaspace, direct memory, JNI native memory, and OS page cache. Use NativeMemoryTracking to profile total JVM memory usage. Adjust container limit or reduce -Xmx.
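The sizing arithmetic is simple enough to encode and check. The sketch below assumes the 70% end of the rule; `ContainerSizing` and its method names are illustrative, and the last line in `main` prints the max heap the current JVM actually resolved so it can be compared against the container limit.

```java
public class ContainerSizing {

    /** Fraction of container memory given to the heap (the 70% end of the rule). */
    static final double HEAP_FRACTION = 0.70;

    /** Heap size in MB for a given container limit, rounded down. */
    public static long heapMbFor(long containerMb) {
        return (long) Math.floor(containerMb * HEAP_FRACTION);
    }

    /** Container limit in MB needed for a given heap: heap / 0.70, roughly 1.43x, rounded up. */
    public static long containerMbFor(long heapMb) {
        return (long) Math.ceil(heapMb / HEAP_FRACTION);
    }

    public static void main(String[] args) {
        System.out.println("4096 MB container -> -Xmx" + heapMbFor(4096) + "m");
        System.out.println("4096 MB heap -> container limit " + containerMbFor(4096) + " MB");
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("this JVM's max heap: " + maxHeapMb + " MB");
    }
}
```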
Symptom · 08
GC pause times exceed 1 second — service has high tail latency
→
Fix
Full GC is pausing all application threads. Check which GC is active (-XX:+PrintCommandLineFlags). If using Serial or Parallel GC, switch to G1GC (JDK 11+) or ZGC (JDK 15+). If already using G1GC, tune -XX:MaxGCPauseMillis (default 200ms), -XX:G1HeapRegionSize, and -XX:InitiatingHeapOccupancyPercent. Check if humongous allocations are causing premature GC: objects larger than 50% of a G1 region are humongous. The region size is derived from the heap size (1MB on small heaps, up to 32MB on large ones) unless you set it explicitly.
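The humongous threshold is plain arithmetic, sketched below. It assumes power-of-two region sizes in the common 1MB to 32MB range and ignores object header overhead, so results near the boundary are approximate. The class and method names are illustrative.

```java
public class HumongousCheck {

    /** G1 treats an object as humongous when it is larger than half a region. */
    public static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes > regionBytes / 2;
    }

    /** Smallest power-of-two region size (1 MB to 32 MB) that keeps the object non-humongous, or -1. */
    public static long minRegionBytesFor(long objectBytes) {
        for (long region = 1L << 20; region <= (32L << 20); region <<= 1) {
            if (!isHumongous(objectBytes, region)) return region;
        }
        return -1L;
    }

    public static void main(String[] args) {
        long payload = 3L << 20; // a 3 MB byte[] payload, header ignored
        System.out.println("humongous with 1 MB regions: " + isHumongous(payload, 1L << 20));
        System.out.println("min region size in MB: " + (minRegionBytesFor(payload) >> 20));
    }
}
```

A result of -1 means no region size in that range helps, and the allocation itself needs to be split or streamed.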
Symptom · 09
Heap usage spikes to 90% then drops to 30% — normal or leak?
→
Fix
This is the expected sawtooth pattern IF the drop happens after a full GC cycle and returns to the same baseline. A leak shows as: baseline increases over time. Track old gen usage after each full GC. If post-GC old gen usage grows over hours/days, you have a leak. Use jstat -gcutil <pid> 1000 to monitor. The key metric is old gen usage after full GC, not peak usage.
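The post-GC baseline is also available in-process through `MemoryPoolMXBean.getCollectionUsage()`, which reports usage as of the most recent collection of that pool. In the sketch below the pool-name match is a heuristic that covers names such as "G1 Old Gen", "PS Old Gen" and "Tenured Gen"; other collectors may name their pools differently, so treat the matcher as an assumption to verify on your JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class PostGcBaseline {

    /** Heuristic match on pool names such as "G1 Old Gen", "PS Old Gen" or "Tenured Gen". */
    public static boolean looksLikeOldGen(String poolName) {
        return poolName.contains("Old") || poolName.contains("Tenured");
    }

    /** Old gen bytes still in use after the most recent collection of that pool, or -1 if unknown. */
    public static long postGcOldGenBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (!looksLikeOldGen(pool.getName())) continue;
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if the pool does not support it
            if (afterGc != null) return afterGc.getUsed();
        }
        return -1L;
    }

    public static void main(String[] args) {
        System.out.println("post-GC old gen bytes: " + postGcOldGenBytes());
    }
}
```

Exporting this value as a gauge gives the rising-baseline signal directly, without having to post-process `jstat` output.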
Symptom · 10
Service runs fine with 100 TPS but OOMs at 1000 TPS — not a leak, just load?
→
Fix
Possibly, but verify. Check if heap usage after GC is the same at both load levels. If the post-GC baseline is the same, the issue is allocation rate exceeding GC throughput. Options: increase heap, switch to a higher-throughput collector (Parallel GC) if longer pauses are acceptable, or reduce allocation rate by object pooling or caching. If the post-GC baseline is higher at 1000 TPS, you have a load-dependent leak: objects are referenced longer under concurrency.
Symptom · 11
Memory usage grows slowly over days — no OOM yet, but trending upward
→
Fix
Early leak detection. Take periodic heap histograms with jmap -histo:live and compare. Use JFR (Java Flight Recorder) with -XX:StartFlightRecording to capture allocation patterns over time. Look for object types whose count increases monotonically. Set up monitoring alerts for old gen growth rate — alert if post-GC old gen grows more than 5% per hour.
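The 5% per hour alert rule reduces to a small calculation over two post-GC samples. A minimal sketch, with illustrative names, assuming the samples are old gen bytes measured after full GCs:

```java
public class BaselineGrowth {

    /** Growth of the post-GC baseline, normalized to percent per hour. */
    public static double percentPerHour(long firstBytes, long lastBytes, long elapsedMinutes) {
        if (firstBytes <= 0 || elapsedMinutes <= 0) return 0.0;
        double growthPercent = 100.0 * (lastBytes - firstBytes) / firstBytes;
        return growthPercent * 60.0 / elapsedMinutes;
    }

    /** Applies the threshold from the text: alert above 5% growth per hour. */
    public static boolean shouldAlert(long firstBytes, long lastBytes, long elapsedMinutes) {
        return percentPerHour(firstBytes, lastBytes, elapsedMinutes) > 5.0;
    }

    public static void main(String[] args) {
        // 1000 MB -> 1030 MB in 30 minutes is 3% in half an hour, i.e. 6% per hour
        System.out.println(shouldAlert(1000L << 20, 1030L << 20, 30));
    }
}
```

Two samples are noisy; in practice the rate is usually computed over a longer window so that one unusual GC cycle does not trigger the alert.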
Symptom · 12
OOM happens only in production, never in staging — same code, same -Xmx
→
Fix
Production data profiles differ from staging. Common causes: production has more unique users (larger session caches), more unique query patterns (larger query caches), or different traffic patterns (more concurrent connections). Compare object counts between environments using jmap -histo. The environment with higher counts reveals the data-dependent leak.
★ Quick Debug Cheat Sheet: Start Here When It Is 2 AM
You are on-call. The service is down. Use this to triage in under 60 seconds before diving deeper.
Pod killed (exit code 137), no JVM error in logs
Immediate action
Container OOM killer — total memory exceeded container limit
Commands
kubectl describe pod <pod> | grep -A5 "Last State"
kubectl top pod <pod> --containers
Fix now
Increase container memory limit to 1.43x your -Xmx, or reduce -Xmx to 70% of current container limit
Analyze heap.hprof in Eclipse MAT — check dominator tree for leak suspects
Response times spiking, service is slow but not crashed
Immediate action
GC pauses are likely killing latency
Commands
jstat -gcutil <pid> 1000
tail -100 /var/log/jvm/gc.log
Fix now
If Full GC count is rising, you have old gen pressure — leak or undersized heap
CPU at 100%, service is thrashing
Immediate action
Excessive GC or allocation rate exceeding throughput
Commands
jstat -gcutil <pid> 1000
top -Hp <pid>
Fix now
If GC time% > 10%, GC is the bottleneck — fix the leak or increase heap
java.lang.OutOfMemoryError: Metaspace
Immediate action
Classloader leak — usually from hot-redeploy
Commands
jcmd <pid> VM.classloader_stats
jcmd <pid> GC.class_stats (available up to JDK 14; on newer JDKs use jcmd <pid> GC.class_histogram)
Fix now
Restart JVM on redeploy. Check for static references to dynamic classloaders
java.lang.OutOfMemoryError: Direct buffer memory
Immediate action
NIO/Netty buffer allocation failed
Commands
jcmd <pid> VM.native_memory summary
jcmd <pid> VM.flags -all | grep MaxDirectMemorySize
Fix now
Enable Netty ResourceLeakDetector at PARANOID. Increase MaxDirectMemorySize or fix buffer leak
java.lang.StackOverflowError
Immediate action
Infinite recursion or call stack too deep
Commands
jstack <pid> | grep -A 50 "REPEATING_METHOD"
Check stack trace for repeating method signatures
Fix now
Convert recursion to iteration, or increase -Xss (costs more memory per thread)
Service runs for hours then OOM: slow leak
Immediate action
Memory leak — objects accumulating over time
Commands
jstat -gcutil <pid> 60000 (monitor every minute for 30 min)
jmap -histo:live <pid> > /tmp/histo1.txt (repeat after 1 hour)
Fix now
Diff the two histograms — the object type with the fastest-growing count is the leak source
JVM Memory Issues Compared
| Situation | Common Cause | Best Fix |
| --- | --- | --- |
| OOM: Java heap space | Memory leak or undersized heap | Analyze heap dump with MAT. Find leak via dominator tree. Fix leak, then right-size heap. |
| OOM: Metaspace | ClassLoader leak in hot-redeploy environment | Restart JVM on redeploy. Avoid static references to dynamic classloaders. Use WeakHashMap. |
| OOM: Direct buffer memory | Netty/NIO buffer leak or insufficient MaxDirectMemorySize | Enable ResourceLeakDetector. Set MaxDirectMemorySize explicitly. Monitor with NMT. |
| GC overhead limit exceeded | Memory leak: GC cannot free enough memory | Analyze heap dump. Fix the leak. Increasing heap only delays the crash. |
| StackOverflowError | Infinite recursion or deep call stack | Convert recursion to iteration. Increase -Xss if deep recursion is intentional. |
| Container OOM kill (exit 137) | Total memory (heap + off-heap) exceeds container limit | Set container limit to 1.43x heap. Add NativeMemoryTracking. Monitor container memory. |
| GC pauses >1 second | Full GC on large heap with G1GC | Switch to ZGC (sub-ms pauses) or tune G1GC MaxGCPauseMillis and IHOP. |
| Memory grows but no single leak object | Distributed leak (ThreadLocal, unbounded cache) | Compare heap histograms over time. Check ThreadLocal.remove() and cache eviction. |
| OOM only at high traffic | Allocation rate exceeds GC throughput | Reduce allocation rate (object pooling, caching). Switch to higher-throughput GC. |
| OOM after code deployment | New code introduced leak or removed cleanup | Diff deployed code. Look for new caches, new ThreadLocal, removed eviction logic. |
| Heap at 80% but stable, no leak | Working set is legitimately large | Right-size heap. Working set × 2 is a good starting point. Not every high-usage is a leak. |
| Humongous allocations in GC logs | Objects >50% of G1 region size | Increase G1HeapRegionSize or refactor large byte[]/StringBuilder allocations. |
| SoftReference cache consuming all heap | JVM only collects SoftReferences under heap pressure | Switch to size-bounded cache (Caffeine) with explicit eviction. |
| Netty buffer pool growing unbounded | PooledByteBufAllocator retains buffers under load | Set maxOrder limit. Monitor pool metrics. Use -XX:MaxDirectMemorySize. |
Key takeaways
1
Five OOM types, five different causes, five different diagnostics. Match the error message to the correct tool before debugging.
2
Two heap dumps 30-60 minutes apart reveal leaks that a single dump cannot. Compare histograms to find the fastest-growing object type.
3
GC tuning is about trade-offs: throughput vs latency vs memory. Batch = Parallel GC. Real-time = ZGC. Everything else = G1GC.
4
Monitor post-GC old gen baseline, not peak heap usage. A rising baseline confirms a leak. Peak usage is irrelevant for leak detection.
5
Set container memory to 1.43x your heap size. Off-heap memory (thread stacks, metaspace, direct buffers) is invisible to heap monitoring but visible to the container OOM killer.
6
ThreadLocal and unbounded caches are the most common production leak sources. Always call ThreadLocal.remove() in a finally block. Always set maximumSize on caches.
7
Three non-negotiable production flags: -XX:+HeapDumpOnOutOfMemoryError, GC logging, and -XX:+ExitOnOutOfMemoryError. Without them, you are flying blind.
8
Enable NativeMemoryTracking to profile off-heap memory. Container OOM kills with normal heap usage indicate off-heap pressure.
9
Netty's PooledByteBufAllocator retains direct buffers even after release. Monitor pool metrics and set explicit MaxDirectMemorySize.
10
Keep GC logs and heap dumps for 7+ days. Memory leaks take days to manifest. 24-hour retention loses the trend data needed for diagnosis.
11
Print the symptom-to-tool map and the five essential commands. When the alert fires at 2 AM, you need to triage in 60 seconds, not 45 minutes.
12
Run fast commands first (jmap -histo, jstat -gcutil). Run slow commands (jmap -dump) only when fast commands do not reveal the issue.
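The "always set maximumSize on caches" takeaway can be illustrated without any dependency. The sketch below is a stdlib stand-in, not Caffeine: it uses `LinkedHashMap` in access order with `removeEldestEntry` to evict the least recently used entry once a size cap is reached. It is not thread-safe and has no TTL, so a production service would more likely use a dedicated cache library as the article suggests.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedCache<K, V> extends LinkedHashMap<K, V> {

    private static final long serialVersionUID = 1L;

    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true gives least-recently-used iteration order
        this.maxEntries = maxEntries;
    }

    /** Called by LinkedHashMap after each insert; returning true evicts the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        BoundedCache<String, Integer> cache = new BoundedCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // touch "a" so that "b" becomes the eldest entry
        cache.put("c", 3); // exceeds the cap, so "b" is evicted
        System.out.println(cache.keySet());
    }
}
```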
Common mistakes to avoid
20 patterns
×
Treating all OOM errors the same
Symptom
each type (heap, metaspace, direct, GC overhead, stack) has a different cause and fix.
Fix
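Match the error message to its type first, then use the matching tool: heap dump for heap space, classloader stats for Metaspace, NativeMemoryTracking for direct memory, the stack trace for StackOverflowError.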
×
Not setting -XX:+HeapDumpOnOutOfMemoryError
Symptom
without it, you have zero diagnostic data after an OOM crash.
Fix
Set -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath at startup on every production JVM.
×
Setting -Xmx equal to container memory limit
Symptom
the container OOM killer strikes before the JVM OOM handler, leaving no heap dump.
Fix
Set -Xmx to 70–75% of the container limit, or use -XX:MaxRAMPercentage=70.0.
×
Doubling heap size without understanding the leak
Symptom
this just delays the crash and makes the heap dump twice as large to analyze.
Fix
Take a heap dump, find the retaining object in the dominator tree, fix the leak, then right-size the heap.
×
Monitoring peak heap usage instead of post-GC old gen baseline
Symptom
peak usage depends on allocation rate and GC timing, not leak presence.
Fix
Track old gen usage after each full GC (jstat -gcutil) and alert on a rising baseline.
×
Using plain HashMap for session storage with manual cleanup
Symptom
entries that miss the manual cleanup stay referenced and accumulate until the heap fills.
Fix
Use a Caffeine or Guava cache with TTL and maximumSize.
×
Calling ThreadLocal.set() without ThreadLocal.remove() in thread pool environments
Symptom
ThreadLocal values persist for the thread's lifetime.
Fix
Call ThreadLocal.remove() in a finally block after every use.
×
Using String.intern() on user-generated input
Symptom
creates an unbounded string pool that grows with every unique string.
Fix
Reserve intern() for a small, fixed set of values. Never call it on user-generated input.
×
Not enabling GC logging in production
Symptom
GC logs are essential for diagnosing pause time issues and allocation rate problems.
Fix
Enable GC logging at startup (-Xlog:gc*) and retain the logs for 7+ days.
×
Using a heavy APM agent in memory-constrained containers
Symptom
150MB agent overhead in a 2GB container is 7.5% of total memory.
Fix
Measure the agent's footprint with NativeMemoryTracking and budget for it in the container limit, or choose a lighter agent.
×
Switching GC collectors without understanding the workload
Symptom
ZGC adds CPU overhead that is wasted on batch jobs that do not care about latency.
Fix
Choose by workload: Parallel GC for batch jobs, ZGC for latency-critical services, G1GC for everything else.
×
Not monitoring container memory alongside JVM heap
Symptom
off-heap memory (thread stacks, metaspace, direct buffers) can be 30-50% of total usage.
Fix
Monitor container memory next to heap metrics and profile off-heap usage with NativeMemoryTracking.
×
Keeping GC logs and heap dumps for only 24 hours
Symptom
memory leaks take days to manifest, requiring longer retention for trend analysis.
Fix
Keep GC logs and heap dumps for 7+ days.
×
Ignoring humongous allocations in G1GC
Symptom
objects >50% of region size cause premature full GC and performance degradation.
Fix
Increase -XX:G1HeapRegionSize or refactor large byte[]/StringBuilder allocations.
×
Setting -XX:MaxMetaspaceSize too low
Symptom
Metaspace OOM is usually a classloader leak, not insufficient space. Fix the leak, not the limit.
Fix
Check jcmd <pid> VM.classloader_stats for classloaders that should have been unloaded, and remove static references to dynamic classloaders.
×
Not accounting for compressed oops boundary at 32GB
Symptom
a 34GB heap can perform worse than 31GB due to pointer size increase.
Fix
Keep -Xmx below the 32GB boundary (about 31GB) unless the heap needs to grow well past it.
×
Using SoftReference-based caches
Symptom
JVM only collects SoftReferences under heap pressure, allowing them to consume all available memory.
Fix
Switch to a size-bounded cache (Caffeine) with explicit eviction.
×
Running wrong diagnostic tool for the symptom
Symptom
jmap does not help with direct memory, GC logs do not help with stack overflow.
Fix
Map the symptom to the tool first: jmap for heap, jcmd VM.native_memory for direct memory, the stack trace for StackOverflowError.
×
Running jmap -dump before jmap -histo
Symptom
histogram is fast and often reveals the problem without needing the slow full dump.
Fix
Run jmap -histo:live first. Take the full dump only if the histogram does not reveal the issue.
×
Not having a standardized diagnostic script
Symptom
ad-hoc debugging at 2 AM wastes 20+ minutes per incident.
Fix
Keep a standardized diagnostic script with the essential commands ready before the incident.
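The ThreadLocal mistake in the list above is easy to reproduce. The sketch below runs two tasks on a single pooled thread and reports what the second task sees, once with cleanup in a `finally` block and once without. `ThreadLocalCleanup`, `CONTEXT` and the `"request-1"` value are illustrative names, not part of any framework.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalCleanup {

    private static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    /** Sets the value for the duration of the task and always clears it afterwards. */
    public static void runWithContext(String value, Runnable task) {
        CONTEXT.set(value);
        try {
            task.run();
        } finally {
            CONTEXT.remove();
        }
    }

    /** Runs two tasks on one pooled thread and returns what the second task sees. */
    public static String valueSeenByNextTask(boolean cleanup) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable first = cleanup
                    ? () -> runWithContext("request-1", () -> { })
                    : () -> CONTEXT.set("request-1"); // leaky variant: no remove()
            pool.submit(first).get();
            Callable<String> second = CONTEXT::get;
            return pool.submit(second).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("with remove():    " + valueSeenByNextTask(true));
        System.out.println("without remove(): " + valueSeenByNextTask(false));
    }
}
```

In the leaky variant the second task inherits the first task's value, which is both a memory leak and a data-isolation bug when the value is a user or request context.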
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
01
Walk me through how you would debug an OOM: Java heap space error in a production service. What tools would you use, what would you look for, and how would you confirm the fix?
SENIOR
02
Your service is running in a Kubernetes pod with 4GB memory limit. The pod gets killed with exit code 137 every few hours, but your JVM heap monitoring shows usage never exceeds 60%. What is happening and how would you fix it?
SENIOR
03
Explain the difference between the five OOM types in the JVM. For each, what is the typical root cause and what diagnostic tool would you use?
JUNIOR
04
You need to reduce GC pause times from 500ms to under 10ms for a latency-sensitive trading service. What GC collector would you choose, what parameters would you tune, and what trade-offs would you accept?
JUNIOR
05
A memory leak in production causes OOM every 20 hours. You take a heap dump but the dominator tree shows no single object dominating memory. How do you find the leak?
SENIOR
06
Explain the sawtooth pattern in JVM heap usage. How do you distinguish a healthy sawtooth from a memory leak? What metric do you monitor?
SENIOR
07
Your team wants to switch from G1GC to ZGC because they read it is 'faster.' What questions would you ask before approving the change, and what trade-offs would you explain?
SENIOR
08
Design a memory monitoring stack for a fleet of 200 Java microservices running in Kubernetes. What metrics would you collect, what alerts would you set, and how would you keep overhead minimal?
SENIOR
09
A service uses Netty for HTTP handling. Container memory usage grows to 95% of limit but heap usage is only 50%. Diagnose the issue and explain the fix.
JUNIOR
10
You inherit a codebase with 50 ThreadLocal usages across the application. How would you audit them for leaks, and what patterns would you enforce to prevent ThreadLocal leaks in thread pool environments?
SENIOR
11
It is 3 AM and you just received an OOM alert. You have 60 seconds before the on-call escalation. Walk me through the exact commands you would run and in what order.
SENIOR
12
Your JVM flags include -Xmx4g and the container memory limit is 4GB. Explain why this is wrong and how you would fix it.
JUNIOR
FAQ · 11 QUESTIONS
Frequently Asked Questions
01
What is the difference between OOM: Java heap space and GC overhead limit exceeded?
OOM: Java heap space means the heap is full and GC cannot free enough space for the current allocation. GC overhead limit exceeded means GC is running continuously (>98% of time) and recovering almost nothing (<2% of heap). Both indicate memory pressure, but GC overhead is the JVM's way of saying 'I tried GC and it did not help — you have a leak.' Fix the leak, do not just increase heap.
02
How much memory should I allocate to the JVM in a container?
Set container memory to 1.43x your -Xmx value (1 / 0.70). For a 4GB heap, set the container limit to about 5.7GB. The extra 1.7GB (43% of the heap, 30% of the container) covers metaspace (~256MB), thread stacks (~200MB for 200 threads), direct memory (~256MB), GC overhead (~200MB), and OS overhead (~500MB). Alternatively, use -XX:MaxRAMPercentage=70.0 to size the heap as 70% of container memory.
03
How do I find a memory leak in production?
Step 1: confirm it is a leak by monitoring post-GC old gen baseline — if it rises over hours, it is a leak. Step 2: take two heap dumps 30-60 minutes apart. Step 3: compare histograms in Eclipse MAT to find the fastest-growing object type. Step 4: follow the reference chain to GC root to find who holds the reference. Step 5: fix the reference (add eviction, call remove(), close the resource).
04
Should I use G1GC, ZGC, or Parallel GC?
G1GC for most services (good balance). ZGC for latency-sensitive services requiring p99 < 10ms (trading, real-time). Parallel GC for batch jobs where throughput matters and pauses are acceptable. Do not switch to ZGC just because it is newer — it adds CPU overhead that is wasted if you do not need sub-millisecond pauses.
05
My container is killed with exit code 137 but no JVM OOM error — what happened?
The Linux OOM killer terminated your process because total memory (heap + off-heap) exceeded the container memory limit. The JVM did not OOM — the OS killed it. Check if -Xmx equals container memory limit (wrong). Increase container limit to 1.43x heap size. Add NativeMemoryTracking to profile off-heap usage.
06
How do I diagnose Metaspace OOM?
Metaspace OOM is almost always a classloader leak, not insufficient space. Check if your service uses hot-redeploy without JVM restart. Use jcmd <pid> VM.classloader_stats to see classloader counts. Look for classloaders with high class counts that should have been unloaded. The fix is usually avoiding static references to classes from dynamic classloaders.
07
What is the sawtooth pattern and how does it help detect leaks?
Healthy JVM: heap rises during allocation, drops after GC, returns to the same baseline each time. Leaking JVM: the post-GC baseline rises over time. Monitor old gen usage after each full GC. If the baseline increases monotonically over hours, you have a leak. The post-GC baseline is the only metric that reveals a leak — peak usage is irrelevant.
08
How do I tune GC pause times?
First, determine if pauses are actually a problem — measure p99 latency. If GC pauses exceed your SLA, options: (1) tune G1GC with -XX:MaxGCPauseMillis and -XX:InitiatingHeapOccupancyPercent, (2) switch to ZGC for sub-millisecond pauses, (3) reduce allocation rate to decrease GC frequency, (4) increase heap to give GC more room. Always enable GC logging to measure the impact of changes.
09
How do I handle memory in a high-throughput service that allocates a lot of short-lived objects?
Ensure young gen is large enough to hold the working set of short-lived objects. In G1GC, this is automatic. In Parallel GC, tune -XX:NewRatio. Consider object pooling for frequently allocated large objects (but benchmark first — pooling adds complexity and can cause leaks). The most effective optimization is reducing allocation rate: reuse StringBuilder, avoid autoboxing in loops, use primitive collections.
10
What JVM flags are essential for production?
Non-negotiable: -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/log/jvm/, GC logging (-Xlog:gc*), -XX:+ExitOnOutOfMemoryError. Recommended: -XX:MaxRAMPercentage=70.0 (container), -XX:MaxMetaspaceSize=256m, JFR continuous recording. These flags turn production incidents from guesswork into diagnosis.
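One way to keep these flags from silently going missing is a startup self-check. The sketch below compares the JVM's input arguments against the non-negotiable flags by prefix and reports the ones that are absent. `StartupFlagCheck` is an illustrative name, and prefix matching is a simplification: it assumes the flags are passed on the command line or through an environment variable the JVM reports as input arguments.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

public class StartupFlagCheck {

    /** Flags the text calls non-negotiable, matched by prefix. */
    static final List<String> REQUIRED = List.of(
            "-XX:+HeapDumpOnOutOfMemoryError",
            "-XX:HeapDumpPath=",
            "-Xlog:gc",
            "-XX:+ExitOnOutOfMemoryError");

    /** Returns the required flag prefixes that no JVM argument starts with. */
    public static List<String> missing(List<String> jvmArgs) {
        List<String> missing = new ArrayList<>();
        for (String required : REQUIRED) {
            boolean present = jvmArgs.stream().anyMatch(arg -> arg.startsWith(required));
            if (!present) missing.add(required);
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> actual = ManagementFactory.getRuntimeMXBean().getInputArguments();
        List<String> gaps = missing(actual);
        if (gaps.isEmpty()) {
            System.out.println("all diagnostic flags present");
        } else {
            System.out.println("WARNING: missing diagnostic flags: " + gaps);
        }
    }
}
```

Logging the warning at startup, or failing a readiness check on it, turns a missing flag into something visible before the first incident.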
11
What are the five essential debug commands for a JVM memory incident?
1) jcmd <pid> VM.native_memory summary: shows where all JVM memory is going (requires NativeMemoryTracking enabled at startup). 2) jmap -histo:live <pid> | head -30: shows the top object types by count and size (the :live option triggers a full GC first). 3) jstat -gcutil <pid> 1000: shows GC activity in real time. 4) jcmd <pid> GC.heap_dump <file>: full heap dump for MAT analysis. 5) jstack <pid>: thread dump for StackOverflowError and ThreadLocal leaks. Run the fast commands first and leave the full heap dump until the others have not revealed the issue.