Junior 21 min · April 4, 2026

JVM Memory Issues in Production: Debugging Guide (OOM, GC, Leaks)

14.2 Million ConcurrentHashMap Entries — JVM OOM Debug

Q: What is the difference between OOM: Java heap space and GC overhead limit exceeded?

OOM: Java heap space means the heap is full and GC cannot free enough space for the current allocation. GC overhead limit exceeded means GC is running continuously (>98% of time) and recovering almost nothing (<2% of heap). Both indicate memory pressure, but GC overhead is the JVM's way of saying 'I tried GC and it did not help — you have a leak.' Fix the leak, do not just increase heap.

Q: How much memory should I allocate to the JVM in a container?

Set container memory to roughly 1.43x your -Xmx value, so that the heap is about 70% of the container. For a 4GB heap, set the container limit to 5.7GB. The extra headroom (43% on top of the heap, or 30% of the container) covers metaspace (~256MB), thread stacks (~200MB for 200 threads), direct memory (~256MB), GC overhead (~200MB), and OS overhead (~500MB). Use -XX:MaxRAMPercentage=70.0 to set the heap to 70% of container memory.
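The sizing rule is simple division, sketched here in Java. The class and method names are hypothetical, and the 0.70 heap fraction is the figure quoted above, not a universal constant.

```java
// Hypothetical helper: derives a container memory limit from a target heap size.
final class ContainerSizing {

    // Returns the container limit (MB) at which the heap is `heapFraction` of the container.
    static long containerLimitMb(long heapMb, double heapFraction) {
        if (heapMb <= 0 || heapFraction <= 0.0 || heapFraction > 1.0) {
            throw new IllegalArgumentException("heapMb must be > 0 and heapFraction in (0, 1]");
        }
        return (long) Math.ceil(heapMb / heapFraction);
    }

    public static void main(String[] args) {
        long heapMb = 4096; // 4 GB heap
        long limitMb = containerLimitMb(heapMb, 0.70);
        System.out.printf("heap=%d MB, container limit=%d MB (%.2fx heap)%n",
                heapMb, limitMb, (double) limitMb / heapMb);
    }
}
```

For a 4096 MB heap this gives 5852 MB, which is the 5.7GB figure above. Treat the result as a starting point and confirm it against Native Memory Tracking output from the real service.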

Q: How do I find a memory leak in production?

Step 1: confirm it is a leak by monitoring post-GC old gen baseline — if it rises over hours, it is a leak. Step 2: take two heap dumps 30-60 minutes apart. Step 3: compare histograms in Eclipse MAT to find the fastest-growing object type. Step 4: follow the reference chain to GC root to find who holds the reference. Step 5: fix the reference (add eviction, call remove(), close the resource).

Q: Should I use G1GC, ZGC, or Parallel GC?

G1GC for most services (good balance). ZGC for latency-sensitive services requiring p99 < 10ms (trading, real-time). Parallel GC for batch jobs where throughput matters and pauses are acceptable. Do not switch to ZGC just because it is newer — it adds CPU overhead that is wasted if you do not need sub-millisecond pauses.

Q: My container is killed with exit code 137 but no JVM OOM error — what happened?

The Linux OOM killer terminated your process because total memory (heap + off-heap) exceeded the container memory limit. The JVM did not throw an OOM; the OS killed it. Check whether -Xmx equals the container memory limit (wrong). Increase the container limit to 1.43x heap size. Enable Native Memory Tracking (-XX:NativeMemoryTracking=summary) to profile off-heap usage.

Q: How do I diagnose Metaspace OOM?

Metaspace OOM is almost always a classloader leak, not insufficient space. Check if your service uses hot-redeploy without JVM restart. Use jcmd VM.classloader_stats to see classloader counts. Look for classloaders with high class counts that should have been unloaded. The fix is usually avoiding static references to classes from dynamic classloaders.

Q: What is the sawtooth pattern and how does it help detect leaks?

Healthy JVM: heap rises during allocation, drops after GC, returns to the same baseline each time. Leaking JVM: the post-GC baseline rises over time. Monitor old gen usage after each full GC. If the baseline increases monotonically over hours, you have a leak. The post-GC baseline is the only metric that reveals a leak — peak usage is irrelevant.
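A minimal sketch of that baseline check, assuming you already collect old gen usage after each full GC (for example from GC logs). The class name, method name, and minimum-growth threshold are illustrative assumptions, not part of any JVM API.

```java
// Hypothetical helper: flags a leak when the post-GC old gen baseline keeps rising.
final class PostGcBaseline {

    // samplesMb: old gen usage (MB) measured right after each full GC, oldest first.
    // minGrowthMb: growth between first and last sample required before raising an alarm.
    static boolean looksLikeLeak(long[] samplesMb, long minGrowthMb) {
        if (samplesMb.length < 3) {
            return false; // too few samples to call it a trend
        }
        for (int i = 1; i < samplesMb.length; i++) {
            if (samplesMb[i] <= samplesMb[i - 1]) {
                return false; // baseline dropped or held steady: healthy sawtooth
            }
        }
        return samplesMb[samplesMb.length - 1] - samplesMb[0] >= minGrowthMb;
    }

    public static void main(String[] args) {
        long[] healthy = {1200, 1180, 1210, 1195};
        long[] leaking = {1200, 1350, 1500, 1700};
        System.out.println("healthy -> " + looksLikeLeak(healthy, 100));
        System.out.println("leaking -> " + looksLikeLeak(leaking, 100));
    }
}
```

The strict "every sample higher than the last" test mirrors the monotonic rule above. Real data is noisier, so a production alert would more likely use a trend line over a longer window.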

Q: How do I tune GC pause times?

First, determine if pauses are actually a problem — measure p99 latency. If GC pauses exceed your SLA, options: (1) tune G1GC with -XX:MaxGCPauseMillis and -XX:InitiatingHeapOccupancyPercent, (2) switch to ZGC for sub-millisecond pauses, (3) reduce allocation rate to decrease GC frequency, (4) increase heap to give GC more room. Always enable GC logging to measure the impact of changes.

Q: How do I handle memory in a high-throughput service that allocates a lot of short-lived objects?

Ensure young gen is large enough to hold the working set of short-lived objects. In G1GC, this is automatic. In Parallel GC, tune -XX:NewRatio. Consider object pooling for frequently allocated large objects (but benchmark first — pooling adds complexity and can cause leaks). The most effective optimization is reducing allocation rate: reuse StringBuilder, avoid autoboxing in loops, use primitive collections.

Q: What JVM flags are essential for production?

Non-negotiable: -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/log/jvm/, GC logging (-Xlog:gc*), -XX:+ExitOnOutOfMemoryError. Recommended: -XX:MaxRAMPercentage=70.0 (container), -XX:MaxMetaspaceSize=256m, JFR continuous recording. These flags turn production incidents from guesswork into diagnosis.
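One way to make sure these flags are actually in effect is to check them from inside the process at startup. The sketch below assumes a HotSpot-based JVM, because it relies on the com.sun.management.HotSpotDiagnosticMXBean interface. The class name StartupFlagCheck is hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical startup check: reports whether the diagnostic flags are enabled.
final class StartupFlagCheck {

    // Returns the current value of a HotSpot VM option, e.g. "true" or "false".
    // Throws IllegalArgumentException if the running JVM does not know the option.
    static String flagValue(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        String[] flags = {"HeapDumpOnOutOfMemoryError", "ExitOnOutOfMemoryError", "HeapDumpPath"};
        for (String flag : flags) {
            System.out.println(flag + " = " + flagValue(flag));
        }
    }
}
```

Logging these values once at boot gives you a record of the configuration that was really applied, which is useful when a wrapper script or container entrypoint silently drops JVM arguments.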

A single ConcurrentHashMap held 14.2 million entries in an 8GB heap. This guide shows how starved cleanup threads cause production OOMs and how to fix them before Black Friday.

Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Lessons pulled from things that broke in production.


July 04, 2026

last updated


Before you start (⏱ 25 min)

✓ Solid grasp of fundamentals
✓ Comfortable reading code examples
✓ Basic production concepts


⚡Quick Answer

OOM types: 5 distinct types: Heap Space, Metaspace, Direct Memory, Stack Overflow, and GC Overhead Limit. Each has a different root cause and fix — treat them separately.
Always capture heap dumps: Set -XX:+HeapDumpOnOutOfMemoryError at startup. Without it you are guessing. The dump is the only reliable way to find what was in memory at crash time.
GC pause threshold: Pauses above 200ms in latency-sensitive services indicate a tuning problem. Switch collectors or adjust generation ratios before increasing heap size.
Memory leak signal: A sawtooth heap pattern that never returns to baseline after GC. Analyze the dominator tree in your heap dump to find the root retaining object.
Metaspace OOM: Usually a classloader leak in hot-redeploy environments — not insufficient Metaspace size. Increasing MaxMetaspaceSize just delays the same failure.
Container sizing rule: Set -Xmx to 70–75% of container memory. The remaining 25–30% covers off-heap, thread stacks, metaspace, and OS page cache.

✦ Definition (~90s read)

What Are JVM Memory Issues in Production?

JVM memory debugging is the systematic process of diagnosing and resolving OutOfMemoryError (OOM) conditions in Java applications. It's not academic theory — when your production service starts throwing java.lang.OutOfMemoryError: Java heap space at 3 AM with 14.2 million ConcurrentHashMap entries consuming 8GB, you need a repeatable methodology.


This article walks through that exact scenario: a real-world case where a seemingly innocent ConcurrentHashMap grew unbounded because a cache eviction policy was missing, taking down a payment processing service. You'll learn the five OOM types covered in this guide (Java heap space, GC overhead limit exceeded, Metaspace, direct buffer memory, and stack overflow) and exactly which tool to reach for based on symptoms: heap dumps for object retention analysis, GC logs for allocation rate problems, and native memory tracking for off-heap leaks.

The debugging quick map gives you a symptom-to-tool decision tree: high GC overhead → enable GC logging (-Xlog:gc* on JDK 9+, -XX:+PrintGCDetails on JDK 8) and analyze with GCeasy; slow memory growth → take sequential heap dumps with jmap -dump:live and diff with Eclipse MAT; native memory exhaustion → use -XX:NativeMemoryTracking=summary and jcmd <pid> VM.native_memory. This isn't about theory. It's about knowing that when you see 14.2 million entries in a single ConcurrentHashMap, you're looking at a data structure that consumes roughly 2.3GB for the entries alone (an estimated ~168 bytes per entry, counting the node with its key and value, on a 64-bit JVM with compressed OOPs), and that the fix is either adding a maximumSize with Guava's CacheBuilder or switching to a bounded ConcurrentLinkedHashMap.

The article also covers GC tuning trade-offs: why G1GC with -XX:MaxGCPauseMillis=200 and -XX:G1HeapRegionSize=16m handles large heaps better than ParallelGC, and when ZGC's sub-millisecond pause times justify its 15% CPU overhead. If you're running services with heaps over 4GB, this debugging workflow will save you from the next 3 AM pager alert.

Plain-English First

JVM memory issues are like a warehouse that keeps filling up. The garbage collector is the cleanup crew — if they cannot keep up, the warehouse overflows (OOM). Memory leaks are boxes that nobody ever throws away because someone keeps holding a reference to them. GC tuning is about hiring the right cleanup crew and giving them the right schedule. The key is knowing which type of overflow you have — is it the main warehouse (heap), the filing cabinet (metaspace), the loading dock (direct memory), or the office desks (thread stacks)?

JVM memory failures are the most common cause of unplanned downtime in Java-based production systems. An OOM kill at 2 AM takes down the service, triggers alerts, and forces on-call engineers to diagnose under pressure.

Most OOM errors are preventable. The JVM provides extensive diagnostics — heap dumps, GC logs, JFR recordings — but teams rarely configure them before the incident. By the time the OOM fires, the evidence is already gone unless you captured it proactively.

This guide covers the five OOM types, GC tuning trade-offs, memory leak detection patterns, and the production configurations that prevent most memory-related outages. Every pattern comes from systems running at scale — not textbook examples.

Start with the Production Debugging Quick Map below if you are actively debugging an incident. Use the later sections for deep understanding and prevention.

Why JVM Memory Debugging Is Not Optional

JVM memory debugging is the systematic process of identifying why a Java application consumes more heap than expected, often leading to OutOfMemoryError (OOM). The core mechanic involves capturing heap dumps, analyzing object retention paths, and measuring allocation rates to pinpoint the exact data structures or code paths responsible. Without this discipline, a single ConcurrentHashMap with 14.2 million entries can silently exhaust an 8 GB heap.

In practice, memory debugging relies on two key properties: object reachability from GC roots and allocation frequency. Tools like Eclipse MAT or JProfiler compute retained heap — the memory that would be freed if an object were garbage collected. This reveals that a seemingly small map entry (key+value+overhead ~200 bytes) multiplied by millions becomes gigabytes. The real insight often lies in unexpected retention chains, not just raw object counts.
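The multiplication is worth doing explicitly. A small sketch, using the ~200 bytes per entry estimate from the paragraph above (an approximation, not a measured value) and a hypothetical helper name:

```java
// Hypothetical helper: back-of-envelope footprint for a large map.
final class MapFootprint {

    // Estimated size in decimal gigabytes for `entries` entries of `bytesPerEntry` each.
    static double estimatedGb(long entries, long bytesPerEntry) {
        return entries * bytesPerEntry / 1_000_000_000.0;
    }

    public static void main(String[] args) {
        long entries = 14_200_000L; // entry count from the incident
        long bytesPerEntry = 200L;  // rough key + value + overhead estimate
        System.out.printf("%d entries x %d bytes = %.2f GB%n",
                entries, bytesPerEntry, estimatedGb(entries, bytesPerEntry));
    }
}
```

At 200 bytes per entry, 14.2 million entries come to about 2.84 GB before counting anything the values themselves reference. The retained heap reported by MAT is the number to trust over an estimate like this.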

Use memory debugging when your application shows gradual heap growth, frequent Full GCs, or crashes with OOM. It matters most in production systems with high concurrency or caching layers, where a single unbounded data structure can bring down a service. Teams that skip this step often mistake memory leaks for normal load spikes, leading to costly autoscaling instead of a 10-line fix.

Retained Heap vs. Shallow Heap

Shallow heap is the object's own size; retained heap includes everything it keeps alive. A ConcurrentHashMap may have small shallow size but huge retained heap due to millions of entries.

Production Insight

A payment processing service cached transaction metadata in a ConcurrentHashMap without eviction — 14.2 million entries after 3 days.

Symptom: JVM crashed with 'Java heap space' OOM during peak hours, GC logs showed 95% time spent in Full GC.

Rule: Always bound in-memory caches with size limits (e.g., Guava Cache) or use weak references for ephemeral data.
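To show what "bounded" means in code, here is a sketch that uses only the JDK: a LinkedHashMap in access order with removeEldestEntry, wrapped in Collections.synchronizedMap. This is a deliberately simple stand-in for the Guava cache mentioned above, not a reproduction of the service's actual fix, and it trades the concurrency of ConcurrentHashMap for a single lock. The class name BoundedCache is hypothetical.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded LRU cache built from JDK classes only.
final class BoundedCache {

    // Returns a thread-safe map that evicts the least recently used entry
    // as soon as it holds more than maxEntries entries.
    static <K, V> Map<K, V> create(final int maxEntries) {
        LinkedHashMap<K, V> lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
        return Collections.synchronizedMap(lru);
    }

    public static void main(String[] args) {
        Map<String, String> cache = create(1_000);
        for (int i = 0; i < 1_000_000; i++) {
            cache.put("txn-" + i, "metadata-" + i);
        }
        // The map never grows past its limit, no matter how many puts arrive.
        System.out.println("entries retained: " + cache.size());
    }
}
```

For a high-concurrency service, a purpose-built cache library with a maximum size is the better choice. The point here is only that the size limit has to exist somewhere.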

Key Takeaway

Memory debugging is about finding what holds references, not just what uses memory.

A single unbounded collection is the most common cause of production OOMs.

Heap dump analysis is the only reliable way to distinguish a leak from a legit memory spike.

thecodeforge.io

Jvm Memory Debugging

Production Debugging Quick Map — Symptom to Tool

When a memory incident fires, you need to go from symptom to correct diagnostic tool in seconds. This map is designed to be printed and taped to your monitor.

The key insight: each symptom points to a specific memory region and a specific tool. Using the wrong tool wastes hours. A heap dump does not help with direct memory OOM. GC logs do not help with stack overflow. Match the symptom to the tool.

Decision flow:
1. Read the error message or symptom.
2. Find the matching row in the table below.
3. Run the diagnostic command.
4. Apply the fix.

Severity triage:
- Service crashed (OOM) → critical: capture diagnostics immediately
- Service degraded (slow) → high: capture diagnostics within 15 minutes
- Service trending toward OOM → medium: schedule diagnostics within 1 hour
- No symptoms, proactive check → low: run diagnostics during maintenance window

The table below covers the 12 most common production memory scenarios. Each row maps symptom → what to check → which tool → immediate action. This is the scan-first view — use it before reading any section in detail.

Production insight: the most time-consuming part of memory debugging is choosing the right tool. Engineers waste hours running jmap when they should be reading GC logs, or analyzing heap dumps when the issue is off-heap. This table eliminates that wasted time by mapping symptoms directly to tools.

symptom_tool_map.txt (TEXT)

SYMPTOM                          | WHAT TO CHECK                  | TOOL                           | IMMEDIATE ACTION
---------------------------------+--------------------------------+--------------------------------+-------------------------------------------
Exit code 137 (no JVM error)     | Container memory vs heap       | kubectl top + VM.native_memory | Increase container to 1.43x heap
OOM: Java heap space             | Heap contents (what is big?)   | jmap -histo + Eclipse MAT      | Find leak via dominator tree
OOM: Metaspace                   | Classloader count              | jcmd VM.classloader_stats      | Restart JVM, fix classloader leak
OOM: Direct buffer memory        | Direct buffer allocation       | jcmd VM.native_memory summary  | Fix buffer leak, set MaxDirectMemorySize
OOM: GC overhead limit           | GC frequency + old gen usage   | jstat -gcutil + GC logs        | Fix memory leak (GC cannot free enough)
StackOverflowError               | Call stack depth               | jstack <pid>                   | Convert recursion to iteration
Latency spikes (no OOM)          | GC pause times                 | GC logs (-Xlog:gc)             | Tune GC or switch collector
CPU high + slow response         | GC time percentage             | jstat -gcutil <pid> 1s         | If GC time > 5%, fix leak or increase heap
Memory grows over hours          | Old gen trend (post-GC)        | jstat -gcutil + jmap -histo    | Compare histograms, find growing types
OOM only at high traffic         | Allocation rate                | JFR (settings=profile)         | Reduce allocation rate or increase heap
OOM only in production           | Object count comparison        | jmap -histo (prod vs staging)  | Find data-dependent leak
OOM after code deploy            | Code diff (new caches/threads) | git diff + heap dump           | Check for removed eviction logic

The 60-Second Triage Rule

Error message contains 'heap space' → heap region → jmap + MAT → leak or undersized heap.
Error message contains 'Metaspace' → class metadata → jcmd classloader_stats → classloader leak.
Error message contains 'Direct buffer' → off-heap NIO → jcmd native_memory → buffer leak.
Error message contains 'GC overhead' → GC cannot free → heap dump → memory leak confirmed.
No error message, just exit code 137 → container limit → kubectl top → off-heap exceeded container limit.
No crash, just slow → GC pauses → GC logs → collector tuning or leak.
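The triage rule is mechanical enough to encode. The sketch below is one possible lookup, written under the assumption that you have the error message text and the container exit status to hand. The class and method names are hypothetical, and the returned strings simply repeat the tools named in the rule.

```java
// Hypothetical lookup: maps an error message (or its absence) to the first tool to run.
final class TriageRule {

    static String firstTool(String errorMessage, boolean exitCode137) {
        String msg = errorMessage == null ? "" : errorMessage;
        if (msg.contains("heap space")) {
            return "jmap -histo:live, then Eclipse MAT";
        }
        if (msg.contains("Metaspace")) {
            return "jcmd VM.classloader_stats";
        }
        if (msg.contains("Direct buffer")) {
            return "jcmd VM.native_memory summary";
        }
        if (msg.contains("GC overhead")) {
            return "heap dump";
        }
        if (exitCode137) {
            return "kubectl top pod";
        }
        return "GC logs"; // no crash and no message: suspect GC pauses
    }

    public static void main(String[] args) {
        System.out.println(firstTool("java.lang.OutOfMemoryError: Java heap space", false));
        System.out.println(firstTool(null, true));
    }
}
```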

Production Insight

An on-call engineer received an OOM alert at 3 AM. The error was 'java.lang.OutOfMemoryError: Java heap space.' The engineer ran jcmd VM.native_memory summary (wrong tool — that is for off-heap). The output showed nothing unusual. Then they ran kubectl top pod (wrong tool — that is for container-level). Still nothing. Then they ran jstat -gcutil (useful but not sufficient). After 45 minutes of wrong tools, they finally ran jmap -histo:live and found a HashMap with 8 million entries in 30 seconds.

Cause: mismatched symptom-to-tool mapping. Effect: 45 minutes of wasted debugging time during a 3 AM incident. Impact: extended outage, delayed root cause identification. Action: printed the symptom-to-tool map and taped it to every engineer's monitor. Result: subsequent incidents triaged in under 60 seconds.

The lesson: having the right tools is not enough. You need the right tool for the right symptom. A cheat sheet that maps symptoms to tools eliminates the most common source of debugging delays.

Key Takeaway

Symptom determines the memory region. Memory region determines the tool. Tool determines the root cause. Print the symptom-to-tool map and eliminate 45 minutes of wrong-tool debugging during incidents.

Which Tool to Use for Each Memory Region

If: Suspected heap leak (heap space OOM, growing old gen)
→ Use: jmap -histo:live for object counts, jmap -dump for heap dump analysis in MAT. Use jstat -gcutil to confirm old gen growth trend.

If: Suspected off-heap issue (container OOM, direct buffer OOM)
→ Use: jcmd VM.native_memory summary for breakdown. kubectl top pod for total container usage. Check -XX:MaxDirectMemorySize.

If: Suspected GC problem (latency spikes, GC overhead OOM)
→ Use: GC logs (-Xlog:gc*) for pause times and frequency. jstat -gcutil for real-time GC activity. Check collector type with -XX:+PrintCommandLineFlags.

If: Suspected classloader leak (Metaspace OOM)
→ Use: jcmd VM.classloader_stats for classloader counts. Check for hot-redeploy without JVM restart.

If: Suspected thread issue (StackOverflowError, high thread count)
→ Use: jstack for thread dump. ThreadMXBean.getThreadCount() for thread count. Check -Xss setting.

If: Suspected allocation rate issue (OOM only at high traffic)
→ Use: JFR with settings=profile for allocation hotspots. jstat -gcutil for allocation rate estimation. Check GC logs for promotion rate.
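For the thread case, ThreadMXBean gives you the same numbers as a thread dump without shelling out. A minimal sketch using the standard java.lang.management API (the class name ThreadCounts is hypothetical):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: live thread count, peak count, and a breakdown by state.
final class ThreadCounts {

    static Map<Thread.State, Integer> byState() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info == null) {
                continue; // thread exited between the two calls
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("live threads: " + threads.getThreadCount());
        System.out.println("peak threads: " + threads.getPeakThreadCount());
        System.out.println("by state:     " + byState());
    }
}
```

A live count that climbs steadily while the by-state breakdown fills with WAITING or TIMED_WAITING threads is the usual signature of a thread leak.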

Essential JVM Debug Commands — Complete Reference

Every production JVM memory incident requires specific commands. This section is the complete reference — categorized by tool, with exact syntax and what to look for in the output.

These commands assume JDK 11+ syntax. For JDK 8, some flags differ (noted where applicable).

Critical rule: always run diagnostic commands as the same user that owns the JVM process. In containers, exec into the container: kubectl exec -it <pod> -- /bin/bash.

jcmd — the Swiss Army knife. Replaces jinfo, jmap, jstack, and jstat for most operations. Available on all JDK 11+ installations. One tool, many functions.

jmap — heap dump and histogram. The primary tool for heap analysis. jmap -histo:live forces a full GC before counting, showing only live objects. jmap -dump:live creates a heap dump file for MAT analysis.

jstat — real-time GC monitoring. Shows GC activity in real-time without stopping the JVM. The -gcutil flag shows usage percentages for each generation. Run with 1-second interval for live debugging.

jstack — thread dump. Shows all threads and their stack traces. Essential for StackOverflowError and thread-related memory issues (ThreadLocal accumulation).

JFR — Java Flight Recorder. Low-overhead continuous profiling. Captures allocation patterns, GC events, and lock contention. Can run in production with <2% overhead.

Production insight: the most commonly confused commands are jmap -histo (object counts, fast) and jmap -dump (full heap dump, slow, pauses JVM). Use -histo first to get a quick overview. Only use -dump when you need the full object graph for MAT analysis. Dumping a 16GB heap pauses the JVM for 10-30 seconds.

Edge case: in Kubernetes, the JVM process PID is usually 1 (the container entrypoint). If your container runs a wrapper script, the JVM PID may be different. Use ps aux | grep java to find the actual PID. Some commands require JAVA_HOME to be set — verify with echo $JAVA_HOME before running.

jvm_debug_commands.sh (BASH)


#!/bin/bash
# ============================================================
# JVM Debug Commands — Production Reference
# Run from inside the container or on the host with JVM access
# ============================================================

PID=$(pgrep -f 'java.*-Xmx' | head -n 1)  # Find JVM PID (first match; adjust the pattern if the heap is sized with -XX:MaxRAMPercentage)

# ============================================================
# JCMD — Swiss Army Knife (JDK 11+)
# ============================================================

# List all JVM processes
jcmd

# JVM summary (uptime, arguments, heap config)
jcmd $PID VM.info

# Native memory breakdown (heap, thread, class, GC, direct)
# Requires the JVM to be started with -XX:NativeMemoryTracking=summary
jcmd $PID VM.native_memory summary
jcmd $PID VM.native_memory baseline      # Set baseline for diff
jcmd $PID VM.native_memory summary.diff  # Growth since the baseline

# Classloader statistics (class count, classloader count)
jcmd $PID VM.classloader_stats

# Class metadata statistics (JDK 8-14 only; needs -XX:+UnlockDiagnosticVMOptions, removed in JDK 15)
jcmd $PID GC.class_stats | head -20

# Force full GC
jcmd $PID GC.run

# Print all VM flags
jcmd $PID VM.flags -all | grep -E '(HeapDump|GC|Metaspace|DirectMemory|ThreadStackSize)'

# Print command line flags (shows effective GC settings)
jcmd $PID VM.command_line

# Thread dump (replaces jstack)
jcmd $PID Thread.print

# Heap dump
jcmd $PID GC.heap_dump /tmp/heap.hprof

# Heap histogram (live objects only, forces GC)
jcmd $PID GC.class_histogram | head -30

# JFR: start recording
jcmd $PID JFR.start name=debug settings=profile maxsize=100M maxage=1h

# JFR: dump recording
jcmd $PID JFR.dump name=debug filename=/tmp/recording.jfr

# JFR: stop recording
jcmd $PID JFR.stop name=debug

# ============================================================
# JMAP — Heap Dump and Histogram
# ============================================================

# Histogram of live objects (first 30 lines, sorted by total bytes; forces a full GC)
jmap -histo:live $PID | head -30

# Histogram of all objects (including unreachable — faster, no GC)
jmap -histo $PID | head -30

# Full heap dump (live objects only — forces GC first)
jmap -dump:live,format=b,file=/tmp/heap.hprof $PID

# Full heap dump (all objects — faster but larger file)
jmap -dump:format=b,file=/tmp/heap_all.hprof $PID

# ============================================================
# JSTAT — Real-Time GC Monitoring
# ============================================================

# GC utilization every 1 second, 10 samples
jstat -gcutil $PID 1000 10

# Output columns:
# S0  — Survivor 0 usage %
# S1  — Survivor 1 usage %
# E   — Eden usage %
# O   — Old gen usage %  ← KEY METRIC for leak detection
# M   — Metaspace usage %
# CCS — Compressed class space usage %
# YGC — Young GC count
# YGCT — Young GC total time (seconds)
# FGC — Full GC count         ← SHOULD BE 0 in healthy service
# FGCT — Full GC total time (seconds)
# GCT — Total GC time (seconds)

# Key diagnostic:
# If O (old gen) keeps growing after GC → memory leak
# If FGC > 0 and increasing → old gen pressure
# If GCT/uptime > 5% → GC overhead problem

# ============================================================
# JSTACK — Thread Dump
# ============================================================

# Full thread dump
jstack $PID > /tmp/threads.txt

# Thread dump with lock information
jstack -l $PID > /tmp/threads_locked.txt

# Count threads by state (useful for thread leak detection)
jstack $PID | grep "java.lang.Thread.State" | sort | uniq -c | sort -rn

# ============================================================
# KUBERNETES / CONTAINER COMMANDS
# ============================================================

# Pod memory usage
kubectl top pod <pod-name> --containers

# Pod memory limits and usage
kubectl describe pod <pod-name> | grep -A 10 "Limits\|Requests"

# Container OOM kill events
kubectl get events --field-selector reason=OOMKilling

# Exec into running container
kubectl exec -it <pod-name> -- /bin/bash

# Check container memory limit from inside container
cat /sys/fs/cgroup/memory/memory.limit_in_bytes  # cgroup v1
cat /sys/fs/cgroup/memory.max                     # cgroup v2

# Check container memory usage from inside container
cat /sys/fs/cgroup/memory/memory.usage_in_bytes   # cgroup v1
cat /sys/fs/cgroup/memory.current                  # cgroup v2

# ============================================================
# QUICK DIAGNOSTIC SEQUENCE (Run this for any OOM)
# ============================================================

echo "=== Quick JVM Memory Diagnostic ==="
echo "PID: $PID"
echo ""
echo "--- 1. JVM Flags ---"
jcmd $PID VM.flags -all | grep -E '(MaxHeap|MaxMetaspace|MaxDirect|ThreadStack|GC)'
echo ""
echo "--- 2. Native Memory Summary ---"
jcmd $PID VM.native_memory summary
echo ""
echo "--- 3. Heap Histogram (top 15) ---"
jmap -histo:live $PID | head -15
echo ""
echo "--- 4. GC Status ---"
jstat -gcutil $PID 1000 5
echo ""
echo "--- 5. Thread Count ---"
jcmd $PID Thread.print | grep "java.lang.Thread.State" | wc -l
echo ""
echo "=== Diagnostic Complete ==="

The Five Commands You Need at 2 AM

jcmd $PID VM.native_memory summary — shows where all JVM memory is going (heap, threads, metaspace, direct).
jmap -histo:live $PID | head -30 — shows top 30 object types by count and size. Fast, no heap dump needed.
jstat -gcutil $PID 1000 — shows GC activity in real-time. Old gen growing = leak. Full GC count rising = pressure.
jcmd $PID GC.heap_dump /tmp/heap.hprof — full heap dump for MAT analysis. Pauses JVM — use only when needed.
jstack $PID — thread dump for StackOverflowError and ThreadLocal leak detection.
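The GCT/uptime check from the jstat notes can also be computed inside the process, which is handy for exporting as a metric. The sketch below uses the standard GarbageCollectorMXBean and RuntimeMXBean APIs. The class name is hypothetical, and the 5% threshold is the one quoted in this guide. One assumption to keep in mind: with concurrent collectors the reported collection time can include work that did not pause the application, so read the result as an upper bound.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: share of JVM uptime spent in garbage collection.
final class GcOverhead {

    // Percentage of uptime spent in GC; 0 when uptime is not yet known.
    static double percent(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return 100.0 * gcMillis / uptimeMillis;
    }

    // Sum of accumulated collection time across all collectors, in milliseconds.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        double pct = percent(totalGcMillis(), uptime);
        System.out.printf("GC time: %.2f%% of uptime%s%n", pct, pct > 5.0 ? " (above 5%)" : "");
    }
}
```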

Production Insight

A team had no standardized debugging process for memory incidents. Each engineer used different commands in a different order. One engineer spent 20 minutes trying to find the JVM PID. Another ran jmap -dump (slow, long pause) before running jmap -histo (fast, only a brief pause). The dump took 3 minutes on a 16GB heap and the service became unresponsive.

The team created a standardized diagnostic script that runs the five essential commands in the correct order: flags (5 seconds), native memory (5 seconds), histogram (10 seconds), GC status (5 seconds), thread count (5 seconds). Total time: 30 seconds. The script runs automatically when an OOM alert fires.

Cause: no standardized diagnostic process. Effect: 20+ minutes of ad-hoc debugging per incident, wrong command order causing service disruption. Impact: extended outages, on-call burnout. Action: created automated diagnostic script, printed command cheat sheet. Result: 30-second diagnostic baseline, consistent debugging across all engineers.

Key insight: the order matters. Run fast commands first (flags, histogram, GC status). Run slow commands only if fast commands do not reveal the issue. Never run jmap -dump before jmap -histo — the histogram often reveals the problem without needing the full dump.

Key Takeaway

Five commands cover 95% of memory incidents: native_memory summary, jmap -histo, jstat -gcutil, GC.heap_dump, and Thread.print. Run fast commands first. Never dump before histogram. Print the cheat sheet.

Which Command to Run First

If: Just received OOM alert and need quick triage
→ Use: Run jcmd VM.native_memory summary (5 sec) + jmap -histo:live | head -30 (10 sec). Total 15 seconds. This covers 80% of incidents.

If: Histogram shows no dominant object and you need full analysis
→ Use: Run jcmd GC.heap_dump /tmp/heap.hprof. Analyze in Eclipse MAT. Check dominator tree and histogram comparison.

If: Service is slow but not crashed (suspect GC)
→ Use: Run jstat -gcutil $PID 1000 for 30 seconds. If old gen is full and Full GC count is rising, you have old gen pressure.

If: StackOverflowError or thread-related issue
→ Use: Run jstack $PID. Look for repeating method signatures in the stack trace. Count threads by state.

If: Container OOM kill (exit 137) with no JVM error
→ Use: Run kubectl describe pod + kubectl top pod. Then run jcmd VM.native_memory summary inside the container to profile off-heap.

If: Need continuous profiling without stopping the service
→ Use: Start JFR: jcmd $PID JFR.start settings=profile maxage=1h. Dump on demand: jcmd $PID JFR.dump filename=/tmp/rec.jfr. Overhead <2%.


Understanding the Five OOM Types

Most developers treat OOM as a single error. It is not. The JVM has five distinct OOM conditions, each with different causes, diagnostics, and fixes. Treating them interchangeably leads to misdiagnosis.

Java heap space — the most common. The heap (young gen + old gen) is full and GC cannot free enough space. Almost always a memory leak or undersized heap.

Metaspace — class metadata storage is full. Common in hot-redeploy environments where classloaders accumulate. Rarely a sizing issue — almost always a classloader leak.

Direct buffer memory — off-heap NIO buffer allocation failed. Common in Netty, gRPC, and NIO-based services. Usually a buffer leak or insufficient MaxDirectMemorySize.

GC overhead limit exceeded — GC is running continuously and recovering almost nothing. The JVM's way of saying 'I tried GC, it did not help, you have a leak.' This is a leak indicator, not a sizing issue.

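Stack overflow: the thread call stack exceeded -Xss. This is a recursion depth problem, not a memory leak, and the JVM reports it as StackOverflowError rather than OutOfMemoryError. It is grouped with the OOM types here because it shows up the same way in monitoring.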

The critical insight: each OOM type requires a different diagnostic approach. A heap dump does not help with Metaspace OOM. Increasing -Xmx does not fix direct buffer memory OOM. Matching the OOM type to the correct diagnostic tool is the first step.

Production edge case: some OOM types are caught by the JVM (heap space, metaspace), while others kill the process externally. Container OOM killer (exit code 137) bypasses the JVM entirely — no heap dump, no error message, just a dead process. This is why container memory limits must account for off-heap usage.

Performance implication: each OOM type has different latency characteristics. Heap OOM causes gradual degradation (GC pauses increase). Metaspace OOM is sudden (class loading fails). Direct memory OOM is sudden (buffer allocation fails). Stack overflow is immediate (thread dies). Understanding the failure mode helps you detect it earlier.
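Direct buffer memory is the region most often left unmonitored, because it does not show up in heap metrics. A minimal sketch of reading it through the standard BufferPoolMXBean API follows. The class name is hypothetical, and "direct" is the pool name HotSpot uses for NIO direct buffers.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical helper: reports off-heap memory held by NIO direct buffers.
final class DirectBufferUsage {

    // Bytes currently used by the "direct" buffer pool, or -1 if the pool is not exposed.
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024); // 16 MB off-heap
        long after = directBytesUsed();
        System.out.println("direct memory before: " + before + " bytes");
        System.out.println("direct memory after:  " + after + " bytes");
        System.out.println("buffer capacity:      " + buffer.capacity() + " bytes");
    }
}
```

Exporting this value next to heap usage makes a direct buffer leak visible as a rising line long before the allocation fails.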

oom_type_detector.java (JAVA)


package io.thecodeforge.monitoring;

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.GarbageCollectorMXBean;
import java.util.List;
import java.util.Map;
import java.util.HashMap;

/**
 * OOM Type Detector — identifies which memory region is at risk
 * before an OOM occurs.
 */
public class OomTypeDetector {

    private static final double HEAP_WARNING_THRESHOLD = 0.80;
    private static final double HEAP_CRITICAL_THRESHOLD = 0.90;
    private static final double METASPACE_WARNING_THRESHOLD = 0.80;

    public enum RiskLevel {
        HEALTHY, WARNING, CRITICAL, IMMINENT
    }

    public enum OomType {
        HEAP_SPACE,
        METASPACE,
        DIRECT_BUFFER,
        GC_OVERHEAD,
        STACK_OVERFLOW,
        CONTAINER_LIMIT
    }

    public static class MemoryRiskReport {
        public RiskLevel heapRisk;
        public RiskLevel metaspaceRisk;
        public RiskLevel gcOverheadRisk;
        public Map<OomType, String> recommendations;
        public long heapUsedMB;
        public long heapMaxMB;
        public long metaspaceUsedMB;
        public long metaspaceMaxMB;
        public double gcTimePercent;

        public MemoryRiskReport() {
            recommendations = new HashMap<>();
        }
    }

    public static MemoryRiskReport analyze() {
        MemoryRiskReport report = new MemoryRiskReport();
        MemoryMXBean memBean = ManagementFactory.getMemoryMXBean();

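        // Heap analysis
        MemoryUsage heapUsage = memBean.getHeapMemoryUsage();
        // getMax() returns -1 when no maximum is defined; fall back to the committed size
        long heapMaxBytes = heapUsage.getMax() > 0
            ? heapUsage.getMax() : heapUsage.getCommitted();
        report.heapUsedMB = heapUsage.getUsed() / (1024 * 1024);
        report.heapMaxMB = heapMaxBytes / (1024 * 1024);
        double heapPercent = (double) heapUsage.getUsed() / heapMaxBytes;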

        if (heapPercent >= HEAP_CRITICAL_THRESHOLD) {
            report.heapRisk = RiskLevel.CRITICAL;
            report.recommendations.put(OomType.HEAP_SPACE,
                "Heap at " + (int)(heapPercent * 100) + "% — capture heap dump and analyze dominator tree.");
        } else if (heapPercent >= HEAP_WARNING_THRESHOLD) {
            report.heapRisk = RiskLevel.WARNING;
            report.recommendations.put(OomType.HEAP_SPACE,
                "Heap at " + (int)(heapPercent * 100) + "% — monitor growth rate.");
        } else {
            report.heapRisk = RiskLevel.HEALTHY;
        }

        // Metaspace analysis
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Metaspace")) {
                MemoryUsage usage = pool.getUsage();
                report.metaspaceUsedMB = usage.getUsed() / (1024 * 1024);
                report.metaspaceMaxMB = usage.getMax() > 0
                    ? usage.getMax() / (1024 * 1024) : -1;
                if (report.metaspaceMaxMB > 0) {
                    double metaPercent = (double) usage.getUsed() / usage.getMax();
                    if (metaPercent >= METASPACE_WARNING_THRESHOLD) {
                        report.metaspaceRisk = RiskLevel.WARNING;
                        report.recommendations.put(OomType.METASPACE,
                            "Metaspace at " + (int)(metaPercent * 100)
                            + "% — check for classloader leaks.");
                    } else {
                        report.metaspaceRisk = RiskLevel.HEALTHY;
                    }
                }
            }
        }

        // GC overhead analysis.
        // Note: this is an average over the whole JVM lifetime, so a recent
        // GC storm on a long-running process barely moves it.
        long totalGcTimeMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionTime() returns -1 if the collector does not report it
            totalGcTimeMs += Math.max(0, gc.getCollectionTime());
        }
        long uptimeMs = Math.max(1, ManagementFactory.getRuntimeMXBean().getUptime());
        report.gcTimePercent = (double) totalGcTimeMs / uptimeMs * 100;

        if (report.gcTimePercent > 5.0) {
            report.gcOverheadRisk = RiskLevel.CRITICAL;
            report.recommendations.put(OomType.GC_OVERHEAD,
                "GC consuming " + String.format("%.1f", report.gcTimePercent)
                + "% of uptime — likely memory leak. Capture heap dump.");
        } else if (report.gcTimePercent > 2.0) {
            report.gcOverheadRisk = RiskLevel.WARNING;
        } else {
            report.gcOverheadRisk = RiskLevel.HEALTHY;
        }

        return report;
    }
}
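
The detector above divides cumulative GC time by total uptime, so after days of uptime a recent spike is diluted. A minimal sketch of a per-window variant follows, assuming it is sampled on a fixed interval. The class name GcWindowSampler and its methods are illustrative and not part of any library.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: GC time as a percentage of one sampling window. */
final class GcWindowSampler {
    private long lastGcMs = totalGcMs();
    private long lastUptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();

    /** Sum of accumulated collection time across all collectors. */
    static long totalGcMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime()); // -1 means "not reported"
        }
        return total;
    }

    /** Pure calculation, kept separate so it can be checked with fixed numbers. */
    static double percent(long prevGcMs, long prevUptimeMs, long nowGcMs, long nowUptimeMs) {
        long windowMs = nowUptimeMs - prevUptimeMs;
        if (windowMs <= 0) {
            return 0.0;
        }
        return 100.0 * (nowGcMs - prevGcMs) / windowMs;
    }

    /** Call once per interval; returns GC overhead since the previous call. */
    double sample() {
        long gcMs = totalGcMs();
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        double pct = percent(lastGcMs, lastUptimeMs, gcMs, uptimeMs);
        lastGcMs = gcMs;
        lastUptimeMs = uptimeMs;
        return pct;
    }
}
```

With a 60-second interval, 3 seconds of accumulated GC time shows up as 5% for that window regardless of how long the process has been running.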

The Five OOM Types — Each Needs a Different Diagnostic

Heap space: heap dump (jmap, -XX:+HeapDumpOnOutOfMemoryError). Look at dominator tree for leak suspects.
Metaspace: classloader analysis (jcmd <pid> VM.classloader_stats). Look for classloaders with high class counts that should have been unloaded.
Direct buffer: Native Memory Tracking (start the JVM with -XX:NativeMemoryTracking=detail, then run jcmd <pid> VM.native_memory summary). Look for buffer allocation without corresponding release.
GC overhead: heap dump + GC log analysis. The leak is in old gen — look for objects that survive full GC.
Stack overflow: thread dump (jstack). Look for repeating method signatures indicating infinite recursion.
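
The mapping in the list above can be sketched as a small lookup on the error itself. This is an illustration only: OomClassifier is a hypothetical name, and the message fragments are the ones HotSpot commonly uses, whose exact wording can vary between JDK versions.

```java
import java.util.Locale;

/** Hypothetical classifier: maps a memory-related error to the diagnostic named in the list above. */
final class OomClassifier {
    static String diagnosticFor(Throwable t) {
        if (t instanceof StackOverflowError) {
            return "thread dump: look for a repeating frame";
        }
        if (!(t instanceof OutOfMemoryError)) {
            return "not a memory error";
        }
        // Match on lower-cased fragments because message wording differs slightly across JDKs
        String msg = t.getMessage() == null ? "" : t.getMessage().toLowerCase(Locale.ROOT);
        if (msg.contains("java heap space")) {
            return "heap dump: dominator tree";
        }
        if (msg.contains("gc overhead limit exceeded")) {
            return "heap dump + GC logs";
        }
        if (msg.contains("metaspace")) {
            return "jcmd VM.classloader_stats";
        }
        if (msg.contains("direct buffer memory")) {
            return "Native Memory Tracking";
        }
        return "unrecognized OutOfMemoryError message";
    }
}
```

A container OOM kill never reaches code like this, because the process is terminated from outside before any error is thrown.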

Production Insight

A microservices team spent 3 days debugging a Metaspace OOM by increasing MaxMetaspaceSize from 256MB to 1GB. The OOM returned after 2 days. The real issue was a classloader leak caused by a reflection-based plugin system that cached Class objects in a static HashMap. Each redeployment loaded new classes but the old Class references were never released. The static HashMap grew indefinitely.

Cause: static HashMap caching Class objects from dynamically loaded classloaders. Effect: old classloaders could not be GC'd because the static map held references. Metaspace grew by 50MB per redeployment. Impact: service crashed every 2-3 days. Action: replaced static HashMap with WeakHashMap, added classloader leak detection using -verbose:class. Result: Metaspace stabilized at 80MB, no further OOMs.

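Trade-off: WeakHashMap entries can be collected at any time once the key is no longer strongly referenced elsewhere, which means cached Class lookups may return null. Note that WeakHashMap holds only its keys weakly: the fix works only if the value does not strongly reference the key (for example, key by ClassLoader and wrap cached Class objects in WeakReference). The team added a fallback path that reloads the class if the WeakHashMap entry was collected. Performance impact: ~0.1ms per cache miss, acceptable for a plugin system.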
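
A minimal sketch of such a cache follows, assuming the plugin system looks classes up by name per classloader. WeakClassCache and its methods are hypothetical names, not the team's actual code.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical plugin class cache that does not pin classloaders. */
final class WeakClassCache {
    // Key: the loader, held weakly. Value: name -> weakly referenced Class,
    // so nothing in the value keeps the key strongly reachable.
    private final Map<ClassLoader, Map<String, WeakReference<Class<?>>>> cache = new WeakHashMap<>();

    synchronized Class<?> lookup(ClassLoader loader, String className) {
        Map<String, WeakReference<Class<?>>> perLoader =
            cache.computeIfAbsent(loader, l -> new ConcurrentHashMap<>());
        WeakReference<Class<?>> ref = perLoader.get(className);
        Class<?> cached = (ref == null) ? null : ref.get();
        if (cached != null) {
            return cached;
        }
        // Fallback path: never cached, or the entry was collected, so load again
        try {
            Class<?> loaded = Class.forName(className, false, loader);
            perLoader.put(className, new WeakReference<Class<?>>(loaded));
            return loaded;
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Class not found: " + className, e);
        }
    }

    synchronized int loaderCount() {
        return cache.size();
    }
}
```

Once a plugin's classloader is no longer referenced anywhere else, its whole entry becomes eligible for collection, which is the property the static HashMap lacked.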

Key Takeaway

Five OOM types, five different causes, five different diagnostics. Match the error message to the correct tool before debugging. Heap dump for heap space, classloader stats for metaspace, NMT for direct memory, GC logs for overhead, thread dump for stack overflow.

Heap Dump Analysis: Finding the Leak

A heap dump is a snapshot of every object in the JVM heap at a point in time. It is the single most important diagnostic artifact for heap OOM. Without it, you are guessing. With it, you can identify the exact object, its reference chain to GC root, and its retained size.

The key concept is the dominator tree. In a heap dump, object A dominates object B if every path from GC roots to B goes through A. The dominator tree shows which objects retain the most memory. The top entries in the dominator tree are your leak suspects.

Eclipse MAT (Memory Analyzer Tool) is the standard tool for heap dump analysis. The three reports that matter most: Leak Suspects Report (automated analysis), Dominator Tree (manual exploration), and Histogram (object count by type).

The Leak Suspects Report is the starting point. It identifies objects with unusually high retained size and shows the reference chain from GC root. If the report identifies a single suspect consuming 60%+ of heap, you have found the leak.

But the automated report does not always find the leak. Some leaks are distributed — no single object dominates, but thousands of small objects accumulate. In this case, use the Histogram to find object types with unexpectedly high counts. Compare with a second heap dump taken 1 hour later. The type with the fastest-growing count is the leak source.

Production insight: always take at least two heap dumps, 30-60 minutes apart. A single dump shows the current state. Two dumps show the trend. The trend is what reveals leaks.
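
The two-dump comparison can also be done on plain-text class histograms. The sketch below assumes rows shaped like the output of jmap -histo (rank, instance count, bytes, class name). HistogramDiff is a hypothetical helper, not a MAT feature.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: diff two class histograms and rank classes by instance growth. */
final class HistogramDiff {
    // Assumed row shape: "   1:      123456     7890123  java.lang.String"
    private static final Pattern ROW =
        Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)");

    /** Returns class name -> instance count; header and total lines are skipped. */
    static Map<String, Long> parseCounts(String histogramText) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : histogramText.split("\\R")) {
            Matcher m = ROW.matcher(line);
            if (m.find()) {
                counts.merge(m.group(3), Long.parseLong(m.group(1)), Long::sum);
            }
        }
        return counts;
    }

    /** Classes sorted by instance-count increase, largest first; shrinking classes are dropped. */
    static List<Map.Entry<String, Long>> fastestGrowing(
            Map<String, Long> before, Map<String, Long> after) {
        List<Map.Entry<String, Long>> growth = new ArrayList<>();
        for (Map.Entry<String, Long> e : after.entrySet()) {
            long delta = e.getValue() - before.getOrDefault(e.getKey(), 0L);
            if (delta > 0) {
                growth.add(Map.entry(e.getKey(), delta));
            }
        }
        growth.sort((x, y) -> Long.compare(y.getValue(), x.getValue()));
        return growth;
    }
}
```

The first entry of the returned list is the strongest leak candidate. Dividing its delta by the time between captures gives a growth rate in objects per hour.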

Heap dump caveat: taking a heap dump pauses the JVM (full stop-the-world) for the duration of the dump. For a 4GB heap, this can be 10-30 seconds. For a 32GB heap, it can be several minutes. Never take a heap dump on a production system during peak traffic without understanding the pause impact. Use jmap -dump:live,format=b,file=heap.hprof <pid> to force a full GC first and capture only live objects, reducing dump size.

Alternative for large heaps: use JFR allocation profiling (-XX:StartFlightRecording=settings=profile) to capture allocation patterns without a full heap dump. JFR adds less than 2% overhead and can run continuously in production. It does not show object graphs, but it shows which code is allocating the most memory.

Performance trade-off: with the live option, dump time scales with the amount of live data rather than the configured heap size. A 16GB heap with 2GB live objects dumps faster than an 8GB heap with 6GB live objects. Use -XX:+HeapDumpOnOutOfMemoryError (auto-dump on OOM) and -XX:HeapDumpPath=/var/log/jvm/ to ensure dumps are captured even during unattended failures.
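
On HotSpot-based JDKs a dump can also be triggered from inside the process through HotSpotDiagnosticMXBean, which avoids depending on the jmap binary being present in the container image. The sketch below is illustrative: InProcessHeapDump and the file naming scheme are assumptions, and the same stop-the-world pause applies.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: trigger a heap dump from inside the JVM (HotSpot-based JDKs only). */
final class InProcessHeapDump {
    /** Builds a unique target path; dumpHeap fails if the file already exists. */
    static Path targetPath(String directory, String label, long epochSeconds) {
        return Paths.get(directory, "heap-" + label + "-" + epochSeconds + ".hprof");
    }

    /** live=true dumps only reachable objects, like jmap -dump:live. Pauses the JVM while it runs. */
    static void dump(Path target, boolean live) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(target.toString(), live);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as InProcessHeapDump.dump(InProcessHeapDump.targetPath("/var/log/jvm", "manual", System.currentTimeMillis() / 1000), true) writes the file into the directory used in the flags above.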

heap_dump_analyzer.java

package io.thecodeforge.diagnostics;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Proactive heap monitor that captures dumps when memory
 * growth rate indicates a leak — before OOM occurs.
 */
public class ProactiveHeapMonitor {

    private final ScheduledExecutorService scheduler;
    private final List<Snapshot> history;
    private final long maxHeapMB;
    private final double growthRateThresholdMBPerHour;
    private final String dumpDirectory;

    public ProactiveHeapMonitor(
            long maxHeapMB,
            double growthRateThresholdMBPerHour,
            String dumpDirectory
    ) {
        this.maxHeapMB = maxHeapMB;
        this.growthRateThresholdMBPerHour = growthRateThresholdMBPerHour;
        this.dumpDirectory = dumpDirectory;
        this.history = new ArrayList<>();
        this.scheduler = Executors.newSingleThreadScheduledExecutor(
            r -> {
                Thread t = new Thread(r, "heap-monitor");
                t.setDaemon(true);
                return t;
            }
        );
    }

    public void start(long intervalSeconds) {
        scheduler.scheduleAtFixedRate(
            this::checkMemory,
            intervalSeconds,
            intervalSeconds,
            TimeUnit.SECONDS
        );
    }

    private void checkMemory() {
        MemoryMXBean memBean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heapUsage = memBean.getHeapMemoryUsage();
        long usedMB = heapUsage.getUsed() / (1024 * 1024);
        Instant now = Instant.now();

        history.add(new Snapshot(now, usedMB));

        // Keep only last 24 hours of snapshots
        Instant cutoff = now.minusSeconds(86400);
        history.removeIf(s -> s.timestamp.isBefore(cutoff));

        // Check absolute threshold
        double usagePercent = (double) usedMB / maxHeapMB;
        if (usagePercent > 0.85) {
            logWarning("Heap usage at " + (int)(usagePercent * 100)
                + "% (" + usedMB + "MB / " + maxHeapMB + "MB)");
            if (usagePercent > 0.90) {
                captureHeapDump("high-usage-" + now.getEpochSecond());
            }
        }

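        // Check growth rate (leak detection).
        // Caveat: getUsed() includes garbage that has not been collected yet, so this
        // trend is noisy; a post-GC baseline is a more reliable leak signal.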
        if (history.size() >= 2) {
            Snapshot oldest = history.get(0);
            Snapshot newest = history.get(history.size() - 1);
            double hoursElapsed = (newest.timestamp.toEpochMilli()
                - oldest.timestamp.toEpochMilli()) / 3_600_000.0;
            if (hoursElapsed > 0.5) {
                double growthRateMBPerHour = (newest.usedMB - oldest.usedMB)
                    / hoursElapsed;
                if (growthRateMBPerHour > growthRateThresholdMBPerHour) {
                    logWarning("Heap growth rate: " + growthRateMBPerHour
                        + " MB/hour — possible leak");
                    captureHeapDump("leak-suspect-" + now.getEpochSecond());
                }
            }
        }
    }

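    // Minimum gap between dumps: each dump pauses the JVM, so never dump on every check
    private static final long DUMP_COOLDOWN_SECONDS = 3600;
    private long lastDumpEpochSecond = 0;

    private void captureHeapDump(String label) {
        long nowSec = Instant.now().getEpochSecond();
        if (nowSec - lastDumpEpochSecond < DUMP_COOLDOWN_SECONDS) {
            return; // a recent dump already exists
        }
        lastDumpEpochSecond = nowSec;
        String filename = dumpDirectory + "/heap-" + label + ".hprof";
        try {
            String pid = ManagementFactory.getRuntimeMXBean().getName()
                .split("@")[0];
            ProcessBuilder pb = new ProcessBuilder(
                "jmap", "-dump:live,format=b,file=" + filename, pid
            );
            // Discard jmap output (Java 9+) so a full pipe can never block the child process
            pb.redirectErrorStream(true);
            pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
            Process p = pb.start();
            int exitCode = p.waitFor();
            if (exitCode == 0) {
                logWarning("Heap dump captured: " + filename);
            } else {
                logWarning("Heap dump failed with exit code: " + exitCode);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            logWarning("Heap dump interrupted");
        } catch (Exception e) {
            logWarning("Heap dump failed: " + e.getMessage());
        }
    }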

    private void logWarning(String message) {
        System.err.println("[HeapMonitor] " + Instant.now() + " " + message);
    }

    private static class Snapshot {
        final Instant timestamp;
        final long usedMB;
        Snapshot(Instant timestamp, long usedMB) {
            this.timestamp = timestamp;
            this.usedMB = usedMB;
        }
    }
}

Two Dumps Beat One — Trend Analysis Reveals Leaks

Single dump: shows what is in the heap now. Useful for finding large objects. Cannot distinguish leak from legitimate usage.
Two dumps: show what is growing. The object type with the fastest-growing count is the leak source.
Dominator tree: shows which objects retain the most memory. Top entries are leak suspects.
Leak Suspects Report: automated MAT analysis. Good starting point. Fails on distributed leaks (many small objects).
Histogram comparison: export histograms from both dumps, diff them. The type with the largest count increase is the leak.

Production Insight

A recommendation engine service used 12GB of its 16GB heap. The team took a single heap dump and found no single object dominating memory — the largest retained object was 200MB. They concluded the heap was simply too small and requested 32GB from infrastructure.

A senior engineer took two dumps 45 minutes apart and compared histograms. The count of io.thecodeforge.model.CachedRecommendation objects grew from 8.2 million to 8.7 million in 45 minutes — 666,000 new objects/hour, each ~1.2KB. The leak was distributed across millions of small objects, invisible in a single dump's dominator tree.

Cause: recommendation cache had no eviction policy. Each unique user+product combination created a CachedRecommendation that was never removed. Effect: 666K new objects/hour, ~800MB/hour growth. Impact: OOM every 20 hours. Action: added Caffeine cache with expireAfterWrite(1, TimeUnit.HOURS) and maximumSize(5_000_000). Result: steady-state heap dropped to 4GB, no OOM.

Key insight: single dump analysis missed this leak entirely because no single object dominated. Two-dump histogram comparison revealed it in minutes.

Key Takeaway

Two heap dumps 30-60 minutes apart reveal leaks that a single dump cannot. Compare histograms to find the fastest-growing object type. The dominator tree finds large objects; histogram comparison finds distributed leaks.

GC Tuning: Collector Selection and Parameter Optimization

GC tuning is about trade-offs: throughput vs latency, pause time vs frequency, memory efficiency vs allocation speed. There is no universal best setting — the right configuration depends on your workload profile.

The four production GC collectors:

G1GC (default since JDK 9): balanced throughput and latency. Good default for most services. Tuning targets: -XX:MaxGCPauseMillis (default 200ms), -XX:G1HeapRegionSize, -XX:InitiatingHeapOccupancyPercent.

ZGC (production-ready since JDK 15): very short pause times that do not grow with heap size, typically below 1 ms on JDK 16 and later. Best for latency-sensitive services (trading, real-time). Trade-off: slightly lower throughput, higher CPU usage for concurrent GC threads.

Shenandoah (JDK 12+): similar to ZGC — low pause times, concurrent compaction. Trade-off: same as ZGC. Choose based on JDK vendor support.

Parallel GC: highest throughput, longest pauses. Best for batch processing where latency does not matter. Not recommended for interactive services.

The most common GC tuning mistake: switching collectors without understanding the workload. A team switched from G1GC to ZGC because they read it was 'faster.' Their service was a batch ETL pipeline that did not care about pause times. ZGC's extra CPU overhead reduced throughput by 8% for zero benefit.

Rule of thumb: if your service is latency-sensitive (p99 < 100ms), use ZGC or Shenandoah. If throughput matters more than latency, use Parallel GC. For everything else, G1GC is the right default.

Humongous allocations are a G1GC-specific problem. Objects of at least 50% of a G1 region are classified as humongous (the default region size is roughly the heap size divided by 2048, rounded to a power of two between 1MB and 32MB). They are allocated directly in old gen in contiguous regions and are reclaimed less readily than ordinary young objects: at best during a young collection via eager reclaim, otherwise after a marking cycle or a full GC. If your service allocates many large byte arrays or StringBuilders, humongous allocations fragment old gen, trigger early concurrent cycles, and can end in full GC storms.

Fix: increase -XX:G1HeapRegionSize to raise the humongous threshold so that fewer objects qualify, or refactor code to avoid large contiguous allocations. Check GC logs for 'G1 Humongous Allocation' lines.
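
To see whether a given allocation size is at risk, the region arithmetic can be worked through in code. This is an approximation under stated assumptions: -Xms equals -Xmx, -XX:G1HeapRegionSize is not set explicitly, and the ergonomics match the commonly documented "heap / 2048, power of two, 1MB to 32MB" rule. G1RegionMath is a hypothetical name.

```java
/** Hypothetical calculator for G1's default region size and humongous threshold. */
final class G1RegionMath {
    static final long MB = 1024L * 1024L;

    /** Approximates the default: heap / 2048, rounded down to a power of two, clamped to 1MB..32MB. */
    static long defaultRegionBytes(long heapBytes) {
        long target = Math.max(heapBytes / 2048, 1);
        long powerOfTwo = Long.highestOneBit(target);
        return Math.min(Math.max(powerOfTwo, 1 * MB), 32 * MB);
    }

    /** Objects at or above half a region are treated as humongous. */
    static long humongousThresholdBytes(long regionBytes) {
        return regionBytes / 2;
    }

    static boolean isHumongous(long objectBytes, long heapBytes) {
        return objectBytes >= humongousThresholdBytes(defaultRegionBytes(heapBytes));
    }
}
```

Under these assumptions an 8GB heap gets 4MB regions, so a 3MB byte array is humongous; setting -XX:G1HeapRegionSize=8m would move the threshold to 4MB and the same array would be allocated normally.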

GC log analysis is essential. Enable GC logging with -Xlog:gc*:file=gc.log:time,uptime,level,tags:filecount=5,filesize=100M (JDK 11+). Key metrics to monitor: GC pause duration (max, p99, p95), GC frequency (pauses per minute), allocation rate (MB/sec), promotion rate (young gen to old gen MB/sec), and old gen usage after GC.
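
Allocation rate is not printed directly in the log; it has to be derived from consecutive pauses. A minimal sketch of that arithmetic follows, assuming each pause has already been parsed into uptime plus heap-before and heap-after values. AllocationRate and GcEvent are hypothetical names.

```java
import java.util.List;

/** Hypothetical helper: derive allocation rate from consecutive GC pauses. */
final class AllocationRate {
    /** One pause as read from a GC log line: uptime plus heap before/after in MB. */
    static final class GcEvent {
        final double uptimeSec;
        final long beforeMB;
        final long afterMB;

        GcEvent(double uptimeSec, long beforeMB, long afterMB) {
            this.uptimeSec = uptimeSec;
            this.beforeMB = beforeMB;
            this.afterMB = afterMB;
        }
    }

    /** Total MB allocated between pauses divided by the elapsed seconds across the run. */
    static double mbPerSecond(List<GcEvent> events) {
        if (events.size() < 2) {
            return 0.0;
        }
        long allocatedMB = 0;
        for (int i = 1; i < events.size(); i++) {
            // Whatever the heap gained since the previous pause ended was newly allocated
            allocatedMB += Math.max(0, events.get(i).beforeMB - events.get(i - 1).afterMB);
        }
        double elapsedSec = events.get(events.size() - 1).uptimeSec - events.get(0).uptimeSec;
        return elapsedSec > 0 ? allocatedMB / elapsedSec : 0.0;
    }
}
```

For pauses at 10s (100M->50M), 12s (250M->60M) and 14s (260M->70M), the heap gained 200MB before each of the last two pauses, so the rate is 400MB over 4 seconds, or 100 MB/sec.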

Production insight: the most impactful GC parameter is often not the collector itself, but the heap size relative to live data. If your live data set is 2GB and your heap is 8GB, GC has plenty of room to work. If your live data set is 6GB and your heap is 8GB, GC is constantly under pressure. Right-sizing the heap matters more than collector selection.

Edge case: containerized JVMs with cgroup memory limits. Before JDK 10 (and before the 8u191 backport), the JVM did not respect cgroup limits and sized the heap from host memory. JDK 10+ respects cgroup limits by default. Always verify with java -XX:+PrintFlagsFinal -version | grep MaxHeapSize, run inside the container, that the JVM sees the correct memory limit.
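
The same check can be made from inside the application at startup. The sketch below is a hedged illustration: JvmLimitsCheck is a hypothetical class, and the container limit is assumed to be passed in from deployment configuration rather than read from cgroup files.

```java
/** Hypothetical startup check: compare the JVM's heap ceiling with the container limit. */
final class JvmLimitsCheck {
    /** The maximum heap the JVM will attempt to use, in MB. */
    static long maxHeapMB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    /** True when the heap ceiling stays within the given share of the container limit. */
    static boolean leavesHeadroom(long maxHeapMB, long containerLimitMB, double maxHeapShare) {
        return maxHeapMB <= containerLimitMB * maxHeapShare;
    }

    /** One-line summary suitable for a startup log. */
    static String describe(long containerLimitMB, double maxHeapShare) {
        long heap = maxHeapMB();
        return "maxHeapMB=" + heap
            + " containerLimitMB=" + containerLimitMB
            + " headroomOk=" + leavesHeadroom(heap, containerLimitMB, maxHeapShare);
    }
}
```

If headroomOk is false in the startup log, the heap is sized too close to the container limit and off-heap memory is likely to push the process into an OOM kill.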

gc_analyzer.java

package io.thecodeforge.monitoring;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * GC Log Analyzer — parses GC logs and extracts key metrics
 * for production tuning decisions.
 */
public class GcLogAnalyzer {

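    // JDK 11+ unified GC log format, for example:
    // [2026-04-04T10:15:30.123+0000][12.345s][info][gc] GC(5) Pause Young (Normal) (G1 Evacuation Pause) 100M->50M(256M) 12.345ms
    private static final Pattern GC_PAUSE_PATTERN = Pattern.compile(
        // The time decorator carries a UTC offset such as +0000, so '+' must be allowed.
        // The level decorator may be space-padded, hence the \s* before its closing bracket.
        "\\[(?<timestamp>[\\d\\-T:.+]+)\\]\\[(?<uptime>[\\d.]+)s\\]\\[(?<level>\\w+)\\s*\\]"
        + ".*GC\\((?<gcId>\\d+)\\) Pause (?<type>Young|Full|Mixed)"
        + ".*?(?<durationMs>[\\d.]+)ms"
    );

    // Heap transition "before->after(total)". Unified logging prints M by default,
    // so the unit is captured instead of assuming K.
    private static final Pattern HEAP_PATTERN = Pattern.compile(
        "(?<used>\\d+)(?<usedUnit>[KMG])->(?<after>\\d+)(?<afterUnit>[KMG])"
        + "\\((?<total>\\d+)[KMG]\\)"
    );

    public static class GcMetrics {
        public int totalGcPauses;
        public int youngGcCount;
        public int fullGcCount;
        public int mixedGcCount;
        public double maxPauseMs;
        public double p99PauseMs;
        public double p95PauseMs;
        public double avgPauseMs;
        public double totalPauseMs;
        public double gcTimePercent;
        public long maxHeapUsedKB;
        public long minHeapAfterGcKB;
        public List<Double> pauseTimes = new ArrayList<>();
    }

    // Convert a K/M/G-suffixed value to KB
    private static long toKB(long value, String unit) {
        switch (unit) {
            case "G": return value * 1024 * 1024;
            case "M": return value * 1024;
            default:  return value;
        }
    }

    public static GcMetrics analyze(String gcLogFile) throws IOException {
        GcMetrics metrics = new GcMetrics();

        try (BufferedReader reader = new BufferedReader(
                new FileReader(gcLogFile))) {
            String line;
            while ((line = reader.readLine()) != null) {
                Matcher pauseMatcher = GC_PAUSE_PATTERN.matcher(line);
                if (!pauseMatcher.find()) {
                    // Only pause lines are analyzed. Metaspace lines can also contain
                    // "K->K(K)" and must not be counted as heap usage.
                    continue;
                }
                double duration = Double.parseDouble(
                    pauseMatcher.group("durationMs"));
                String type = pauseMatcher.group("type");
                // JDK 11+ G1 logs mixed collections as "Pause Young (Mixed)"
                if ("Young".equals(type) && line.contains("(Mixed)")) {
                    type = "Mixed";
                }

                metrics.totalGcPauses++;
                metrics.pauseTimes.add(duration);
                metrics.totalPauseMs += duration;

                switch (type) {
                    case "Young": metrics.youngGcCount++; break;
                    case "Full":  metrics.fullGcCount++; break;
                    case "Mixed": metrics.mixedGcCount++; break;
                }

                if (duration > metrics.maxPauseMs) {
                    metrics.maxPauseMs = duration;
                }

                // Heap before/after is read from the same pause line
                Matcher heapMatcher = HEAP_PATTERN.matcher(line);
                if (heapMatcher.find()) {
                    long used = toKB(Long.parseLong(heapMatcher.group("used")),
                        heapMatcher.group("usedUnit"));
                    long after = toKB(Long.parseLong(heapMatcher.group("after")),
                        heapMatcher.group("afterUnit"));
                    if (used > metrics.maxHeapUsedKB) {
                        metrics.maxHeapUsedKB = used;
                    }
                    if (metrics.minHeapAfterGcKB == 0
                            || after < metrics.minHeapAfterGcKB) {
                        metrics.minHeapAfterGcKB = after;
                    }
                }
            }
        }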

        // Calculate percentiles
        if (!metrics.pauseTimes.isEmpty()) {
            metrics.pauseTimes.sort(Double::compareTo);
            int size = metrics.pauseTimes.size();
            metrics.avgPauseMs = metrics.totalPauseMs / size;
            metrics.p95PauseMs = metrics.pauseTimes.get((int)(size * 0.95));
            metrics.p99PauseMs = metrics.pauseTimes.get((int)(size * 0.99));
        }

        return metrics;
    }

    public static String generateReport(GcMetrics m) {
        StringBuilder sb = new StringBuilder();
        sb.append("=== GC Analysis Report ===\n");
        sb.append("Total GC pauses: ").append(m.totalGcPauses).append("\n");
        sb.append("Young GC: ").append(m.youngGcCount).append("\n");
        sb.append("Full GC: ").append(m.fullGcCount).append("\n");
        sb.append("Mixed GC: ").append(m.mixedGcCount).append("\n");
        sb.append("Max pause: ").append(m.maxPauseMs).append(" ms\n");
        sb.append("P99 pause: ").append(m.p99PauseMs).append(" ms\n");
        sb.append("P95 pause: ").append(m.p95PauseMs).append(" ms\n");
        sb.append("Avg pause: ").append(String.format("%.2f", m.avgPauseMs)).append(" ms\n");
        sb.append("Max heap used: ").append(m.maxHeapUsedKB / 1024).append(" MB\n");
        sb.append("Min heap after GC: ").append(m.minHeapAfterGcKB / 1024).append(" MB\n");

        // Warnings
        if (m.fullGcCount > 0) {
            sb.append("WARNING: Full GC detected — investigate old gen pressure\n");
        }
        if (m.p99PauseMs > 200) {
            sb.append("WARNING: P99 pause > 200ms — consider ZGC or Shenandoah\n");
        }
        if (m.minHeapAfterGcKB > 0) {
            long liveDataMB = m.minHeapAfterGcKB / 1024;
            sb.append("INFO: Live data set ~").append(liveDataMB).append(" MB\n");
            sb.append("INFO: Recommended heap (2x live data): ")
                .append(liveDataMB * 2).append(" MB\n");
        }

        return sb.toString();
    }
}

The GC Trade-off Triangle

Throughput (Parallel GC): minimize time spent in GC relative to application work. Best for batch processing. Long pauses are acceptable.
Latency (ZGC/Shenandoah): minimize individual GC pause times. Best for real-time services. Higher CPU overhead is acceptable.
Memory efficiency (G1GC): balance between throughput and latency with moderate memory overhead. Best default for most services.
Humongous allocations: objects of at least 50% of the G1 region size are allocated directly in old gen and can trigger full GC. Increase region size or refactor large allocations.
Container awareness: JDK 10+ (and 8u191+) respects cgroup limits. Always verify with PrintFlagsFinal. Older releases ignore container memory limits.

Production Insight

A trading platform used G1GC with 32GB heap. During market open, GC pauses reached 400ms — causing order processing delays and regulatory violations. The team tuned G1GC parameters for 3 weeks, reducing pauses to 250ms. Still not good enough.

Switching to ZGC reduced pauses to 0.8ms consistently. The trade-off: ZGC used 15% more CPU for concurrent GC threads. The platform had spare CPU capacity, so this was acceptable.

Cause: G1GC stop-the-world pauses during concurrent marking. Effect: 400ms pauses during peak allocation rate. Impact: order processing delays, regulatory SLA violations. Action: switched to ZGC with -XX:+UseZGC -Xmx32g -XX:ConcGCThreads=4. Result: 0.8ms p99 pauses, 15% CPU increase, zero SLA violations.

Trade-off: if the platform had been CPU-bound, ZGC's overhead would have been unacceptable. The fix worked because CPU was the cheaper resource to trade for latency. Always profile CPU usage before switching collectors.

Key Takeaway

GC tuning is about trade-offs: throughput vs latency vs memory. Batch = Parallel GC. Real-time = ZGC. Everything else = G1GC. Full GC is always a problem — find the root cause.

GC Collector Selection

If: Service requires p99 latency < 50ms → Use: ZGC (JDK 15+) or Shenandoah (JDK 12+). Sub-millisecond pauses. Accept higher CPU overhead.

If: Service is a batch job or ETL pipeline → Use: Parallel GC. Highest throughput. Long pauses are acceptable since there is no user waiting.

If: General-purpose web service or API → Use: G1GC (JDK 9+ default). Tune MaxGCPauseMillis to your SLA. Good balance of throughput and latency.

Memory Leak Patterns and Detection

Memory leaks in Java are objects that are no longer needed but remain referenced, preventing garbage collection. Unlike C/C++ leaks (allocated memory that is never freed and can no longer be reached), Java leaks are reachable objects that should be unreachable.

The five most common leak patterns in production:

Unbounded collections: Maps, Lists, or Sets that grow without limit. The #1 cause of heap OOM. Fix: use bounded caches (Caffeine, Guava) with TTL and maximumSize.

Listener/callback registration without deregistration: registering event listeners that hold references to the subscriber object. When the subscriber should be GC'd, the listener reference keeps it alive. Fix: always deregister in close()/destroy() methods.

ThreadLocal without cleanup: ThreadLocal values persist for the lifetime of the thread, and in thread pools threads are reused for as long as the pool lives. A value that is set and never removed stays reachable, along with everything it references; if the value is itself a collection that keeps being appended to, it grows without bound. Fix: call threadLocal.remove() in a finally block after use.

ClassLoader leaks: in hot-redeploy environments, old classloaders remain referenced by static fields or thread-locals. The classloader cannot be GC'd, and neither can any of the classes it loaded. Fix: avoid static references to classes from dynamic classloaders. Use WeakReference or ServiceLoader patterns.

String.intern() abuse: String.intern() stores strings in the string pool (PermGen before JDK 7, the main heap since JDK 7). Interning user-generated strings can create an unbounded pool. Fix: never intern user input. Use a bounded cache with eviction instead.
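
For the first pattern, a size bound can be sketched with the JDK alone. This is a minimal illustration rather than a replacement for Caffeine or Guava: BoundedLruCache is a hypothetical name, it has no TTL, and it is not thread-safe unless wrapped (for example with Collections.synchronizedMap).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical size-bounded LRU cache using only the JDK. Not thread-safe on its own. */
final class BoundedLruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    BoundedLruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true: the eldest entry is the least recently used
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict once the bound is exceeded
    }
}
```

Because the map can never hold more than maxEntries entries, its worst-case heap footprint is known up front, which is exactly what an unbounded HashMap cannot offer.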

Detection strategy: the sawtooth test. Monitor heap usage over time. A healthy JVM shows a sawtooth pattern — heap rises during allocation, drops after GC, returns to the same baseline. A leak shows the same sawtooth, but the baseline after GC increases over time. The post-GC baseline is the key metric.

Production tool: Java Flight Recorder (JFR) with allocation profiling. JFR samples allocations together with their call stacks rather than recording every one. Enable with -XX:StartFlightRecording=settings=profile,duration=60s,filename=alloc.jfr. Analyze with JDK Mission Control (JMC): the 'Allocation by Thread' and 'Allocation by Class' views show where memory is being allocated.

Edge case: soft reference accumulation. The JVM collects SoftReferences only when heap pressure is high. If your cache uses SoftReferences, it will consume all available heap before releasing entries. This is by design, but it makes heap appear full even when it is not leaking. Switch to WeakReference or use a proper cache library with size-based eviction.

Performance consideration: leak detection tools (JFR, MAT) add overhead. JFR adds <2% CPU overhead and can run continuously. MAT analysis requires a heap dump, which pauses the JVM. Use JFR for continuous monitoring and MAT for post-mortem analysis.

leak_detector.java

package io.thecodeforge.diagnostics;

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.GarbageCollectorMXBean;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Memory Leak Detector — monitors old gen growth rate
 * to detect leaks before OOM occurs.
 *
 * Core insight: a leak shows as increasing old gen usage
 * after each full GC. The post-GC baseline is the key metric.
 */
public class MemoryLeakDetector {

    private final ScheduledExecutorService scheduler;
    private final List<OldGenSnapshot> snapshots;
    private final double alertThresholdMBPerHour;
    private final LeakAlertHandler alertHandler;

    public interface LeakAlertHandler {
        void onLeakDetected(double growthRateMBPerHour,
                           long currentOldGenMB,
                           String recommendation);
    }

    public MemoryLeakDetector(
            double alertThresholdMBPerHour,
            LeakAlertHandler alertHandler
    ) {
        this.alertThresholdMBPerHour = alertThresholdMBPerHour;
        this.alertHandler = alertHandler;
        this.snapshots = new ArrayList<>();
        this.scheduler = Executors.newSingleThreadScheduledExecutor(
            r -> {
                Thread t = new Thread(r, "leak-detector");
                t.setDaemon(true);
                return t;
            }
        );
    }

    public void start(long intervalSeconds) {
        scheduler.scheduleAtFixedRate(
            this::sampleOldGen,
            intervalSeconds,
            intervalSeconds,
            TimeUnit.SECONDS
        );
    }

    private void sampleOldGen() {
        long oldGenUsedMB = getOldGenUsedMB();
        Instant now = Instant.now();

        snapshots.add(new OldGenSnapshot(now, oldGenUsedMB));

        // Keep only last 6 hours
        Instant cutoff = now.minusSeconds(21600);
        snapshots.removeIf(s -> s.timestamp.isBefore(cutoff));

        // Need a minimum number of samples; the 30-minute minimum window
        // itself is enforced by the hoursElapsed check below
        if (snapshots.size() < 6) return;

        // Calculate growth rate
        OldGenSnapshot oldest = snapshots.get(0);
        OldGenSnapshot newest = snapshots.get(snapshots.size() - 1);
        double hoursElapsed = (newest.timestamp.toEpochMilli()
            - oldest.timestamp.toEpochMilli()) / 3_600_000.0;

        if (hoursElapsed < 0.5) return;

        double growthRateMBPerHour = (newest.usedMB - oldest.usedMB)
            / hoursElapsed;

        if (growthRateMBPerHour > alertThresholdMBPerHour) {
            String recommendation = buildRecommendation(
                growthRateMBPerHour, newest.usedMB);
            alertHandler.onLeakDetected(
                growthRateMBPerHour, newest.usedMB, recommendation);
        }
    }

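    private long getOldGenUsedMB() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Old") || name.contains("Tenured")) {
                // Prefer usage measured right after the last collection of this pool:
                // that is the post-GC baseline. It is null if unsupported and may
                // report 0 until the pool has been collected at least once.
                MemoryUsage afterGc = pool.getCollectionUsage();
                if (afterGc != null && afterGc.getUsed() > 0) {
                    return afterGc.getUsed() / (1024 * 1024);
                }
                return pool.getUsage().getUsed() / (1024 * 1024);
            }
        }
        // Fallback: use heap usage
        MemoryMXBean memBean = ManagementFactory.getMemoryMXBean();
        return memBean.getHeapMemoryUsage().getUsed() / (1024 * 1024);
    }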

    private String buildRecommendation(
            double growthRateMBPerHour, long currentOldGenMB
    ) {
        StringBuilder sb = new StringBuilder();
        sb.append("Memory leak detected. ");
        sb.append("Growth rate: ").append(String.format("%.1f", growthRateMBPerHour));
        sb.append(" MB/hour. ");
        sb.append("Current old gen: ").append(currentOldGenMB).append(" MB. ");
        sb.append("Actions: ");
        sb.append("1) Capture heap dump (jmap -dump:live,format=b,file=heap.hprof). ");
        sb.append("2) Analyze with MAT — check dominator tree and histogram. ");
        sb.append("3) Compare with previous histogram to find growing object types.");
        return sb.toString();
    }

    private static class OldGenSnapshot {
        final Instant timestamp;
        final long usedMB;
        OldGenSnapshot(Instant timestamp, long usedMB) {
            this.timestamp = timestamp;
            this.usedMB = usedMB;
        }
    }
}

The Sawtooth Test — Is It a Leak or Just Load?

Healthy pattern: heap rises to 4GB, GC brings it back to 1.5GB. Next cycle: rises to 4GB, back to 1.5GB. Baseline is stable.
Leak pattern: heap rises to 4GB, GC brings it to 1.5GB. Next cycle: rises to 4GB, back to 1.8GB. Next: back to 2.1GB. Baseline is rising.
Key metric: old gen usage after full GC. Monitor this, not peak heap usage.
Detection: take snapshots every 30 seconds. Calculate growth rate of post-GC baseline. Alert if >5% per hour.
False positive: legitimate cache growth (new data being cached) looks like a leak. Distinguish by checking if the growth stabilizes.
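
The false-positive check in the last item can be sketched as a comparison of early versus late growth in the post-GC baseline. This is a minimal sketch, not a tuned detector: the class name `BaselineTrend`, the `Verdict` values, and the noise threshold are all hypothetical, and the readings are assumed to be evenly spaced.

```java
import java.util.List;

/** Hypothetical helper: classifies a series of post-GC old gen readings. */
final class BaselineTrend {

    enum Verdict { STABLE, STABILIZING, LEAK_SUSPECT }

    /**
     * @param postGcMb       post-GC old gen readings in MB, evenly spaced, oldest first
     * @param noisePerStepMb growth per step below this is treated as noise (assumed tuning value)
     */
    static Verdict classify(List<Double> postGcMb, double noisePerStepMb) {
        int n = postGcMb.size();
        if (n < 4) {
            throw new IllegalArgumentException("need at least 4 readings");
        }
        int mid = n / 2;
        double first = postGcMb.get(0);
        double middle = postGcMb.get(mid);
        double last = postGcMb.get(n - 1);

        // Average growth per step in the earlier and later part of the series
        double earlyRate = (middle - first) / mid;
        double lateRate = (last - middle) / (n - 1 - mid);

        if (earlyRate < noisePerStepMb && lateRate < noisePerStepMb) {
            return Verdict.STABLE;
        }
        // Growth that has flattened out looks like a cache filling up
        if (lateRate < noisePerStepMb) {
            return Verdict.STABILIZING;
        }
        return Verdict.LEAK_SUSPECT;
    }

    public static void main(String[] args) {
        // Rising baseline, as in the leak pattern above (values in MB)
        System.out.println(classify(List.of(1500.0, 1800.0, 2100.0, 2400.0), 50.0));
        // A cache that fills and then levels off
        System.out.println(classify(List.of(1500.0, 1900.0, 2000.0, 2010.0), 50.0));
    }
}
```

A STABILIZING verdict is only a hint to keep watching; a cache without an eviction policy will eventually look like LEAK_SUSPECT again.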

Production Insight

A session management service showed stable memory usage for 6 months. After a feature release, the team noticed heap usage after GC growing at 100MB/hour. They suspected a leak but could not find it in the heap dump — no single object dominated.

The leak was a ThreadLocal in a request filter that stored user context. The filter was called on every request, and the ThreadLocal was set but never removed. In a thread pool, threads live forever, so ThreadLocal values accumulated indefinitely. Each user context was ~2KB. At 50,000 unique users per hour, that was 100MB/hour.

Cause: ThreadLocal.set() without ThreadLocal.remove() in a request filter. Effect: each thread accumulated user contexts for every user it served. Impact: 100MB/hour growth, OOM every 10 hours. Action: added threadLocal.remove() in a finally block after request processing. Result: memory growth dropped to zero.

Why the heap dump did not help: ThreadLocal values are stored in the Thread object's threadLocals map, not in a global collection. The dominator tree showed many Thread objects, each holding a small map. Without knowing to look at ThreadLocal, the dump appeared healthy.
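
The fix described above, clearing the ThreadLocal in a finally block, might look like the sketch below. The holder class, its method names, and the use of a plain String as the user context are hypothetical stand-ins for the filter in the incident.

```java
import java.util.function.Supplier;

/** Hypothetical request-scoped context holder; names are illustrative. */
final class UserContextHolder {

    private static final ThreadLocal<String> CURRENT_USER = new ThreadLocal<>();

    static String currentUser() {
        return CURRENT_USER.get();
    }

    /** Runs one request with the context set, and always clears it afterwards. */
    static <T> T withUser(String userId, Supplier<T> request) {
        CURRENT_USER.set(userId);
        try {
            return request.get();
        } finally {
            // Without this, the pooled thread keeps the value after the request ends
            CURRENT_USER.remove();
        }
    }

    public static void main(String[] args) {
        String seen = withUser("user-42", UserContextHolder::currentUser);
        System.out.println("during request: " + seen);
        System.out.println("after request:  " + currentUser());
    }
}
```

Wrapping set and remove in one method keeps the cleanup next to the assignment, so a later code path cannot forget it.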

Key Takeaway

Monitor post-GC old gen baseline, not peak heap usage. A rising baseline confirms a leak. Take two dumps 30-60 minutes apart and compare histograms. ThreadLocal and unbounded caches are the most common production leak sources.

Memory Leak Detection Strategy

Production JVM Configuration: Flags That Matter

JVM configuration is where most memory incidents are prevented — or caused. The wrong flags make debugging impossible. The right flags make it trivial.

Non-negotiable production flags:

-XX:+HeapDumpOnOutOfMemoryError — captures a heap dump when OOM occurs. Without this, you have no diagnostic data after the crash. Set -XX:HeapDumpPath to a persistent directory (not /tmp in containers — /tmp is often tmpfs and too small).

-Xlog:gc*:file=gc.log:time,uptime,level,tags:filecount=5,filesize=100M — enables GC logging with rotation. Essential for diagnosing GC issues. JDK 11+ syntax. For JDK 8, use -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log.

-XX:+ExitOnOutOfMemoryError — kills the JVM immediately on OOM instead of leaving it in an undefined state. In containerized environments, this ensures the container restarts via the orchestrator. Without this, the JVM may continue running in a degraded state, accepting requests it cannot process.

-XX:MaxRAMPercentage=70.0 — sets max heap as a percentage of container memory. Alternative to -Xmx for containerized deployments. Automatically adjusts when container limits change. Use 70-75% to leave room for off-heap.

Container memory calculation: Container memory = heap (Xmx) + metaspace + thread stacks (Xss × thread count) + direct memory (MaxDirectMemorySize) + native memory (JNI) + OS overhead.

Rule of thumb: set container memory limit to 1.3-1.5x your -Xmx value. For a 4GB heap, set container limit to 5.2-6GB. This covers metaspace (~100-200MB), thread stacks (200 threads × 1MB = 200MB), direct memory (~256MB), and OS overhead (~500MB).
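
The rule of thumb can be checked with simple arithmetic. The sketch below adds up the components listed above for one illustrative service; the class name and every input value are assumptions, and GC bookkeeping and JNI memory are left out, which is why real budgets tend to land higher.

```java
import java.util.Locale;

/** Worked example of the container memory budget; all inputs are illustrative. */
final class ContainerBudget {

    /** Sums heap and the main off-heap components, in MB. */
    static long requiredMb(long heapMb, long metaspaceMb, int threads, long stackKb,
                           long directMb, long osOverheadMb) {
        long threadStacksMb = threads * stackKb / 1024;
        return heapMb + metaspaceMb + threadStacksMb + directMb + osOverheadMb;
    }

    public static void main(String[] args) {
        long heapMb = 4096;
        // 200 MB metaspace, 200 threads x 1 MB stacks, 256 MB direct, 500 MB OS overhead
        long total = requiredMb(heapMb, 200, 200, 1024, 256, 500);
        System.out.println(String.format(Locale.ROOT,
            "required: %d MB (%.2fx heap)", total, (double) total / heapMb));
    }
}
```

With these inputs the total is 5252 MB, close to the low end of the 1.3-1.5x range.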

Thread stack sizing: -Xss sets stack size per thread. Default is 512KB-1MB depending on OS. For services with many threads, this matters. 500 threads × 1MB = 500MB of stack memory. If your call depth is shallow, reduce to -Xss256k. If you have deep recursion, increase to -Xss2m.

Metaspace sizing: -XX:MaxMetaspaceSize limits metaspace growth. Without this limit, metaspace can consume all available native memory. Set it to a reasonable value (256MB-512MB for most services). If you hit the limit in a service whose set of loaded classes should be stable, suspect a classloader leak before raising it.

JFR continuous recording: -XX:StartFlightRecording=settings=default,maxsize=100M,maxage=1h — enables continuous JFR recording with rolling buffer. When an incident occurs, dump the recording with jcmd <pid> JFR.dump. This gives you allocation, GC, and lock profiling data without restarting the service.

Edge case: -XX:+UseCompressedOops is enabled by default for heaps <32GB. It compresses object pointers from 8 bytes to 4 bytes, saving ~20% heap. Above 32GB, compressed oops are disabled and each object pointer costs 8 bytes. This means a 34GB heap may perform worse than a 31GB heap due to pointer size increase. Either stay under 32GB or go significantly above (40GB+).
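
Whether compressed oops are actually in effect for a running JVM can be read at runtime. The sketch below assumes a HotSpot JVM, where `HotSpotDiagnosticMXBean` is available; the class name `CompressedOopsCheck` and the "unknown" fallback are illustrative choices.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Reads HotSpot flags at runtime; assumes a HotSpot-based JVM. */
final class CompressedOopsCheck {

    /** Returns the current value of a HotSpot flag, or "unknown" if it cannot be read. */
    static String flagValue(String flagName) {
        try {
            HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return hotspot.getVMOption(flagName).getValue();
        } catch (RuntimeException e) {
            // Flag does not exist on this JVM, or the bean is not available
            return "unknown";
        }
    }

    public static void main(String[] args) {
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap (MB):     " + maxHeapMb);
        System.out.println("UseCompressedOops: " + flagValue("UseCompressedOops"));
    }
}
```

Running this once with the production heap settings confirms which side of the threshold the service is on before a heap increase is rolled out.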

production_jvm_flags.sh (Bash)

#!/bin/bash
# Production JVM flags for containerized Java services
# Baseline targets JDK 17; the ZGC block uses one JDK 21+ flag (see note there)
#
# Flags are kept in bash arrays so that the comments stay comments.
# Inside a quoted string they would be passed to java as arguments.
# Usage: java "${JVM_BASE_FLAGS[@]}" -jar app.jar

# ============================================================
# BASELINE CONFIGURATION (G1GC, suitable for most services)
# ============================================================

JVM_BASE_FLAGS=(
  # Memory
  -XX:MaxRAMPercentage=70.0
  -XX:InitialRAMPercentage=50.0
  -XX:MaxMetaspaceSize=256m
  -Xss512k

  # GC: G1GC
  -XX:+UseG1GC
  -XX:MaxGCPauseMillis=200
  -XX:G1HeapRegionSize=4m
  -XX:InitiatingHeapOccupancyPercent=45

  # Diagnostics (non-negotiable)
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/var/log/jvm/heapdump.hprof
  -XX:+ExitOnOutOfMemoryError
  # Alternative: -XX:+CrashOnOutOfMemoryError aborts with an hs_err file instead.
  # Pick one of the two, not both.

  # GC Logging (unified logging syntax; quoted so the shell does not glob gc*)
  "-Xlog:gc*:file=/var/log/jvm/gc.log:time,uptime,level,tags:filecount=5,filesize=100m"

  # JFR continuous recording
  -XX:StartFlightRecording=settings=default,maxsize=100M,maxage=1h,name=continuous,filename=/var/log/jvm/recording.jfr

  # Compressed oops (auto-enabled <32GB heap)
  -XX:+UseCompressedOops
  -XX:+UseCompressedClassPointers
)

# ============================================================
# LOW-LATENCY CONFIGURATION (ZGC, for p99 < 10ms services)
# ============================================================

JVM_ZGC_FLAGS=(
  # Memory
  -XX:MaxRAMPercentage=70.0
  -XX:MaxMetaspaceSize=256m
  -Xss512k

  # GC: ZGC
  -XX:+UseZGC
  -XX:+ZGenerational   # JDK 21+ only; remove on JDK 17, which does not recognize this flag
  -XX:ConcGCThreads=4
  -XX:ParallelGCThreads=8

  # Diagnostics
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/var/log/jvm/heapdump.hprof
  -XX:+ExitOnOutOfMemoryError
  "-Xlog:gc*:file=/var/log/jvm/gc.log:time,uptime,level,tags:filecount=5,filesize=100m"
  -XX:StartFlightRecording=settings=default,maxsize=100M,maxage=1h,name=continuous
)

# ============================================================
# CONTAINER MEMORY CALCULATION
# ============================================================
#
# For a 4GB heap (-XX:MaxRAMPercentage=70.0 on a 5.7GB container):
#
#   Heap:           4000 MB  (70% of 5700MB)
#   Metaspace:       256 MB  (MaxMetaspaceSize)
#   Thread stacks:   200 MB  (400 threads × 512KB)
#   Direct memory:   256 MB  (needs -XX:MaxDirectMemorySize=256m; the default limit is roughly Xmx)
#   GC overhead:     200 MB  (G1GC bookkeeping)
#   Native/JNI:      300 MB  (JNI libraries, socket buffers)
#   OS overhead:     500 MB  (page cache, file descriptors)
#   ----------------------------------------
#   Total:          5712 MB  (container limit: 5.7GB)
#
# Formula: Container = Xmx × 1.43 (round up to nearest 256MB)
# ============================================================

echo "JVM flags configured for production deployment"
echo "Container memory recommendation: Xmx × 1.43"

Container Memory Budget — Every Byte Counts

Heap (Xmx): 70% of container memory. This is your working memory for objects.
Thread stacks: 512KB-1MB per thread. 400 threads = 200-400MB. Reduce with -Xss256k if call depth is shallow.
Metaspace: 100-256MB for most services. Set MaxMetaspaceSize to prevent runaway growth.
Direct memory: default equals Xmx. Set MaxDirectMemorySize explicitly if using NIO/Netty.
OS overhead: 300-500MB for page cache, file descriptors, socket buffers. Never allocate 100% of container memory to JVM.

Production Insight

A Kubernetes deployment set container memory limit to 4GB and -Xmx to 4GB. The service ran fine during normal traffic. During a traffic spike, the container was OOM-killed (exit code 137) every 2-3 hours. No JVM OOM error was logged — the OS killed the process before the JVM could detect the issue.

The team added NativeMemoryTracking and discovered the JVM was using 4.8GB total: 4GB heap + 300MB metaspace + 200MB thread stacks + 300MB direct memory. The container limit was 4GB, so the OS killed the process when total usage exceeded the limit.

Cause: -Xmx set equal to container memory limit with no room for off-heap. Effect: container OOM killer terminated the process. Impact: 3-5 restarts per day during peak traffic. Action: increased container limit to 6GB (4GB × 1.5), kept -Xmx at 4GB. Result: zero OOM kills.

Lesson: container memory limit must be 1.3-1.5x the heap size. The extra 30-50% covers off-heap usage that the JVM does not track against -Xmx.

Key Takeaway

Set container memory to 1.43x your heap size. Always enable heap dump on OOM, GC logging, and JFR. These three flags turn production memory incidents from guesswork into diagnosis. Without them, you are flying blind.

JVM Flag Configuration Decisions

Off-Heap Memory: Direct Buffers, Native Memory, and Thread Stacks

Most JVM memory guides focus exclusively on heap. In production, off-heap memory causes at least 30% of OOM incidents. The container OOM killer does not care whether the memory is heap or off-heap — it kills when total usage exceeds the limit.

Direct ByteBuffer — allocated via ByteBuffer.allocateDirect(). Lives outside the heap in native memory. Used by NIO channels, Netty, gRPC, and file I/O. The JVM tracks direct buffer usage against -XX:MaxDirectMemorySize (default = -Xmx). If direct buffer allocation exceeds this limit, you get OOM: Direct buffer memory.

The insidious part: direct buffers are freed by a ReferenceQueue-based cleaner, not immediately when the buffer is GC'd. If the application allocates direct buffers faster than the GC and cleaner can reclaim them, you get OOM even though the buffers are technically unreachable. This is a rate problem, not a leak problem.

Thread stacks — each thread has a stack of size -Xss. Default is 512KB-1MB. 500 threads × 1MB = 500MB. This memory is allocated at thread creation and never shrinks. In services with dynamic thread pools, thread count can grow under load, consuming more stack memory.
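
A rough upper bound on stack memory can be computed from the live thread count and the configured stack size. This sketch assumes a HotSpot JVM, where the `ThreadStackSize` flag is reported in KB; the class name and the 1024 KB fallback used when the flag reads 0 (platform default) are assumptions.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Estimates reserved thread stack memory; assumes a HotSpot-based JVM. */
final class ThreadStackEstimate {

    /** Assumed fallback when the JVM reports 0, meaning "platform default". */
    static final long ASSUMED_DEFAULT_STACK_KB = 1024;

    /** Stack size per thread in KB, as configured via -Xss / -XX:ThreadStackSize. */
    static long configuredStackKb() {
        try {
            HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            long kb = Long.parseLong(hotspot.getVMOption("ThreadStackSize").getValue());
            return kb > 0 ? kb : ASSUMED_DEFAULT_STACK_KB;
        } catch (RuntimeException e) {
            return ASSUMED_DEFAULT_STACK_KB;
        }
    }

    /** Upper bound in MB: every thread counted at its full stack size. */
    static long estimateMb(int threadCount, long stackKb) {
        return threadCount * stackKb / 1024;
    }

    public static void main(String[] args) {
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        long stackKb = configuredStackKb();
        System.out.println(threads + " threads x " + stackKb + " KB = up to "
            + estimateMb(threads, stackKb) + " MB of stack");
    }
}
```

The result is a ceiling, not a measurement: it counts each thread at its full configured size and ignores JVM-internal threads that use their own stack sizes.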

Metaspace: class metadata storage. Replaced PermGen in JDK 8. Grows as classes are loaded. Bounded by -XX:MaxMetaspaceSize when that flag is set; unbounded by default, so it can consume all native memory if not limited.

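JNI native memory: memory allocated by native libraries via JNI. The JVM does not track this, and NativeMemoryTracking does not see it either, because NMT covers only the JVM's own allocations. Common sources: database drivers (OCI, native JDBC), compression libraries (zlib, snappy), and cryptographic providers. Estimate it as the gap between the process's resident set size and the NMT total.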

MappedByteBuffer — file-backed memory mapping via FileChannel.map(). Maps file contents directly into process address space. Not counted against heap or MaxDirectMemorySize. Large memory-mapped files can trigger container OOM.
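
Direct and mapped buffer usage can both be read from inside the JVM through the platform `BufferPoolMXBean` instances, without enabling NMT. The sketch below simply lists what the JVM reports; the class name `BufferPoolReport` is illustrative, and pool names other than "direct" and "mapped" may appear depending on the JDK.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

/** Lists memory held by the JVM's NIO buffer pools. */
final class BufferPoolReport {

    /** Returns bytes used per buffer pool, keyed by pool name. */
    static Map<String, Long> usedBytesByPool() {
        Map<String, Long> result = new LinkedHashMap<>();
        for (BufferPoolMXBean pool
                : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            result.put(pool.getName(), pool.getMemoryUsed());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hold a 1 MB direct buffer so the "direct" pool has something to report
        ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024);
        usedBytesByPool().forEach((name, bytes) ->
            System.out.println(name + ": " + bytes / 1024 + " KB"));
        System.out.println("held buffer capacity: " + buffer.capacity());
    }
}
```

Exporting these values alongside heap metrics makes direct buffer growth visible before the container limit is reached.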

Diagnosis tool: NativeMemoryTracking (NMT). Enable with -XX:NativeMemoryTracking=detail. Query with jcmd <pid> VM.native_memory summary. NMT shows a memory breakdown by category, including Java Heap, Class (metaspace), Thread, Code, GC, Compiler, Internal, and Symbol.

Performance caveat: NMT adds 5-10% overhead in detail mode. Use -XX:NativeMemoryTracking=summary for production (1-2% overhead). Switch to detail mode only during active debugging.

Edge case: Netty's PooledByteBufAllocator recycles direct buffers to avoid allocation overhead. If the pool grows under load, it retains memory even after the buffers are released. Monitor Netty's pool metrics (PooledByteBufAllocator.metric()) to detect pool bloat.

off_heap_monitor.java (Java)

package io.thecodeforge.monitoring;

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

/**
 * Off-Heap Memory Monitor — tracks memory usage outside
 * the JVM heap that contributes to container OOM kills.
 */
public class OffHeapMonitor {

    public static class OffHeapReport {
        public long metaspaceUsedMB;
        public long metaspaceMaxMB;
        public long threadStackMB;
        public int threadCount;
        public long directMemoryMaxMB;
        public long compressedClassSpaceMB;
        public long codeCacheMB;
        public Map<String, String> recommendations = new HashMap<>();

        public long totalOffHeapMB() {
            return metaspaceUsedMB + threadStackMB
                + compressedClassSpaceMB + codeCacheMB;
        }
    }

    public static OffHeapReport analyze() {
        OffHeapReport report = new OffHeapReport();

        // Metaspace
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            MemoryUsage usage = pool.getUsage();

            if (name.contains("Metaspace")) {
                report.metaspaceUsedMB = usage.getUsed() / (1024 * 1024);
                report.metaspaceMaxMB = usage.getMax() > 0
                    ? usage.getMax() / (1024 * 1024) : -1;
            } else if (name.contains("Compressed Class Space")) {
                report.compressedClassSpaceMB = usage.getUsed() / (1024 * 1024);
            } else if (name.contains("Code Cache")) {
                report.codeCacheMB = usage.getUsed() / (1024 * 1024);
            }
        }

        // Thread stacks
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        report.threadCount = threadBean.getThreadCount();
        // Estimate: each thread uses -Xss (default ~1MB)
        // More accurate: check -XX:ThreadStackSize via VM flags
        report.threadStackMB = report.threadCount; // Rough estimate: 1MB per thread

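        // Direct memory limit.
        // sun.misc.VM is not accessible on JDK 9+, so read the VM option instead.
        // A value of 0 means the flag is unset and the limit defaults to
        // roughly the maximum heap size.
        try {
            com.sun.management.HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(
                    com.sun.management.HotSpotDiagnosticMXBean.class);
            long maxDirectMemory = Long.parseLong(
                hotspot.getVMOption("MaxDirectMemorySize").getValue());
            if (maxDirectMemory <= 0) {
                maxDirectMemory = Runtime.getRuntime().maxMemory();
            }
            report.directMemoryMaxMB = maxDirectMemory / (1024 * 1024);
        } catch (Exception e) {
            report.directMemoryMaxMB = -1;
        }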

        // Recommendations
        if (report.metaspaceUsedMB > 200) {
            report.recommendations.put("metaspace",
                "Metaspace using " + report.metaspaceUsedMB
                + "MB — check for classloader leaks");
        }
        if (report.threadCount > 300) {
            report.recommendations.put("threads",
                report.threadCount + " threads active — "
                + report.threadStackMB + "MB in stacks. "
                + "Consider reducing thread pool size or -Xss.");
        }
        long totalOffHeap = report.totalOffHeapMB();
        if (totalOffHeap > 1024) {
            report.recommendations.put("total",
                "Total off-heap: " + totalOffHeap + "MB. "
                + "Ensure container memory limit accounts for this.");
        }

        return report;
    }

    /**
     * Report bytes currently held by direct buffers.
     * Call this periodically to detect direct memory pressure.
     * Returns -1 if the JVM does not expose a "direct" buffer pool.
     */
    public static long getDirectMemoryUsedEstimate() {
        // The platform "direct" buffer pool reports ByteBuffer.allocateDirect usage
        // without a probe allocation. NMT gives a more complete picture.
        for (java.lang.management.BufferPoolMXBean pool
                : ManagementFactory.getPlatformMXBeans(
                    java.lang.management.BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }
}

The Hidden 30% — Off-Heap Memory Budget

Thread stacks: 512KB-1MB per thread. 400 threads = 200-400MB. This grows if thread pool scales up under load.
Metaspace: 100-256MB typical. Unbounded by default. Set MaxMetaspaceSize to prevent runaway growth.
Direct buffers: tracked by MaxDirectMemorySize. Default equals Xmx. Netty pools can retain memory even after release.
Native memory: JNI libraries, socket buffers, file descriptors. Not tracked by the JVM or by NMT. Estimate it as process RSS minus the NMT total.
MappedByteBuffer: file-backed mapping. Not counted against heap or direct memory. Large files can trigger container OOM.

Production Insight

A gRPC service using Netty experienced container OOM kills despite heap usage never exceeding 60%. The team was baffled — heap monitoring showed no pressure.

NativeMemoryTracking revealed the issue: Netty's PooledByteBufAllocator had grown to 1.8GB of direct buffers during a traffic spike. The pool retained these buffers even after the gRPC calls completed, waiting for reuse. The container had a 4GB limit, and demand was 2.4GB heap + 1.8GB Netty pool + 400MB other off-heap = 4.6GB total. The container OOM killer struck.

Cause: Netty pooled allocator retained 1.8GB of direct buffers. Effect: total memory exceeded the 4GB container limit. Impact: 4-6 container OOM kills per day during peak traffic. Action: set Netty's PooledByteBufAllocator maxOrder=8 (smaller chunks, so the pool retains less memory) and added -XX:MaxDirectMemorySize=512m. Result: direct buffer usage stabilized at 400MB, no OOM kills.

Lesson: Netty's buffer pool is off-heap and invisible to heap monitoring. Always monitor total JVM memory (heap + off-heap), not just heap.

Key Takeaway

Off-heap memory is invisible to heap monitoring but visible to the container OOM killer. Enable NativeMemoryTracking, monitor thread count, and set explicit limits for direct memory and metaspace. The container measures total memory, not just heap.

Off-Heap Memory Troubleshooting

Building a Production Memory Monitoring Stack

Memory incidents are preventable with the right monitoring. The goal is to detect problems hours before they cause OOM — not after.

Layer 1, JVM metrics (Prometheus/JMX): Expose heap usage, GC pause times, GC count, thread count, and metaspace usage via JMX. Use Micrometer or JMX Exporter for Prometheus integration. Key alerts:
- Old gen usage after GC > 70% for 10 minutes → warning
- Old gen usage after GC > 85% for 5 minutes → critical
- GC pause p99 > 500ms → warning
- GC pause p99 > 2s → critical
- Thread count > 80% of max pool size → warning
- Full GC count > 0 in last hour → investigate

Layer 2, Container metrics (cAdvisor/Kubernetes): Monitor container memory usage (not just JVM heap). Key alerts:
- Container memory > 85% of limit → warning
- Container memory > 95% of limit → critical (OOM imminent)
- Container restart count > 0 in last hour → investigate

Layer 3, Application-level metrics: Track object counts for known leak-prone structures: session cache size, connection pool size, thread-local count. These are domain-specific and catch leaks that JVM metrics miss.
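
The duration-qualified alerts in Layer 1 ("> 70% for 10 minutes") fire only when the value stays above the threshold for the whole window. In Prometheus this is what the `for` clause of an alerting rule expresses; the sketch below shows the same logic in plain Java for services that evaluate alerts in-process. The class name and API are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical alert rule: fires only after a value stays above a threshold for a full window. */
final class SustainedThreshold {

    private final double threshold;
    private final Duration window;
    private Instant breachedSince; // null while the value is at or below the threshold

    SustainedThreshold(double threshold, Duration window) {
        this.threshold = threshold;
        this.window = window;
    }

    /** Records one sample and reports whether the alert should fire now. */
    boolean record(Instant at, double value) {
        if (value <= threshold) {
            breachedSince = null; // any recovery resets the clock
            return false;
        }
        if (breachedSince == null) {
            breachedSince = at;
        }
        return Duration.between(breachedSince, at).compareTo(window) >= 0;
    }

    public static void main(String[] args) {
        SustainedThreshold warning = new SustainedThreshold(70.0, Duration.ofMinutes(10));
        Instant start = Instant.now();
        System.out.println(warning.record(start, 75.0));
        System.out.println(warning.record(start.plus(Duration.ofMinutes(10)), 78.0));
    }
}
```

Resetting on any sample below the threshold keeps a single post-GC dip from being ignored, at the cost of delaying alerts on a noisy signal.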

Alerting philosophy: Alert on trends, not thresholds. A heap at 80% is fine if it returns to 40% after GC. A heap at 60% is a problem if it never drops below 55% after GC. The post-GC baseline trend is the most important metric.
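
The post-GC baseline can be read directly from the JVM: `MemoryPoolMXBean.getCollectionUsage()` reports a pool's usage as of its most recent collection, where the collector supports it. The sketch below assumes the old generation pool has "Old" or "Tenured" in its name, the same convention the other examples in this guide rely on; the class name is illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Reads old gen occupancy as measured right after the last GC. */
final class PostGcBaseline {

    /** Percent of max in use, or -1 when max is undefined. */
    static double percentUsed(long usedBytes, long maxBytes) {
        return maxBytes > 0 ? 100.0 * usedBytes / maxBytes : -1;
    }

    /** Old gen occupancy after the most recent collection, or -1 if unavailable. */
    static double oldGenPostGcPercent() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Old") || name.contains("Tenured")) {
                MemoryUsage afterGc = pool.getCollectionUsage(); // null if unsupported
                if (afterGc != null) {
                    return percentUsed(afterGc.getUsed(), afterGc.getMax());
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("Old gen after last GC (%): " + oldGenPostGcPercent());
    }
}
```

Sampling this value periodically and alerting on its slope gives the trend signal described above; before the first collection of the pool it reads as 0.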

Automated remediation: For containerized services, configure liveness probes that check heap usage. If heap exceeds 90%, the probe fails and Kubernetes restarts the pod. This is a safety net, not a fix — but it prevents the service from running in a degraded state while you investigate.

Retention and analysis: Keep GC logs and heap dumps for at least 7 days. Memory leaks can take days to manifest. If you only keep 24 hours of logs, you lose the trend data needed for diagnosis. Store dumps in object storage (S3, GCS) with lifecycle policies.

Production insight: the monitoring stack itself must not consume significant memory. A common mistake is running a heavy APM agent (100-200MB overhead) alongside the JVM. In a 2GB heap container, the agent consumes 5-10% of total memory. Use lightweight agents (JMX Exporter <20MB) or expose metrics via an HTTP endpoint without an agent.

memory_metrics_exporter.java (Java)

package io.thecodeforge.monitoring;

import com.sun.management.OperatingSystemMXBean;
import java.lang.management.*;
import java.util.HashMap;
import java.util.Map;

/**
 * Memory Metrics Exporter — exposes JVM memory metrics
 * for Prometheus/monitoring integration.
 *
 * Lightweight alternative to heavy APM agents.
 * Estimated overhead: <5MB heap, <0.1% CPU.
 */
public class MemoryMetricsExporter {

    public static class MemoryMetrics {
        // Heap
        public long heapUsedMB;
        public long heapMaxMB;
        public long heapCommittedMB;
        public double heapUsagePercent;

        // Young gen
        public long youngGenUsedMB;
        public long youngGenMaxMB;

        // Old gen
        public long oldGenUsedMB;
        public long oldGenMaxMB;
        public double oldGenUsagePercent;

        // Off-heap
        public long metaspaceUsedMB;
        public long metaspaceMaxMB;
        public long threadCount;
        public long threadStackEstimateMB;
        public long directMemoryMaxMB;

        // GC
        public long youngGcCount;
        public long youngGcTimeMs;
        public long fullGcCount;
        public long fullGcTimeMs;
        public double gcTimePercent;

        // Container
        public long containerMemoryLimitMB;
        public long processPhysicalMemoryMB;
        public double containerUsagePercent;
    }

    public static MemoryMetrics collect() {
        MemoryMetrics m = new MemoryMetrics();
        MemoryMXBean memBean = ManagementFactory.getMemoryMXBean();

        // Heap
        MemoryUsage heap = memBean.getHeapMemoryUsage();
        m.heapUsedMB = heap.getUsed() / (1024 * 1024);
        m.heapMaxMB = heap.getMax() / (1024 * 1024);
        m.heapCommittedMB = heap.getCommitted() / (1024 * 1024);
        m.heapUsagePercent = (double) heap.getUsed() / heap.getMax() * 100;

        // Memory pools
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            MemoryUsage usage = pool.getUsage();
            if (name.contains("Eden") || name.contains("Survivor")) {
                m.youngGenUsedMB += usage.getUsed() / (1024 * 1024);
                if (usage.getMax() > 0) {
                    m.youngGenMaxMB += usage.getMax() / (1024 * 1024);
                }
            } else if (name.contains("Old") || name.contains("Tenured")) {
                m.oldGenUsedMB = usage.getUsed() / (1024 * 1024);
                m.oldGenMaxMB = usage.getMax() > 0
                    ? usage.getMax() / (1024 * 1024) : 0;
                m.oldGenUsagePercent = m.oldGenMaxMB > 0
                    ? (double) m.oldGenUsedMB / m.oldGenMaxMB * 100 : 0;
            } else if (name.contains("Metaspace")) {
                m.metaspaceUsedMB = usage.getUsed() / (1024 * 1024);
                m.metaspaceMaxMB = usage.getMax() > 0
                    ? usage.getMax() / (1024 * 1024) : -1;
            }
        }

        // GC stats
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            String name = gc.getName();
            if (name.contains("Young") || name.contains("Scavenge")
                    || name.contains("G1 Young")) {
                m.youngGcCount = gc.getCollectionCount();
                m.youngGcTimeMs = gc.getCollectionTime();
            } else if (name.contains("Old") || name.contains("MarkSweep")
                    || name.contains("G1 Old")) {
                m.fullGcCount = gc.getCollectionCount();
                m.fullGcTimeMs = gc.getCollectionTime();
            }
        }
        m.gcTimePercent = (double)(m.youngGcTimeMs + m.fullGcTimeMs)
            / uptimeMs * 100;

        // Threads
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        m.threadCount = threadBean.getThreadCount();
        m.threadStackEstimateMB = m.threadCount; // ~1MB per thread estimate

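        // Container / OS memory.
        // On container-aware JDKs these values reflect the cgroup limit; elsewhere
        // they describe the whole host. "Used" here is total minus free, so it covers
        // everything in the container, not only this JVM process.
        // The *PhysicalMemorySize getters are deprecated on newer JDKs in favor of
        // getTotalMemorySize() / getFreeMemorySize().
        try {
            OperatingSystemMXBean osBean = (OperatingSystemMXBean)
                ManagementFactory.getOperatingSystemMXBean();
            long totalPhysical = osBean.getTotalPhysicalMemorySize();
            long freePhysical = osBean.getFreePhysicalMemorySize();
            m.processPhysicalMemoryMB = (totalPhysical - freePhysical)
                / (1024 * 1024);
            m.containerMemoryLimitMB = totalPhysical / (1024 * 1024);
            m.containerUsagePercent = (double) m.processPhysicalMemoryMB
                / m.containerMemoryLimitMB * 100;
        } catch (Exception e) {
            // Not available on all JVMs; container fields stay at 0
        }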

        return m;
    }

    public static String toPrometheusFormat(MemoryMetrics m) {
        StringBuilder sb = new StringBuilder();
        sb.append("# HELP jvm_memory_heap_used_bytes JVM heap used\n");
        sb.append("# TYPE jvm_memory_heap_used_bytes gauge\n");
        sb.append("jvm_memory_heap_used_bytes ")
            .append(m.heapUsedMB * 1024 * 1024).append("\n\n");

        sb.append("# HELP jvm_memory_old_gen_usage_percent Old gen usage\n");
        sb.append("# TYPE jvm_memory_old_gen_usage_percent gauge\n");
        // Locale.ROOT keeps '.' as the decimal separator regardless of host locale
        sb.append("jvm_memory_old_gen_usage_percent ")
            .append(String.format(java.util.Locale.ROOT, "%.2f", m.oldGenUsagePercent))
            .append("\n\n");

        sb.append("# HELP jvm_gc_full_count Full GC count\n");
        sb.append("# TYPE jvm_gc_full_count counter\n");
        sb.append("jvm_gc_full_count ").append(m.fullGcCount).append("\n\n");

        sb.append("# HELP jvm_memory_container_usage_percent Container memory usage\n");
        sb.append("# TYPE jvm_memory_container_usage_percent gauge\n");
        sb.append("jvm_memory_container_usage_percent ")
            .append(String.format(java.util.Locale.ROOT, "%.2f", m.containerUsagePercent))
            .append("\n");

        return sb.toString();
    }
}

Three-Layer Memory Monitoring

Layer 1 (JVM): heap usage, GC pauses, GC count, metaspace, thread count. Catches heap leaks and GC problems.
Layer 2 (Container): total memory usage, restart count, OOM kill count. Catches off-heap issues that JVM metrics miss.
Layer 3 (Application): session cache size, connection pool size, custom object counts. Catches domain-specific leaks.
Alert on trends: post-GC old gen baseline rising = leak. Post-GC old gen stable = right-sizing issue.
Keep GC logs and heap dumps for 7+ days. Memory leaks take days to manifest. 24-hour retention loses trend data.
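
Layer 3 metrics are usually nothing more than the current size of a few known structures. The sketch below registers size suppliers and renders them in Prometheus text format; the `SizeGauges` class and the metric name are hypothetical, and a real service would more likely register the same suppliers with Micrometer.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical registry of application-level size gauges. */
final class SizeGauges {

    private final Map<String, LongSupplier> gauges = new LinkedHashMap<>();

    /** Registers a gauge whose value is read at scrape time. */
    void register(String metricName, LongSupplier size) {
        gauges.put(metricName, size);
    }

    /** Renders current values in Prometheus text exposition format. */
    String scrape() {
        StringBuilder sb = new StringBuilder();
        gauges.forEach((name, size) -> sb
            .append("# TYPE ").append(name).append(" gauge\n")
            .append(name).append(' ').append(size.getAsLong()).append('\n'));
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> sessionCache = new ConcurrentHashMap<>();
        SizeGauges gauges = new SizeGauges();
        gauges.register("app_session_cache_entries", () -> sessionCache.size());

        sessionCache.put("s1", "alice");
        sessionCache.put("s2", "bob");
        System.out.print(gauges.scrape());
    }
}
```

Because the supplier is evaluated on each scrape, the gauge costs nothing between scrapes and never holds a reference to cache entries.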

Production Insight

A team had comprehensive JVM monitoring (heap, GC, threads) but no container-level monitoring. They experienced intermittent OOM kills that their JVM metrics did not predict. The issue was off-heap growth from a native compression library that consumed 800MB during peak traffic.

Adding container memory monitoring (cAdvisor + Prometheus) immediately revealed the pattern: container memory grew to 95% of limit while heap stayed at 60%. The team added a container memory alert at 85% and got 30 minutes of warning before each OOM kill.

Cause: native compression library allocated 800MB outside JVM heap. Effect: container OOM kills with no JVM-level warning. Impact: 2-3 unexpected restarts per week. Action: added container memory alerting, reduced compression buffer size, increased container limit. Result: zero OOM kills, 30+ minute early warning on memory pressure.

Monitoring overhead: the JMX Exporter added <5MB heap overhead and <0.1% CPU. The alternative (Datadog APM agent) would have added 150MB heap overhead — 7.5% of the 2GB container. Lightweight monitoring is essential in memory-constrained containers.

Key Takeaway

Three layers of monitoring: JVM (heap, GC, threads), container (total memory, OOM kills), and application (caches, pools). Alert on post-GC old gen trends, not absolute values. Keep diagnostic data for 7+ days — leaks take time to manifest.

Memory Monitoring Stack Decisions

Memory Structure: The Territory You're Actually Debugging

Before you throw flags at a production issue, understand the terrain. The JVM divides memory into runtime data areas. Some are shared across all threads, some are private per thread. Confuse them and you'll chase ghosts.

The heap is where objects live and die. It's the garbage collector's playground. Created when the JVM starts, sized by -Xms (initial) and -Xmx (maximum). One heap per JVM process. Period. Objects allocated with new go here. References to them live in stack frames or in the fields of other heap objects.

Thread stacks are per-thread. Each stack frame holds local variables, partial results, and method invocations. Overflow a stack with deep recursion and you get StackOverflowError. No GC runs here — frames pop when methods return.

The Method Area stores class metadata, constant pools, and method bytecode. In modern HotSpot JVMs, this is Metaspace, native memory outside the heap (static variables themselves live on the heap alongside their Class object). PermGen is dead. Update your mental model.

Native Method Stacks handle JNI calls. PC registers track current instruction pointers per thread. You rarely touch these directly, but when a native memory leak hits, you'll thank yourself for knowing they exist.

Know the map. It beats wandering blind.

MemoryRegionsCheck.java (Java)

// io.thecodeforge — java tutorial
// Dump memory regions via JMX to see what's alive

import java.lang.management.*;

public class MemoryRegionsCheck {
    public static void main(String[] args) {
        MemoryMXBean memBean = ManagementFactory.getMemoryMXBean();
        System.out.println("=== Heap Memory ===");
        MemoryUsage heap = memBean.getHeapMemoryUsage();
        System.out.printf("Init: %d MB\n", heap.getInit() / 1048576);
        System.out.printf("Used: %d MB\n", heap.getUsed() / 1048576);
        System.out.printf("Max:  %d MB\n", heap.getMax() / 1048576);

        System.out.println("\n=== Non-Heap (Metaspace) ===");
        MemoryUsage nonHeap = memBean.getNonHeapMemoryUsage();
        System.out.printf("Init: %d MB\n", nonHeap.getInit() / 1048576);
        System.out.printf("Used: %d MB\n", nonHeap.getUsed() / 1048576);

        // Per-thread stack info not available via JMX — use jstack
        System.out.println("\nThread count: " + ManagementFactory.getThreadMXBean().getThreadCount());
    }
}

Output

=== Heap Memory ===
Init: 256 MB
Used: 184 MB
Max:  2048 MB

=== Non-Heap (Metaspace) ===
Init: 2 MB
Used: 67 MB

Thread count: 23

Production Trap:

Never leave Metaspace unbounded. Classloader leaks can inflate Metaspace to gigabytes. Cap it with -XX:MaxMetaspaceSize and monitor its usage. By default it is limited only by available native memory.

Key Takeaway

The JVM's memory regions are fixed battlefields. Know where you're fighting before you unsheathe a tool.

Heap Generations: Why Object Age Matters

The heap isn't one blob. It's generational: Young and Old (older JVMs also had PermGen, since replaced by Metaspace outside the heap). This design exists because most objects die young, often shortly after allocation, an observation known as the weak generational hypothesis. Generational GC exploits this.

Young Generation is the nursery. Newborn objects land in Eden. Minor GCs sweep here frequently. Survivor spaces (S0, S1) hold objects that have survived at least one minor GC, and every further collection an object survives increments its age. After a threshold (default 15, tunable via -XX:MaxTenuringThreshold), they're promoted to Old.

Old Generation is the long-term care ward. Objects that survive multiple young GCs end up here. Major GCs are slower and more expensive. If Old Gen fills, you get a Full GC — stop-the-world pause that can kill latency SLAs.

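Metaspace (replacing PermGen) stores class metadata. It grows as you load classes: dynamic proxies, lambdas, frameworks. It shrinks only when classes are unloaded, which requires their classloader to become unreachable and a collection cycle that performs class unloading.

Tooling: jstat -gcutil <pid> 1000 shows Eden, Survivor, and Old usage live, sampled every second. Watch Young GC frequency. If Old Gen grows monotonically, you have a leak or a sizing problem. Survivor space overflow? Objects are being promoted too early. Enlarge the survivor spaces by lowering -XX:SurvivorRatio (the ratio of Eden to one survivor space) or by growing the young generation.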

Generations aren't academic. They're where latency hides.

GenerationSizing.java (Java)

// io.thecodeforge — java tutorial
// Simulate object promotion to understand generation behavior

import java.util.ArrayList;
import java.util.List;

public class GenerationSizing {
    public static void main(String[] args) throws InterruptedException {
        List<byte[]> holder = new ArrayList<>();
        System.out.println("Allocating to trigger generational GC...");
        
        for (int i = 0; i < 100; i++) {
            // 100 KB objects — will survive young GC if we hold reference
            byte[] chunk = new byte[100 * 1024];
            if (i % 10 == 0) {
                holder.add(chunk);  // Keep reference to simulate promotion
            }
            Thread.sleep(50);
        }
        System.out.println("Done. Check jstat -gcutil <pid> to see Old Gen growth.");
        System.out.println("Holding " + holder.size() + " objects.");
        Thread.sleep(30000);  // Keep alive for inspection
    }
}

Output

Allocating to trigger generational GC...

Done. Check jstat -gcutil <pid> to see Old Gen growth.

Holding 10 objects.

// Run alongside: jstat -gcutil <pid> 1000

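// Illustrative output (your numbers will differ; newer JDKs print extra columns such as CCS):

// S0 S1 E O M YGC YGCT FGC FGCT GCT

// 12.34 0.00 45.67 22.10 78.90 42 0.234 1 0.045 0.279

// The O column rises as objects are promoted. This program allocates only about 10 MB, so on a default-sized heap you may see no collections at all; a small heap (for example -Xmx32m) makes the effect visible.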

Senior Shortcut:

If Old Gen grows after every Full GC, you're leaking. If Old Gen stays flat but Full GCs are frequent, resize. Use -XX:NewRatio=2 to give Old Gen 2/3 of heap. Adapt to your allocation rate.
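The "Old Gen after GC" number can also be read from inside the process. The sketch below is one way to do it, assuming a HotSpot collector whose old-generation pool name contains "Old Gen" or "Tenured" (true for G1, Parallel, and Serial; other collectors name their pools differently). The class name PostGcBaseline and its helpers are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper: reports old-generation occupancy as of the last collection.
class PostGcBaseline {
    /** True for pool names HotSpot typically uses for the old generation (an assumption). */
    static boolean looksLikeOldGen(String poolName) {
        String name = poolName.toLowerCase();
        return name.contains("old gen") || name.contains("tenured");
    }

    /** Bytes in use after the most recent collection of this pool, or -1 if unsupported. */
    static long postGcUsedBytes(MemoryPoolMXBean pool) {
        MemoryUsage afterGc = pool.getCollectionUsage(); // null when the pool does not support it
        return afterGc == null ? -1 : afterGc.getUsed();
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP && looksLikeOldGen(pool.getName())) {
                System.out.printf("%s post-GC used: %d bytes%n",
                        pool.getName(), postGcUsedBytes(pool));
            }
        }
    }
}
```

Sampling this value on a schedule and plotting it gives the baseline the shortcut above refers to. The value stays at 0 until the pool has been collected at least once.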

Key Takeaway

Generational GC is a bet on object mortality. If your objects live too long, you lose that bet. Watch promotion rates.

JConsole and VisualVM: Live Heap Reconnaissance

Before you dump the heap and wade through gigabytes of objects, JConsole and VisualVM give you a live feed of what the JVM is doing right now. Why this matters: memory leaks don't wait for a dump. They grow silently over minutes or hours. JConsole attaches to a running process and exposes real-time heap usage, thread activity, and GC behavior through JMX. VisualVM adds visual heap histograms, CPU profiling, and live object allocation traces. The practical workflow is direct: watch the heap graph trend upward while a full GC fails to bring it down — that's your leak signal. From there, you can take a targeted heap dump at the peak, not a blind one. Both tools also reveal thread stacks tied to memory growth, letting you spot forgotten caches, unbounded thread-local maps, or accumulating event queues. For production use, enable JMX with authentication and attach via a tunnel. Don't start a troubleshooting session without first looking at these live charts.

HeapMonitor.java (Java)

// io.thecodeforge — java tutorial

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapMonitor {
    public static void main(String[] args) throws InterruptedException {
        MemoryMXBean mxBean = ManagementFactory.getMemoryMXBean();
        while (true) {
            MemoryUsage heap = mxBean.getHeapMemoryUsage();
            long usedMB = heap.getUsed() / (1024 * 1024);
            long maxMB = heap.getMax() / (1024 * 1024);
            System.out.printf("Heap used: %d MB / %d MB%n", usedMB, maxMB);
            Thread.sleep(2000);
        }
    }
}

Output

Heap used: 128 MB / 2048 MB

Heap used: 156 MB / 2048 MB

Heap used: 201 MB / 2048 MB

Production Trap:

Enabling JMX without authentication on a public network is an open door. Always set com.sun.management.jmxremote.ssl and com.sun.management.jmxremote.authenticate to true, or use a local SSH tunnel.

Key Takeaway

Use JConsole or VisualVM to spot a memory leak in real time by watching the heap trend after a full GC — then take a targeted dump at the peak.

Unregister Listeners and Callbacks: The Hidden Leak Factory

The most insidious memory leaks in Java don't come from big collections. They come from a single listener you forgot to unregister. Why this matters: the component you register with holds a strong reference to the listener, and the listener (an inner class or capturing lambda) holds a reference to its enclosing object, so neither can be collected even after the object is logically dead. In Swing, Android, or any event-driven framework, adding a listener to a long-lived component (like a Service or Application) chains that listener's entire object graph to the component's lifecycle. The result: your application slowly fills with zombie objects that cannot be reclaimed. The fix is direct and boring: always unregister in the reverse order of registration, ideally in a finally block or using try-with-resources on a listener holder. Tools like Eclipse Memory Analyzer (MAT) expose these patterns: open the leaked object's incoming references (or its path to GC roots) and follow the chain back to the component that still holds the listener. Set up leak detection tests that assert listener count after a lifecycle event. Remember: if you add, you must remove.

SharedRegistry.java (Java)

// io.thecodeforge — java tutorial

import java.util.*;
import java.util.concurrent.CopyOnWriteArrayList;

public class SharedRegistry {
    private final List<Runnable> listeners = new CopyOnWriteArrayList<>();

    public void register(Runnable listener) {
        listeners.add(Objects.requireNonNull(listener));
    }

    public void unregister(Runnable listener) {
        listeners.remove(listener);  // must be called when done
    }

    public static void main(String[] args) {
        SharedRegistry reg = new SharedRegistry();
        reg.register(() -> System.out.println("leaked"));
        // forgetting unregister() creates a hard reference leak
    }
}

Output

// no output — leak occurs silently

Production Trap:

Anonymous inner classes or lambdas used as listeners cannot be removed unless you hold a reference to them. Store the lambda in a field before calling register, then unregister that same reference.
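One way to make "store the reference, then unregister it" hard to forget is to have register return a handle that removes the listener when closed. The following is a sketch of that idea, not part of the SharedRegistry class above; ClosableRegistry and Registration are hypothetical names.

```java
import java.util.List;
import java.util.Objects;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical variant of a listener registry that hands back a closeable handle.
class ClosableRegistry {
    /** Handle returned by register(); close() removes the listener again. */
    interface Registration extends AutoCloseable {
        @Override
        void close(); // no checked exception, so callers need no catch block
    }

    private final List<Runnable> listeners = new CopyOnWriteArrayList<>();

    Registration register(Runnable listener) {
        Objects.requireNonNull(listener);
        listeners.add(listener);
        // The handle captures the exact instance that was added, so removal always matches
        return () -> listeners.remove(listener);
    }

    int listenerCount() {
        return listeners.size();
    }

    public static void main(String[] args) {
        ClosableRegistry registry = new ClosableRegistry();
        try (Registration handle = registry.register(() -> System.out.println("event"))) {
            System.out.println("while registered: " + registry.listenerCount());
        }
        System.out.println("after close: " + registry.listenerCount());
    }
}
```

Because the handle works with try-with-resources, the listener is removed even if the body throws. For longer-lived owners, keep the handle in a field and close it in the owner's shutdown method.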

Key Takeaway

Every registered listener must be unregistered when the owning object is discarded — otherwise the object and its entire graph leak permanently.

● Production incident · Post-mortem · Severity: high

The Slow Leak That Killed Black Friday: HashMap Growth Under Concurrent Load

Symptom

Checkout service OOM crashed at 8:47 PM on Black Friday. The heap dump showed 7.8GB of the 8GB heap consumed by a single ConcurrentHashMap inside io.thecodeforge.service.CheckoutSessionManager. The map had 14.2 million entries. Normal baseline was 50,000 entries.

Assumption

The team initially assumed the heap was simply too small for Black Friday traffic. They doubled -Xmx from 4GB to 8GB and redeployed. The service ran for 6 hours before crashing again. The second heap dump showed the same pattern — CheckoutSessionManager holding 14+ million entries.

Root cause

CheckoutSessionManager stored session objects in a ConcurrentHashMap with a user_id key. The session cleanup thread was supposed to evict expired sessions every 60 seconds. Under high load, the cleanup thread was starved — it ran on a shared thread pool with the request handlers. During peak traffic, the request threads consumed all CPU, and the cleanup thread never got scheduled. Sessions accumulated indefinitely. Each session object held references to the full cart, user profile, and payment token — approximately 500 bytes per entry. At 14.2 million entries, that was 7.1GB.

Fix

Replaced the cleanup-thread pattern with Caffeine cache using expireAfterAccess(30, TimeUnit.MINUTES). Caffeine performs eviction as part of its own maintenance, triggered by cache reads and writes, so it does not depend on an application-managed cleanup thread. Set maximumSize(500_000) as a hard cap. Added a monitoring alert when session count exceeds 100,000. The fix reduced steady-state memory from 4GB to 800MB and eliminated the leak entirely.

Key lesson

Doubling heap without understanding the leak just delays the crash and makes the heap dump twice as large to analyze. Find the leak first, then right-size the heap.
Never use a plain Map for session storage with manual cleanup. Use a cache library (Caffeine, Guava) with built-in TTL eviction.
Background cleanup threads on shared thread pools get starved under load. If eviction is critical, give the cleanup thread a dedicated pool or use a library that does not need one.
Monitor object counts, not just heap usage. A service using 60% heap with 14 million Map entries is in worse shape than one using 80% heap with 50,000 entries.
Set a hard maximumSize on any unbounded collection that receives data from external sources. Unbounded growth is the root cause of most production OOMs.
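The last lesson, a hard size cap, can be illustrated with the standard library alone. The sketch below is a stand-in, not the Caffeine configuration used in the fix: it has no time-based expiry and uses a single coarse lock, so treat it as a demonstration of the principle. BoundedSessions and create are hypothetical names.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical size-capped map: evicts the least-recently-accessed entry when full.
class BoundedSessions {
    static <K, V> Map<K, V> create(int maxEntries) {
        // accessOrder=true keeps entries ordered from least to most recently accessed
        Map<K, V> lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries; // called after each put; true drops the eldest
            }
        };
        return Collections.synchronizedMap(lru); // coarse lock; fine for a sketch
    }

    public static void main(String[] args) {
        Map<String, String> sessions = create(3);
        sessions.put("u1", "cart-1");
        sessions.put("u2", "cart-2");
        sessions.put("u3", "cart-3");
        sessions.get("u1");            // touch u1 so u2 becomes the eldest
        sessions.put("u4", "cart-4");  // exceeds the cap, so u2 is evicted
        System.out.println(sessions.keySet());
    }
}
```

However many sessions arrive, the map never holds more than maxEntries, which turns unbounded growth into a bounded and predictable memory cost.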

Production debug guide

Symptom-to-action guide for the memory issues you will actually encounter at 2 AM (12 entries)

Symptom · 01

java.lang.OutOfMemoryError: Java heap space — service crashes with OOM

→

Fix

Check if -XX:+HeapDumpOnOutOfMemoryError was set. If yes, analyze the heap dump with Eclipse MAT or VisualVM (jhat was removed in JDK 9). Look at the dominator tree: the top object consuming memory is usually the leak source. If no heap dump was captured, add the flag immediately and wait for the next occurrence. In the short term, check jstat -gcutil to see if old gen is at 100% and not collecting.

Symptom · 02

java.lang.OutOfMemoryError: Metaspace — service crashes after multiple redeployments

→

Fix

Metaspace stores class metadata. A leak here means classloaders are not being garbage collected. Common in application servers (Tomcat, JBoss) with hot-redeploy. Check if your deployment pipeline redeploys without restarting the JVM. Fix: restart the JVM on redeploy, or investigate why old classloaders are still referenced. Increase -XX:MaxMetaspaceSize only as a temporary mitigation.

Symptom · 03

java.lang.OutOfMemoryError: Direct buffer memory — NIO or Netty service crashes

→

Fix

Direct memory is allocated outside the heap via ByteBuffer.allocateDirect(). The JVM tracks this separately. Check -XX:MaxDirectMemorySize (default is -Xmx value). Common cause: Netty ByteBuf not released, or NIO channels not closed. Use NativeMemoryTracking (NMT) with -XX:NativeMemoryTracking=detail to profile direct memory allocation. In Netty, enable ResourceLeakDetector at PARANOID level temporarily.
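Before reaching for NMT, the JVM's own accounting of ByteBuffer.allocateDirect() memory can be read through BufferPoolMXBean. The sketch below assumes the pool is named "direct", as it is on HotSpot; DirectBufferUsage is a hypothetical class name. Libraries that allocate native memory through other paths (Netty can be configured to do so) may not show up in this pool.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical helper: reads the JVM's bookkeeping for direct ByteBuffers.
class DirectBufferUsage {
    /** Bytes currently held by the "direct" buffer pool, or -1 if the pool is not exposed. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(8 * 1024 * 1024); // 8 MB off-heap
        long after = directBytesUsed();
        System.out.printf("direct pool: before=%d bytes, after=%d bytes%n", before, after);
        System.out.println("buffer capacity: " + buffer.capacity());
    }
}
```

Exporting this value as a metric makes a buffer leak visible as a steadily rising line well before MaxDirectMemorySize is hit.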

Symptom · 04

java.lang.StackOverflowError — thread crashes with deep recursion

→

Fix

Each thread has a fixed maximum stack size set by -Xss (default 512KB-1MB depending on OS). The error means the call stack exceeded this size. Common cause: infinite recursion, or very deep recursive algorithms. Check the stack trace for repeating method signatures. Fix: convert recursion to iteration, or increase -Xss. A larger -Xss raises the stack reservation for every thread: 1000 threads × 2MB = 2GB of reserved stack space.
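Converting recursion to iteration usually means moving the pending work from the thread stack into a heap-allocated collection. The sketch below shows the pattern on a binary-tree sum; the IterativeTraversal class and its Node type are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical example: the same traversal written recursively and iteratively.
class IterativeTraversal {
    static final class Node {
        final int value;
        Node left;
        Node right;
        Node(int value) { this.value = value; }
    }

    /** Recursive: one JVM stack frame per level, so a very deep tree can exceed -Xss. */
    static long sumRecursive(Node node) {
        return node == null ? 0 : node.value + sumRecursive(node.left) + sumRecursive(node.right);
    }

    /** Iterative: pending nodes live in a deque on the heap instead of the thread stack. */
    static long sumIterative(Node root) {
        long sum = 0;
        Deque<Node> pending = new ArrayDeque<>();
        if (root != null) pending.push(root);
        while (!pending.isEmpty()) {
            Node node = pending.pop();
            sum += node.value;
            if (node.left != null) pending.push(node.left);
            if (node.right != null) pending.push(node.right);
        }
        return sum;
    }

    /** Builds a degenerate tree: a chain of left children, each with value 1. */
    static Node chain(int depth) {
        Node root = new Node(1);
        Node current = root;
        for (int i = 1; i < depth; i++) {
            current.left = new Node(1);
            current = current.left;
        }
        return root;
    }

    public static void main(String[] args) {
        Node deep = chain(1_000_000);
        // sumRecursive(deep) would likely throw StackOverflowError at default -Xss
        System.out.println("iterative sum: " + sumIterative(deep));
    }
}
```

The iterative version is limited by heap size rather than by -Xss, which is normally a far larger budget and one that is shared rather than paid per thread.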

Symptom · 05

GC overhead limit exceeded — service becomes unresponsive, eventually OOM

→

Fix

The JVM spent more than 98% of the last few seconds doing GC and recovered less than 2% of heap. This means GC cannot free enough memory. Root cause is almost always a memory leak — objects are referenced and cannot be collected. Analyze heap dump for leak suspects. Temporary mitigation: increase -Xmx, but this only buys time. The fix is finding and eliminating the leak.

Symptom · 06

Service response time degrades gradually over hours, eventually OOM — no single leak object visible

→

Fix

This is a generational leak pattern. Objects promoted to old gen are never collected, but they are not a single large object — they are thousands of small objects from different code paths. Use jmap -histo:live periodically to track object count growth. Compare histograms over time. The object type with the fastest-growing count is the leak source. Check for unbounded caches, connection pools without limits, or thread-local variables not cleaned up.
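Comparing histograms by eye gets tedious quickly, so a small diff tool helps. The sketch below parses two jmap -histo outputs and ranks classes by instance-count growth. It assumes the usual row layout (rank, instances, bytes, class name); HistogramDiff, the sample rows, and com.example.Session are all made up for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: diffs two jmap -histo snapshots by instance count.
class HistogramDiff {
    // Matches data rows such as "   1:   12345   678900  java.lang.String (java.base@17)"
    private static final Pattern ROW =
            Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)");

    /** Parses histogram text into class name -> instance count; other lines are skipped. */
    static Map<String, Long> parse(String histo) {
        Map<String, Long> counts = new HashMap<>();
        for (String line : histo.split("\\R")) {
            Matcher m = ROW.matcher(line);
            if (m.find()) {
                counts.merge(m.group(3), Long.parseLong(m.group(1)), Long::sum);
            }
        }
        return counts;
    }

    /** Classes whose instance count increased, largest increase first, at most limit entries. */
    static List<Map.Entry<String, Long>> topGrowth(
            Map<String, Long> before, Map<String, Long> after, int limit) {
        List<Map.Entry<String, Long>> growth = new ArrayList<>();
        for (Map.Entry<String, Long> e : after.entrySet()) {
            long delta = e.getValue() - before.getOrDefault(e.getKey(), 0L);
            if (delta > 0) {
                growth.add(new AbstractMap.SimpleEntry<>(e.getKey(), delta));
            }
        }
        growth.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return growth.subList(0, Math.min(limit, growth.size()));
    }

    public static void main(String[] args) {
        // Made-up sample rows in jmap -histo layout
        String first = "   1:   50000  2400000  com.example.Session\n"
                     + "   2:    1000    64000  [B (java.base@17)\n";
        String second = "   1:   90000  4320000  com.example.Session\n"
                      + "   2:    1100    70400  [B (java.base@17)\n";
        for (Map.Entry<String, Long> e : topGrowth(parse(first), parse(second), 5)) {
            System.out.println(e.getKey() + " +" + e.getValue());
        }
    }
}
```

In practice the two inputs would be files captured some time apart. Note that jmap -histo:live forces a full GC, so schedule the captures with that pause in mind.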

Symptom · 07

Container killed by OOM killer (exit code 137) — no JVM OOM error logged

→

Fix

The OS killed the process because total memory (heap + off-heap + native) exceeded the container memory limit. The JVM did not OOM; the container did. Check if -Xmx is set to more than about 70% of container memory. The remaining 30% covers thread stacks, metaspace, direct memory, JNI native memory, and OS page cache. Use NativeMemoryTracking to profile total JVM memory usage. Adjust container limit or reduce -Xmx.

Symptom · 08

GC pause times exceed 1 second — service has high tail latency

→

Fix

Full GC is pausing all application threads. Check which GC is active (start the JVM with -XX:+PrintCommandLineFlags). If using Serial or Parallel GC, switch to G1GC (JDK 11+) or ZGC (JDK 15+). If already using G1GC, tune -XX:MaxGCPauseMillis (default 200ms), -XX:G1HeapRegionSize, and -XX:InitiatingHeapOccupancyPercent. Check if humongous allocations are causing premature GC: objects of at least 50% of a G1 region are humongous, and the region size is derived from the heap size (between 1MB and 32MB) unless you set it explicitly.

Symptom · 09

Heap usage spikes to 90% then drops to 30% — normal or leak?

→

Fix

This is the expected sawtooth pattern IF the drop happens after a full GC cycle and returns to the same baseline. A leak shows as: baseline increases over time. Track old gen usage after each full GC. If post-GC old gen usage grows over hours/days, you have a leak. Use jstat -gcutil <pid> 1000 to monitor. The key metric is old gen usage after full GC, not peak usage.

Symptom · 10

Service runs fine with 100 TPS but OOMs at 1000 TPS — not a leak, just load?

→

Fix

Possibly, but verify. Check if heap usage after GC is the same at both load levels. If post-GC baseline is the same, the issue is allocation rate exceeding GC throughput. Options: increase heap, switch to a lower-latency GC (ZGC/Shenandoah), or reduce allocation rate by object pooling or caching. If post-GC baseline is higher at 1000 TPS, you have a load-dependent leak — objects are referenced longer under concurrency.

Symptom · 11

Memory usage grows slowly over days — no OOM yet, but trending upward

→

Fix

Early leak detection. Take periodic heap histograms with jmap -histo:live and compare. Use JFR (Java Flight Recorder) with -XX:StartFlightRecording to capture allocation patterns over time. Look for object types whose count increases monotonically. Set up monitoring alerts for old gen growth rate — alert if post-GC old gen grows more than 5% per hour.
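The "5% per hour" alert rule boils down to a short calculation over post-GC samples. The sketch below compares only the first and last sample, which is deliberately crude; a real alert would more likely fit a trend over many points. OldGenTrend, its method names, and the sample values are assumptions for illustration.

```java
// Hypothetical helper: turns post-GC old gen samples into a growth rate.
class OldGenTrend {
    /** Percent growth per hour between the first and the last post-GC sample. */
    static double growthPercentPerHour(long[] epochMillis, long[] postGcUsedBytes) {
        int last = epochMillis.length - 1;
        if (last < 1 || postGcUsedBytes.length != epochMillis.length) {
            throw new IllegalArgumentException("need at least two paired samples");
        }
        double hours = (epochMillis[last] - epochMillis[0]) / 3_600_000.0;
        if (hours <= 0 || postGcUsedBytes[0] <= 0) {
            throw new IllegalArgumentException("samples must span time and start above zero");
        }
        double growth = (postGcUsedBytes[last] - postGcUsedBytes[0]) / (double) postGcUsedBytes[0];
        return 100.0 * growth / hours;
    }

    /** True when the baseline grows faster than the given threshold. */
    static boolean shouldAlert(long[] epochMillis, long[] postGcUsedBytes, double thresholdPercentPerHour) {
        return growthPercentPerHour(epochMillis, postGcUsedBytes) > thresholdPercentPerHour;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long[] times = {0L, 3_600_000L, 7_200_000L};      // samples at 0h, 1h, 2h
        long[] used = {1000 * mb, 1080 * mb, 1160 * mb};  // made-up post-GC old gen usage
        System.out.printf("growth: %.1f%% per hour%n", growthPercentPerHour(times, used));
        System.out.println("alert at 5%/h: " + shouldAlert(times, used, 5.0));
    }
}
```

With the sample data the baseline rises 16% over two hours, so the 5% per hour threshold is exceeded and the check reports an alert.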

Symptom · 12

OOM happens only in production, never in staging — same code, same -Xmx

→

Fix

Production data profiles differ from staging. Common causes: production has more unique users (larger session caches), more unique query patterns (larger query caches), or different traffic patterns (more concurrent connections). Compare object counts between environments using jmap -histo. The environment with higher counts reveals the data-dependent leak.

★ Quick Debug Cheat Sheet: Start Here When It Is 2 AM

You are on-call. The service is down. Use this to triage in under 60 seconds before diving deeper.

Pod killed (exit code 137): no JVM error in logs

Immediate action

Container OOM killer — total memory exceeded container limit

Commands

kubectl describe pod <pod> | grep -A5 "Last State"

kubectl top pod <pod> --containers

Fix now

Increase container memory limit to 1.43x your -Xmx, or reduce -Xmx to 70% of current container limit
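The 1.43x figure is simply the reciprocal of a 70% heap share (1 / 0.70 ≈ 1.43). The sketch below works the arithmetic in both directions; ContainerSizing and its method names are hypothetical, and the 0.70 fraction is this guide's rule of thumb rather than a JVM constant.

```java
// Hypothetical helper: converts between heap size and container limit.
class ContainerSizing {
    /** Container limit in MB so that the heap is heapFraction of the container. */
    static long containerLimitMb(long heapMb, double heapFraction) {
        return (long) Math.ceil(heapMb / heapFraction);
    }

    /** Largest heap in MB that keeps the heap at heapFraction of the container limit. */
    static long maxHeapMb(long containerLimitMb, double heapFraction) {
        return (long) Math.floor(containerLimitMb * heapFraction);
    }

    public static void main(String[] args) {
        double heapFraction = 0.70; // rule of thumb used in this guide
        long limit = containerLimitMb(4096, heapFraction);
        System.out.printf("-Xmx4096m needs a container limit of about %d MB (%.1f GB)%n",
                limit, limit / 1024.0);
        System.out.printf("An 8192 MB container supports a heap of about %d MB%n",
                maxHeapMb(8192, heapFraction));
    }
}
```

For a 4GB heap this gives roughly 5.7GB. Treat the result as a starting point and confirm the real off-heap footprint with NativeMemoryTracking.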

java.lang.OutOfMemoryError: Java heap space

Response times spiking: service is slow but not crashed

CPU at 100%: service is thrashing

java.lang.OutOfMemoryError: Metaspace

java.lang.OutOfMemoryError: Direct buffer memory

java.lang.StackOverflowError

Service runs for hours then OOM: slow leak

JVM Memory Issues Compared

Situation	Common Cause	Best Fix
OOM: Java heap space	Memory leak or undersized heap	Analyze heap dump with MAT. Find leak via dominator tree. Fix leak, then right-size heap.
OOM: Metaspace	ClassLoader leak in hot-redeploy environment	Restart JVM on redeploy. Avoid static references to dynamic classloaders. Use WeakHashMap.
OOM: Direct buffer memory	Netty/NIO buffer leak or insufficient MaxDirectMemorySize	Enable ResourceLeakDetector. Set MaxDirectMemorySize explicitly. Monitor with NMT.
GC overhead limit exceeded	Memory leak — GC cannot free enough memory	Analyze heap dump. Fix the leak. Increasing heap only delays the crash.
StackOverflowError	Infinite recursion or deep call stack	Convert recursion to iteration. Increase -Xss if deep recursion is intentional.
Container OOM kill (exit 137)	Total memory (heap + off-heap) exceeds container limit	Set container limit to 1.43x heap. Add NativeMemoryTracking. Monitor container memory.
GC pauses >1 second	Full GC on large heap with G1GC	Switch to ZGC (sub-ms pauses) or tune G1GC MaxGCPauseMillis and IHOP.
Memory grows but no single leak object	Distributed leak (ThreadLocal, unbounded cache)	Compare heap histograms over time. Check `ThreadLocal.remove()` and cache eviction.
OOM only at high traffic	Allocation rate exceeds GC throughput	Reduce allocation rate (object pooling, caching). Switch to higher-throughput GC.
OOM after code deployment	New code introduced leak or removed cleanup	Diff deployed code. Look for new caches, new ThreadLocal, removed eviction logic.
Heap at 80% but stable — no leak	Working set is legitimately large	Right-size heap. Working set × 2 is a good starting point. Not every high-usage is a leak.
Humongous allocations in GC logs	Objects >50% of G1 region size	Increase G1HeapRegionSize or refactor large byte[]/StringBuilder allocations.
SoftReference cache consuming all heap	JVM only collects SoftReferences under heap pressure	Switch to size-bounded cache (Caffeine) with explicit eviction.
Netty buffer pool growing unbounded	PooledByteBufAllocator retains buffers under load	Set maxOrder limit. Monitor pool metrics. Use -XX:MaxDirectMemorySize.

⚙ Quick Reference

13 commands from this guide

File	Command / Code	Purpose
symptom_tool_map.txt	SYMPTOM \| WHAT TO CHECK \| TOOL ...	Production Debugging Quick Map
jvm_debug_commands.sh	PID=$(pgrep -f 'java.*-Xmx') # Find JVM PID	Essential JVM Debug Commands
oom_type_detector.java	/**	Understanding the Five OOM Types
heap_dump_analyzer.java	/**	Heap Dump Analysis
gc_analyzer.java	/**	GC Tuning
leak_detector.java	/**	Memory Leak Patterns and Detection
production_jvm_flags.sh	JVM_BASE_FLAGS="	Production JVM Configuration
off_heap_monitor.java	/**	Off-Heap Memory
memory_metrics_exporter.java	/**	Building a Production Memory Monitoring Stack
MemoryRegionsCheck.java	public class MemoryRegionsCheck {	Memory Structure
GenerationSizing.java	public class GenerationSizing {	Heap Generations
HeapMonitor.java	public class HeapMonitor {	JConsole and VisualVM
SharedRegistry.java	public class SharedRegistry {	Unregister Listeners and Callbacks

Key takeaways

Five OOM types, five different causes, five different diagnostics. Match the error message to the correct tool before debugging.

Two heap dumps 30-60 minutes apart reveal leaks that a single dump cannot. Compare histograms to find the fastest-growing object type.

GC tuning is about trade-offs: throughput vs latency vs memory. Batch = Parallel GC. Real-time = ZGC. Everything else = G1GC.

Monitor post-GC old gen baseline, not peak heap usage. A rising baseline confirms a leak. Peak usage is irrelevant for leak detection.

Set container memory to 1.43x your heap size. Off-heap memory (thread stacks, metaspace, direct buffers) is invisible to heap monitoring but visible to the container OOM killer.

ThreadLocal and unbounded caches are the most common production leak sources. Always call ThreadLocal.remove() in a finally block. Always set maximumSize on caches.

Three non-negotiable production flags: -XX:+HeapDumpOnOutOfMemoryError, GC logging, and -XX:+ExitOnOutOfMemoryError. Without them, you are flying blind.

Enable NativeMemoryTracking to profile off-heap memory. Container OOM kills with normal heap usage indicate off-heap pressure.

Netty's PooledByteBufAllocator retains direct buffers even after release. Monitor pool metrics and set explicit MaxDirectMemorySize.

Keep GC logs and heap dumps for 7+ days. Memory leaks take days to manifest. 24-hour retention loses the trend data needed for diagnosis.

Print the symptom-to-tool map and the five essential commands. When the alert fires at 2 AM, you need to triage in 60 seconds, not 45 minutes.

Run fast commands first (jmap -histo, jstat -gcutil). Run slow commands (jmap -dump) only when fast commands do not reveal the issue.
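The ThreadLocal takeaway above is easiest to get right when set and remove live in one helper. The sketch below shows the try/finally shape on a pooled thread; RequestContext, runAs, and currentUser are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper: binds a value to the current thread and always clears it.
class RequestContext {
    private static final ThreadLocal<String> CURRENT_USER = new ThreadLocal<>();

    static String currentUser() {
        return CURRENT_USER.get();
    }

    /** Runs the task with the user bound to this thread, clearing the binding afterwards. */
    static void runAs(String user, Runnable task) {
        CURRENT_USER.set(user);
        try {
            task.run();
        } finally {
            // Pooled threads are reused, so a value left behind would outlive the request
            CURRENT_USER.remove();
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable request = () -> runAs("alice",
                    () -> System.out.println("during: " + currentUser()));
            pool.submit(request).get();

            // The next task runs on the same worker thread and sees no leftover value
            Callable<String> readBack = RequestContext::currentUser;
            System.out.println("after: " + pool.submit(readBack).get());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because remove() sits in the finally block, the binding is cleared even when the task throws, which is the case manual cleanup most often misses.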

Common mistakes to avoid

20 patterns