Senior 29 min · March 06, 2026
Java Profiling and Performance

G1 GC Concurrent Mode Failure — Fix 20s STW Pause

At 8k req/min, G1 GC's default IHOP=45 triggered Concurrent Mode Failure, causing 20-second pauses and 503 errors.

Naren Founder & Principal Engineer

20+ years shipping production Java in banking & fintech. Lessons pulled from things that broke in production.

Last updated: May 23, 2026
Quick Answer
  • Java profiling measures CPU, memory, thread, and I/O usage in a live application
  • JFR (Java Flight Recorder) is built-in, low-overhead event recording for continuous monitoring
  • async-profiler combines CPU sampling, allocation profiling, and wall-clock profiling in one agent
  • Flame graphs visualize call stacks as rectangles; width = time spent, color = function type
  • Biggest mistake: optimizing without profiling first — you'll guess wrong every time
  • Performance insight: JFR targets <1% overhead; async-profiler adds ~2% during sampling
Definition (~90s read)
What is Java Profiling and Performance?

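G1 GC Concurrent Mode Failure is the garbage collector's emergency brake: it happens when the concurrent marking phase can't finish before the heap runs out of free regions for new allocations. (The name is inherited from CMS; G1's own logs report the same condition as 'to-space exhausted', followed by a 'Pause Full' entry.) The JVM then falls back to a stop-the-world (STW) full GC, single-threaded before JDK 10 and parallel since, that can freeze your application for 20 seconds or more.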

This isn't a tuning knob gone wrong; it's a fundamental capacity signal that your live data set exceeds what G1 can handle concurrently. In production, this manifests as sudden latency spikes, dropped connections, and timeout cascades — the kind of outage that wakes you at 3 AM.

To diagnose this, you need two tools: JFR (Java Flight Recorder) for low-overhead GC event tracing, and async-profiler for flame graphs that show where CPU time and allocation pressure are coming from. JFR's GC events, such as jdk.GarbageCollection, jdk.G1GarbageCollection, jdk.EvacuationFailed, and jdk.GCHeapSummary, give you exact timestamps and heap occupancy around the failure.

Async-profiler's allocation profiling mode (-e alloc) reveals the call stacks responsible for the bulk of object churn. Together, they tell you whether the problem is a memory leak (objects accumulating in long-lived collections) or a transient allocation storm (bursty request patterns).

Heap analysis with a tool like Eclipse MAT or JProfiler is the next step — you're looking for the 'dominant' object paths that consume 80% of the retained heap. Common culprits are unbounded caches, thread-local buffers that never clear, or ORM session objects holding references to entire result sets.

GC tuning (increasing heap size, adjusting -XX:G1HeapRegionSize, or lowering -XX:InitiatingHeapOccupancyPercent so marking starts earlier) is a temporary bandage; the real fix is reducing allocation rate or fixing the leak. In production, run the JVM with -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints before attaching a sampling profiler, so stack samples are not attributed only to safepoint locations, and use jcmd or async-profiler's start/stop commands to bound each session.

The golden rule: profile for 30 seconds, not 30 minutes — you're hunting a symptom, not collecting a baseline.

Plain-English First

Imagine your Java app is a restaurant kitchen. Orders are coming in, but food is taking forever to reach customers. Profiling is like installing cameras and timers on every chef, every station, and every oven — so you can see exactly WHO is slow, WHERE the bottleneck is, and WHY the kitchen is on fire. Without profiling, you're just guessing. With it, you walk straight to the broken fryer.

Most Java performance problems don't announce themselves. They show up as mysterious latency spikes at 2am, a heap that grows 10MB per hour until the app dies, or a thread pool that silently saturates under load while your dashboards look green. These aren't bugs in the traditional sense — they're invisible tax your code pays at scale, and they're the kind of thing that separates senior engineers from everyone else.

Profiling is the discipline of measuring your running application to find exactly where CPU time, memory, threads, and I/O are going — before you optimize anything. The cardinal sin in performance work is optimizing without data. You'll almost always guess wrong, spend a week tuning the wrong method, and make the codebase harder to read for zero gain. Good profiling tools give you a flame graph that says 'this one method accounts for 43% of your CPU' — and suddenly the path forward is obvious.

This guide gets straight to the point: which profiling tool for which symptom, how to read flame graphs and heap dumps without getting lost, the production safeguards that prevent profiling from becoming the cause of your next incident, and the one workflow that makes performance tuning repeatable. Time to get into the details.

Here's the hard truth: most teams waste weeks on performance work because they skip the baseline. You need to know what 'normal' looks like before you can spot abnormal. That means setting up JFR on day one, even if you're not debugging anything. The data you collect when everything is fine is your most valuable asset when things break.

What G1 GC Concurrent Mode Failure Really Means

Java performance profiling is the practice of measuring and analyzing JVM behavior under load to identify bottlenecks, memory pressure, and GC overhead. The core mechanic is sampling thread stacks, heap usage, and GC pause times so you can correlate application latency with JVM internals. Without profiling, you're guessing at root causes.

In practice, profiling focuses on three properties: allocation rate (MB/s), object lifetime distribution, and GC cycle frequency. A high allocation rate forces the garbage collector to work harder, increasing pause times. For G1, the critical metric is the time between concurrent marking cycles — if the heap fills before marking completes, you get a Concurrent Mode Failure and a full STW pause.
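The race described above can be put into rough numbers. What follows is a minimal sketch, assuming you already know how fast the heap is filling and how long a concurrent marking cycle typically takes (both readable from GC logs). The class name, method names, and figures are hypothetical, and the model deliberately ignores space reclaimed by young and mixed collections.

```java
/** Back-of-the-envelope model of the race between heap growth and concurrent marking. */
final class MarkingRace {

    /** Seconds until the remaining free space is consumed at a constant growth rate. */
    static double secondsUntilFull(double freeMb, double growthMbPerSec) {
        if (growthMbPerSec <= 0) {
            return Double.POSITIVE_INFINITY; // heap is not growing
        }
        return freeMb / growthMbPerSec;
    }

    /** True when the heap would fill before a marking cycle of the given length completes. */
    static boolean markingLosesRace(double freeMb, double growthMbPerSec, double markingSeconds) {
        return secondsUntilFull(freeMb, growthMbPerSec) < markingSeconds;
    }

    public static void main(String[] args) {
        // Hypothetical figures: 2 GB free when marking starts, 8-second marking cycle.
        System.out.println(markingLosesRace(2048, 100, 8)); // 20.48 s of headroom
        System.out.println(markingLosesRace(2048, 500, 8)); // 4.096 s of headroom
    }
}
```

If the second case matches your service, the options are the ones this guide keeps returning to: start marking earlier, add headroom, or slow the growth rate.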

Use profiling when latency spikes exceed 100ms, throughput drops, or you see GC logs with 'to-space overflow' or 'Concurrent Mode Failure'. In real systems, a 20-second STW pause during a Black Friday event can cost millions. Profiling tells you whether to tune heap size, adjust IHOP, or rewrite allocation-heavy code paths.

Don't Tune Blind
Concurrent Mode Failure is not a heap size problem alone — it's a symptom of allocation rate exceeding the concurrent collector's throughput. Always profile allocation rate first.
Production Insight
A payment processing service saw 20-second STW pauses every 3 hours during peak load.
The G1 log showed 'Concurrent Mode Failure' with 'to-space overflow' right before each pause.
Rule of thumb: once allocation rate passes roughly 500 MB/s, G1's concurrent marking is at risk of falling behind. Increase heap or reduce allocation rate before tuning GC flags.
Key Takeaway
Profiling reveals allocation rate, not just heap usage — that's the real driver of GC pauses.
Concurrent Mode Failure means the collector lost the race against allocation — fix the rate, not the flags.
Always measure pause time distribution (P99, P99.9) before and after any GC tuning.
[Figure: G1 GC Concurrent Mode Failure Resolution (thecodeforge.io). From detection to fix: Concurrent Mode Failure (STW pause >20s when GC can't keep up) → JFR & async-profiler (capture GC events and CPU) → Flame Graph Analysis (identify allocation hotspots and GC threads) → Heap Dump & OQL (find the memory leak via object query language) → GC Tuning (adjust heap size, region count, and thresholds) → Stable Production (eliminate concurrent mode failures). Warning shown in the figure: profiling in containers can fail on a missing /proc/self/maps; use -XX:+UseContainerSupport and mount /proc.]

Core Profiling Tools: JFR and async-profiler

Two tools dominate modern Java profiling: Java Flight Recorder (JFR) and async-profiler. JFR has been built into OpenJDK since version 11 (it was previously a commercial Oracle feature). It records fine-grained events (GC pauses, allocations, JIT compilations, I/O operations) with roughly 1% overhead under the default template. You start it with jcmd <pid> JFR.start and dump a recording file later. No JVM restart needed.

async-profiler is an open-source agent that uses a combination of perf_events (on Linux) and a custom JVMTI agent to produce CPU and allocation flame graphs. It's the go-to tool for ad-hoc profiling because you can attach it to a running process, collect a 30-second sample, and get an interactive HTML flame graph you can share with the team. It supports two sampling modes: CPU (only samples running threads) and wall (samples all threads, including those blocked on IO or locks).

The choice depends on your use case: JFR for continuous, always-on monitoring (think of it like a JVM black box); async-profiler for targeted investigations when you suspect a specific function or class.

One important nuance: JFR's profile.jfc template samples method stacks at a fixed frequency, while async-profiler uses kernel sampling via perf_events. The two approaches can give different results for very short methods. Always validate hot spots with both tools if the impact is high.

Production trap: async-profiler's CPU mode relies on perf_events, which can conflict with container CPU limits in Kubernetes. If your pod is throttled, you'll see inflated CPU percentages in the flame graph. Always correlate with host-level metrics.

There's a nuance though: if you're in a container without SYS_ADMIN, async-profiler CPU mode won't work. Fall back to -e itimer or JFR. We'll cover container profiling later.

One more thing: don't sleep on JFR's event streaming API. Since JDK 14, you can subscribe to JFR events programmatically — no dump files, no file I/O. You get a live stream of GC pauses, allocation ticks, and lock contention as they happen. It's the best way to build custom monitoring without adding agents.
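As an illustration of the streaming API, here is a minimal sketch that watches the current JVM's own jdk.GarbageCollection events and prints a warning when the longest pause in a collection exceeds a threshold. The class name, the 200 ms threshold, and the short default watch window are assumptions made for the example; it needs JDK 14 or later.

```java
import java.time.Duration;
import jdk.jfr.consumer.RecordingStream;

final class GcPauseWatcher {

    /** True when a pause is strictly longer than the alert threshold. */
    static boolean exceeds(Duration pause, Duration threshold) {
        return pause.compareTo(threshold) > 0;
    }

    public static void main(String[] args) throws InterruptedException {
        Duration threshold = Duration.ofMillis(200); // hypothetical alert threshold
        long watchSeconds = args.length > 0 ? Long.parseLong(args[0]) : 5;

        try (RecordingStream stream = new RecordingStream()) {
            stream.enable("jdk.GarbageCollection");
            stream.onEvent("jdk.GarbageCollection", event -> {
                Duration longest = event.getDuration("longestPause");
                if (exceeds(longest, threshold)) {
                    System.err.printf("Long GC pause: %d ms (collector=%s, cause=%s)%n",
                            longest.toMillis(), event.getString("name"), event.getString("cause"));
                }
            });
            stream.startAsync();               // events are delivered on a background thread
            Thread.sleep(watchSeconds * 1000); // a real service would keep the stream open
        }
    }
}
```

In a long-running service the stream would be started once at boot and the callback would feed a metrics counter rather than stderr.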

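Another hidden feature: JFR can record network and file I/O. The jdk.SocketRead, jdk.SocketWrite, jdk.FileRead, and jdk.FileWrite events are enabled in the bundled default and profile templates, but only for operations slower than a threshold (tens of milliseconds), so no extra JVM flags are needed; lower the threshold in a custom .jfc file to capture faster operations. This helps when the bottleneck is external.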

JFR's event streaming API is your secret weapon for real-time observability — use it to trigger alerts on allocation rate spikes without polluting your logs.

Also consider combining both tools: run JFR continuously, and when you see an anomaly in the JFR dashboard, attach async-profiler for a deep dive. This hybrid approach gives you both broad coverage and surgical precision. The cost is minimal — JFR overhead is <1% and async-profiler is temporary.

Production story: A team used async-profiler CPU mode and saw no hot methods. But JFR's allocation profile revealed a massive allocation rate from a logging framework at DEBUG level. They turned down log verbosity, saving 20% CPU — a fix that CPU profiling alone couldn't find.

profiling_commands.sh (Bash)

# Start JFR recording with default template (low overhead)
jcmd <pid> JFR.start name=monitor duration=60s filename=/tmp/recording.jfr

# Dump an active recording without stopping it
jcmd <pid> JFR.dump name=monitor filename=/tmp/recording_dump.jfr

# async-profiler 2.x ships profiler.sh; in 3.x the launcher is named asprof
# CPU profile with async-profiler for 30 seconds
./profiler.sh -e cpu -d 30 -f flame_graph.html <pid>

# Allocation profile (async-profiler's alloc event)
./profiler.sh -e alloc -d 30 -f alloc.html <pid>

# Wall-clock profile (includes blocked threads)
./profiler.sh -e wall -d 30 -f wall.html <pid>
Tool Selection Rule
If you don't know which tool to use, start with JFR. Its default template records a broad set of events at roughly 1% overhead, and you can analyze the recording later. Flame graphs are faster to read but you need to know what to look for.
Production Insight
JFR overhead is roughly 1% with the default template; the profile template and custom settings that enable every event cost noticeably more.
async-profiler's CPU mode uses perf_events on Linux, so it can conflict with container CPU limits.
Rule: always test profiling overhead on a canary instance before attaching to production traffic.
JFR's event streaming API allows real-time allocation rate alerts; use it to catch leaks early, for example by exporting to Prometheus via jfr-exporter.
async-profiler's wall-clock mode is essential for lock contention; CPU mode alone gives a false sense of low CPU.
Key Takeaway
Use JFR for continuous background monitoring.
Use async-profiler for targeted, ad-hoc diagnostic sessions.
The combo covers every common profiling scenario without restarting your JVM.
JFR's event streaming API is your real-time observability layer — use it.
Don't forget to test both tools in your specific environment before an emergency.
Choose Your Tool: JFR vs async-profiler
If: Continuous monitoring for historical analysis
Use: JFR with the default template (24/7).
If: Ad-hoc investigation of a specific symptom
Use: async-profiler with the relevant event (cpu/alloc/wall).
If: Need both long-term recording and instant flame graphs
Use: Run JFR continuously and async-profiler on demand.
If: Need to correlate GC pauses with application latency
Use: JFR to get GC event timestamps and match them with request latency.

Reading Flame Graphs: What to Look For

A flame graph is a visual representation of a set of sampled stack traces. The x-axis groups stack frames alphabetically (it is not a timeline), and the width of each rectangle is proportional to the number of samples that included that frame. The y-axis is the stack depth: the top is the function actually running, and below it are its callers. Color typically indicates the frame type; in async-profiler's default palette, green is Java, yellow is C++ (JVM internals), red is native user code, and orange is kernel.
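To make "width is proportional to samples" concrete, here is a small sketch that computes each frame's width from collapsed-stack lines, the "root;caller;leaf count" text format that flame graph tools consume. The class name and the sample data are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class FrameWidths {

    /**
     * Takes lines in collapsed-stack format ("root;caller;leaf 42") and returns,
     * for each frame, the fraction of all samples in which it appears.
     */
    static Map<String, Double> inclusiveShare(List<String> collapsedLines) {
        Map<String, Long> counts = new TreeMap<>();
        long total = 0;
        for (String line : collapsedLines) {
            if (line.isBlank()) {
                continue;
            }
            int split = line.lastIndexOf(' ');
            long samples = Long.parseLong(line.substring(split + 1).trim());
            total += samples;
            // Count each frame once per stack so recursion does not inflate its width.
            Set<String> frames = new LinkedHashSet<>(Arrays.asList(line.substring(0, split).split(";")));
            for (String frame : frames) {
                counts.merge(frame, samples, Long::sum);
            }
        }
        Map<String, Double> share = new TreeMap<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            share.put(entry.getKey(), total == 0 ? 0.0 : (double) entry.getValue() / total);
        }
        return share;
    }

    public static void main(String[] args) {
        // Hypothetical profile: three distinct stacks, 100 samples in total.
        List<String> samples = List.of(
                "main;handleRequest;parseJson 60",
                "main;handleRequest;writeResponse 30",
                "main;housekeeping 10");
        inclusiveShare(samples).forEach(
                (frame, share) -> System.out.printf("%-15s %.0f%%%n", frame, share * 100));
    }
}
```

A caller such as handleRequest is as wide as all of its callees combined, which is why a wide frame low in the stack is usually a signpost and not the hot spot itself.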

When reading a flame graph, start at the top and look for the widest frames. Those are your hot spots. A common trap is staring at a wide frame that's a low-level method like Unsafe.park(): that's not the culprit; its caller is. Always trace wide frames downward, toward their callers, to find the application code that triggers them. If the graph shows many thin, tall towers, you have deep call stacks, often from recursion or heavily layered frameworks. If the graph looks like a plateau (many wide frames at similar depth), you have multiple hot spots.

Another pattern: a 'mountain' shape with a single wide top — that's your one bottleneck. A 'volcano' with multiple peaks — load is spread across several paths; optimising any one may shift the bottleneck without much improvement overall.

Real-world mistake: One team saw a wide frame for java.util.HashMap.put() and assumed they needed a faster hash map. But the flame graph showed it was called from a logging framework at DEBUG level. Turning down log verbosity fixed the CPU usage in 30 seconds. The frame width lies without context.

You'll also encounter 'icicle' graphs (inverted), where the root is at the top and stacks grow downward. Same reading technique, flipped: the running functions are now along the bottom edge, so look for the widest frames there.

Here's a rule I've learned the hard way: always generate a flame graph during both peak and off-peak load. A flame graph from a quiet period shows you nothing useful. The hot spots only reveal themselves under pressure. Profile under load or don't profile at all.

Also note: flame graphs can be misleading for lock contention. A thread blocked on a lock doesn't appear CPU-sampled. That's why you need wall-clock mode. Always cross-reference flame graphs with thread dumps.

Short-lived methods can be invisible in CPU flame graphs — use JFR to capture them. Every hot spot you fix changes the shape of the graph; retest after each optimisation.

One advanced tip: use differential flame graphs to compare two profiling sessions. For example, compare before and after a deployment. The diff shows exactly which methods got hotter or cooler. This is invaluable for detecting performance regressions. async-profiler doesn't generate diffs natively, but you can use the FlameGraph toolkit's difffolded.pl script.

Another common trap: reading a flame graph from a single sample. Always take multiple samples over time to see the trend. A 30-second sample may miss intermittent spikes.

flame_graph_reading.sh (Bash)

# Write collapsed (folded) stacks from async-profiler
./profiler.sh -e cpu -d 30 -o collapsed -f profile.collapsed <pid>
# Convert to SVG using the FlameGraph toolkit (stackcollapse-perf.pl is only
# needed for `perf script` output, not for async-profiler's collapsed format)
flamegraph.pl --colors=java profile.collapsed > flame.svg

# Or use async-profiler's built-in HTML output (easier to share)
./profiler.sh -e cpu -d 30 -f flame.html <pid>

# Differential flame graph (requires two collapsed profiles)
./profiler.sh -e cpu -d 30 -o collapsed -f before.collapsed <pid_before>
./profiler.sh -e cpu -d 30 -o collapsed -f after.collapsed <pid_after>
# Colors show where "after" gained or lost samples relative to "before"
difffolded.pl before.collapsed after.collapsed | flamegraph.pl > diff.svg
Flame Graph Shapes
  • Mountain shape: one dominant bottleneck — optimize this method.
  • Volcano shape: multiple hot spots — improving one may shift bottleneck.
  • Plateau shape: many methods consuming roughly equal time — focus on allocation or I/O.
Production Insight
Flame graphs aggregate over time — they can hide short-lived spikes.
Always correlate with latency metrics: a wide frame may be innocent if it runs during idle periods.
Biggest mistake: interpreting width as 'bad' — wide may mean nothing if the function is expected to take time (e.g., waiting on DB).
Short methods can be invisible; use JFR's method profiling for complete picture.
Differential flame graphs are the best tool for catching regressions; make them part of every post-deploy verification.
Key Takeaway
Look for widest frames at the top.
Trace them downward through their callers to find the application code responsible.
Correlate flame graphs with latency percentiles, not just averages.
A mountain shape is easier to fix than a plateau.
Use differential flame graphs to spot regressions instantly.

Heap Analysis: Finding the Leak

Memory leaks in Java are almost never about unreachable objects — those get GC'd. The real leaks come from accidental retention: objects that remain reachable but are no longer needed. Common patterns include static collections (caches without eviction), thread-local variables that accumulate, or JDBC statements not closed. Profiling a leak means taking a heap dump at a point when you suspect the heap has grown, then analyzing it to find the 'dominator' objects that hold the most memory.

Eclipse MAT (Memory Analyzer Tool) is the de facto standard for this. You load a heap dump, run the 'Leak Suspects' report, and it highlights the biggest retained sets. The 'Dominator Tree' view shows which objects would be freed if a given root were removed. For example, if a single HashMap holds 90% of the heap with 2 million stale entries, that's the leak.

Important: take a live heap dump (jmap -dump:live) to exclude unreachable objects. Full heap dumps include all objects and take much longer to process. Always capture a few dumps over time to see the growth rate — one snapshot can't tell you if the growth is a leak or just a large but stable cache.
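When shelling into the host is not an option, the same dump can be requested from inside the process. This is a minimal sketch using the HotSpot-specific HotSpotDiagnosticMXBean; the class name and the temporary output location are assumptions for illustration, and the call is only available on HotSpot-based JVMs.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

final class HeapDumper {

    /**
     * Writes a heap dump of the current JVM. With liveOnly = true only reachable
     * objects are written, which makes the JVM run a full GC first.
     */
    static Path dump(Path target, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // dumpHeap refuses to overwrite an existing file and expects a .hprof suffix.
        bean.dumpHeap(target.toAbsolutePath().toString(), liveOnly);
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("heapdump"); // hypothetical location
        Path file = dump(dir.resolve("live-" + System.currentTimeMillis() + ".hprof"), true);
        System.out.println("Wrote " + file + " (" + Files.size(file) + " bytes)");
    }
}
```

Putting a timestamp in the file name makes it straightforward to keep several dumps side by side and compare growth between them.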

Also consider using JFR's allocation profiling to find the call sites that produce the most garbage. Sometimes the fix is not to remove the collection but to reduce its creation rate.

Production nuance: A heap dump from a process that's about to OOM might be truncated, so critical objects could be missing. Always capture a second dump after recovery to compare. Also, the Leak Suspects report is a heuristic; always verify by examining the paths to GC roots for the suspect objects in the dominator tree.

I once saw a Leak Suspects report point at a HashMap in logging, but the real culprit was a thread-local cache in the user session handler. Always cross-check with the thread overview and dominator tree.

Here's the uncomfortable truth about heap analysis: by the time you notice the leak, it's been running for days. You need to calculate the leak rate. Take a dump, wait an hour, take another. If the retained heap of a suspect class grew by 200MB, you have a leak rate of ~3.3MB/min. That number tells you how long you have before the next OOM — and whether a hotfix can wait until the next release cycle.
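The arithmetic is simple enough to script. A small sketch, with hypothetical class and method names, that turns two retained-heap readings into a leak rate and an estimate of the time left:

```java
/** Leak-rate arithmetic from two heap dumps taken a known interval apart. */
final class LeakRate {

    /** Growth of a suspect's retained heap, in MB per minute. */
    static double mbPerMinute(double retainedBeforeMb, double retainedAfterMb, double minutesBetween) {
        if (minutesBetween <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        return (retainedAfterMb - retainedBeforeMb) / minutesBetween;
    }

    /** Minutes until the remaining headroom is used up at the given leak rate. */
    static double minutesUntilExhausted(double headroomMb, double rateMbPerMinute) {
        return rateMbPerMinute <= 0 ? Double.POSITIVE_INFINITY : headroomMb / rateMbPerMinute;
    }

    public static void main(String[] args) {
        // The example from the text: 200MB of growth between two dumps one hour apart.
        double rate = mbPerMinute(1200, 1400, 60);
        // Hypothetical headroom: 2,000MB of heap still free.
        double minutesLeft = minutesUntilExhausted(2000, rate);
        System.out.printf("leak rate %.2f MB/min, about %.0f minutes of headroom%n", rate, minutesLeft);
    }
}
```

The estimate assumes linear growth; a leak tied to traffic will run faster at peak, so treat the result as an upper bound on the time available.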

One more tip: prefer jcmd over jmap on JDK 11+. jcmd <pid> GC.heap_dump writes only live objects by default, which means it runs a full GC first; pass -all to include unreachable objects and skip that collection.

Leak rate calculation is your timeline to failure — use two dumps taken an hour apart to compute it. A leak rate of 5MB/min means you have hours, not days.

For advanced users: combine heap analysis with allocation profiling. Use async-profiler's -e alloc to see which code paths create the most objects, then correlate those call sites with the objects found in the heap dump. This cross-reference is far more powerful than either technique alone.

A story: A team's leak suspect report showed a HashMap in logging, but after using OQL to find the largest entries, they discovered it was a cache in their session handling. The fix was adding TTL eviction, reducing heap growth from 10MB/min to 0.

heap_analysis.sh (Bash)

# Trigger a live heap dump (forces a full GC first)
jmap -dump:live,format=b,file=heap_dump_$(date +%s).hprof <pid>

# Or use jcmd (preferred for JDK 11+); dumps live objects only by default,
# add -all to include unreachable objects
jcmd <pid> GC.heap_dump /tmp/live_dump.hprof

# Automatically dump on OutOfMemoryError (add to JVM flags)
# -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/

# Analyze with Eclipse MAT (GUI) or OQL
# MAT -> File -> Open Heap Dump -> Leak Suspects Report

# Allocation profile with async-profiler
./profiler.sh -e alloc -d 30 -f alloc.html <pid>
Don't Forget Allocation Profiling
Heap dumps show you what's alive, not where it was created. Use async-profiler's -e alloc mode alongside heap dumps to pinpoint the code that allocates the most memory.
Production Insight
A heap dump from a process that's about to crash may be truncated — important objects may be missing.
Always take a second dump after the first to confirm the trend.
The 'Leak Suspects' report is a heuristic; verify by examining the paths to GC roots in the dominator tree.
Compute leak rate by comparing two dumps taken a fixed interval (for example, an hour) apart.
Combine heap dumps with allocation profiling for a complete picture.
Key Takeaway
Leaks hide in accidental retention, not unreachable objects.
Take live dumps with jcmd or jmap -dump:live.
Use MAT's Dominator Tree to find the one HashMap or ArrayList that holds everything.
Pair with allocation profiling to find the source.
Don't trust the Leak Suspects report blindly — verify with the dominator tree.
Calculate leak rate: two dumps, one hour apart, retained heap delta.
Choosing Heap Analysis Technique
If: Suspected leak from a static collection
Use: MAT Dominator Tree to find the collection's retained heap.
If: Suspected large objects (e.g., strings, byte arrays)
Use: OQL to find instances with @usedHeapSize > threshold.
If: Suspected thread-local accumulation
Use: MAT thread overview to examine each thread's retained heap.

GC Tuning: The Production Reality

GC tuning is the most overrated performance activity. Most applications need none — default settings with G1 (Java 9+) or Parallel (pre-9) work fine up to moderate loads. Tuning only matters when you have evidence from profiling that GC is causing latency or throughput issues. That evidence comes from JFR GC events or from explicit GC logs.

When you do need to tune, the three most impactful knobs are:
1. Heap size (-Xms, -Xmx): Too small causes frequent GC, too large causes long pauses. Start by setting initial and max to the same value to avoid resizing overhead.
2. Pause time goal (-XX:MaxGCPauseMillis for G1): The GC tries to keep pauses under this, but may increase frequency or reduce throughput to meet it. A tight goal (like 10ms) forces more frequent young GCs.
3. InitiatingHeapOccupancyPercent (G1): When old-generation occupancy reaches this percentage of the heap, G1 triggers concurrent marking. Lower it to start earlier (reduces risk of concurrent mode failure). The default is 45%, which G1 treats as a starting point and adjusts at runtime unless you disable -XX:+G1UseAdaptiveIHOP; reducing it to 30% gives more time for concurrent work.

For ZGC and Shenandoah, the story is different: they aim for sub-millisecond pauses at the cost of some CPU overhead (ZGC uses load barriers, Shenandoah uses forwarding pointers). They shine on very large heaps (100GB+) but have higher baseline CPU usage. Profile your allocation rates first — if you're allocating 50GB/min, no GC will be happy.

One subtle trap: the -XX:MaxGCPauseMillis flag is a goal, not a guarantee. G1 will adjust the region set to try to meet it, but under high allocation pressure it may not be achievable. Monitor gc+pause logs to see if the target is consistently missed.
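Checking whether the goal is being missed comes down to pulling pause durations out of the log. The sketch below assumes JDK 9+ unified logging, where pause lines contain the word "Pause" and end in a millisecond figure; the class name, the regex, and the sample line in the comment are illustrative and may need adjusting to your -Xlog decorations.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GcPauseStats {

    // Matches lines such as:
    // [2.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(4096M) 12.345ms
    private static final Pattern PAUSE =
            Pattern.compile("\\bPause\\b.*?\\s(\\d+(?:\\.\\d+)?)ms\\s*$");

    /** Extracts pause durations in milliseconds; non-pause lines are ignored. */
    static List<Double> pauseMillis(List<String> logLines) {
        List<Double> pauses = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = PAUSE.matcher(line);
            if (m.find()) {
                pauses.add(Double.parseDouble(m.group(1)));
            }
        }
        return pauses;
    }

    /** Nearest-rank percentile; p is in (0, 100]. */
    static double percentile(List<Double> values, double p) {
        if (values.isEmpty()) {
            throw new IllegalArgumentException("no samples");
        }
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }

    /** Fraction of pauses that exceeded the pause-time goal. */
    static double fractionOver(List<Double> values, double targetMs) {
        if (values.isEmpty()) {
            return 0.0;
        }
        long over = values.stream().filter(v -> v > targetMs).count();
        return (double) over / values.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: GcPauseStats <gc.log> <targetMs>");
            return;
        }
        List<Double> pauses = pauseMillis(Files.readAllLines(Path.of(args[0])));
        double target = Double.parseDouble(args[1]);
        System.out.printf("pauses=%d p99=%.1fms over-target=%.1f%%%n",
                pauses.size(), percentile(pauses, 99), fractionOver(pauses, target) * 100);
    }
}
```

Concurrent phases are logged without the word "Pause", so they are excluded on purpose: only stop-the-world time counts against the goal.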

Real example: A team set MaxGCPauseMillis=50 on a 32GB heap with 80GB/min allocation rate. G1 could not meet the target and started triggering back-to-back young GCs, causing 30% throughput loss. They had to increase the pause target to 200ms and optimise allocation rates instead.

Default G1 settings assume moderate allocation rates (~100MB/s). If you're allocating >500MB/s, you need to tune regardless of heap size. The allocation rate is the real determinant of GC pressure.
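Before comparing against thresholds like these, it helps to measure what a given code path actually allocates. One rough way, sketched below, is the HotSpot-specific com.sun.management.ThreadMXBean, which reports bytes allocated per thread. The class name and the synthetic workload are hypothetical, and the figure covers only the calling thread, not the whole JVM.

```java
import java.lang.management.ManagementFactory;

final class AllocationMeter {

    // Keeps allocations reachable so the JIT cannot optimize them away.
    static volatile Object sink;

    /** Bytes allocated by the current thread so far, or -1 if the JVM does not report it. */
    static long allocatedBytes() {
        java.lang.management.ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        if (bean instanceof com.sun.management.ThreadMXBean) {
            com.sun.management.ThreadMXBean hotspot = (com.sun.management.ThreadMXBean) bean;
            if (hotspot.isThreadAllocatedMemorySupported() && hotspot.isThreadAllocatedMemoryEnabled()) {
                return hotspot.getThreadAllocatedBytes(Thread.currentThread().getId());
            }
        }
        return -1;
    }

    /** Runs the task on the current thread and returns its allocation rate in MB/s. */
    static double measureMbPerSec(Runnable task) {
        long startBytes = allocatedBytes();
        long startNanos = System.nanoTime();
        task.run();
        long nanos = Math.max(System.nanoTime() - startNanos, 1);
        long bytes = allocatedBytes() - startBytes;
        return (bytes / (1024.0 * 1024.0)) / (nanos / 1e9);
    }

    public static void main(String[] args) {
        // Synthetic workload: 256 arrays of 1 MiB each.
        double rate = measureMbPerSec(() -> {
            for (int i = 0; i < 256; i++) {
                sink = new byte[1 << 20];
            }
        });
        System.out.printf("allocation rate: %.0f MB/s%n", rate);
    }
}
```

For a JVM-wide figure, JFR's allocation events or async-profiler's alloc mode remain the better source; this helper is for spot-checking a single code path in a test or benchmark.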

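Here's what most guides won't tell you: -XX:G1HeapRegionSize matters more than people think. Left unset, G1 sizes regions ergonomically, aiming for roughly 2,048 regions with sizes between 1MB and 32MB, so a large heap normally gets large regions on its own. The trouble starts when a small explicit value is carried over from an older config: a 64GB heap pinned to 2MB regions has 32,768 regions. That's a lot of tracking overhead, and every object larger than half a region becomes a humongous allocation. Bump it to 16MB or 32MB. Fewer regions, less bookkeeping, better pause predictability. I've seen this single change reduce mixed GC pauses by 40%.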
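The region arithmetic is easy to check for your own heap. A tiny sketch with hypothetical names; it computes only the region count and the size above which G1 treats an object as humongous (about half a region), not G1's own ergonomic choice of region size.

```java
/** Region arithmetic for a G1 heap with an explicit region size. */
final class G1Regions {

    static final long MB = 1024L * 1024L;
    static final long GB = 1024L * MB;

    /** Number of regions a heap of the given size is divided into. */
    static long regionCount(long heapBytes, long regionBytes) {
        if (regionBytes <= 0 || Long.bitCount(regionBytes) != 1) {
            throw new IllegalArgumentException("region size must be a power of two");
        }
        return heapBytes / regionBytes;
    }

    /** Objects above roughly this size are allocated as humongous objects. */
    static long humongousThresholdBytes(long regionBytes) {
        return regionBytes / 2;
    }

    public static void main(String[] args) {
        for (long regionMb : new long[] {2, 16, 32}) {
            System.out.printf("64GB heap, %dMB regions: %d regions, humongous above ~%dMB%n",
                    regionMb,
                    regionCount(64 * GB, regionMb * MB),
                    humongousThresholdBytes(regionMb * MB) / MB);
        }
    }
}
```

Larger regions cut the region count and raise the humongous threshold at the same time, which is why the change can help services that allocate large buffers.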

Allocation rate is the real enemy — if it's >500MB/s, GC tuning alone won't save you. Profile allocation sources and fix them before touching GC flags.

Another subtle point: the -XX:+UseStringDeduplication flag can reduce memory usage in applications with many duplicate strings. But it adds CPU overhead. Profile before enabling to ensure net gain.

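Don't forget to check your GC logs for premature promotion and promotion failures. If young objects are being promoted to old gen early because survivor space is too small, old gen fills faster and full GCs become more likely. Adjust -XX:SurvivorRatio or the young generation bounds to give young gen more room; with G1, prefer -XX:G1NewSizePercent and -XX:G1MaxNewSizePercent, because fixing the young size with -XX:NewRatio or -Xmn overrides the pause time goal.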

A story: a team was seeing 20s STW pauses. They reduced IHOP from 45 to 30 and increased heap from 4GB to 8GB. Pauses dropped to <100ms. They also fine-tuned G1HeapRegionSize from 2MB to 16MB, further reducing mixed GC pauses by 40%. The key was profiling allocation rates first.

gc_tuning.sh (Bash)

# Enable GC logging (JDK 9+ unified logging)
-Xlog:gc*:file=gc.log:time,utctime,level,tags

# Tuning flags for a 16GB heap with tight pause targets
-Xms16g -Xmx16g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=100
-XX:InitiatingHeapOccupancyPercent=35
# G1NewSizePercent is an experimental flag, hence the unlock option
-XX:+UnlockExperimentalVMOptions
-XX:G1NewSizePercent=5
-XX:G1HeapRegionSize=16m

# Monitor GC pause times from JFR
jcmd <pid> JFR.start name=gcrecord duration=120s filename=gc.jfr settings=profile
The 10% Rule
If GC pauses account for less than 10% of total run time, tuning GC will have limited impact on throughput. Focus on reducing allocation rates instead.
Production Insight
Default G1 settings work for 90% of apps — don't tune unless profiling shows GC as a top 3 hot spot.
Concurrent mode failure is the production killer; always monitor GCCause in logs.
ZGC and Shenandoah reduce pauses but steal CPU cycles — profile the trade-off before switching.
G1HeapRegionSize tuning is underutilized: on large heaps, 16MB or 32MB regions can reduce mixed GC pauses by 40%.
Allocation rate >500MB/s requires code fixes, not just GC flags.
Key Takeaway
GC tuning is the last resort after CPU and allocation profiling.
Tune only the three knobs: heap size, pause time goal, and IHOP.
Always validate changes under production load, not synthetic benchmarks.
If you're allocating faster than GC can collect, tune the allocation first.
Monitor promotion failures and survivor space sizing — they cause hidden full GCs.
Allocation rate is the real enemy; tune code before GC flags.
GC Tuning Decision Flow
If: GC pauses exceed latency SLO
Use: Check heap size first; if already large, reduce MaxGCPauseMillis or switch to ZGC.
If: Concurrent mode failure observed
Use: Reduce InitiatingHeapOccupancyPercent (try 30 to 35).
If: Allocation rate >500MB/s with high GC throughput
Use: Reduce allocation rate before tuning GC; profile call sites and pool objects.

Production-Safe Profiling: Do's and Don'ts

Profiling in production requires caution. The wrong tool or command can pause your JVM for seconds or even minutes. Here's what works safely:

Do use: jcmd for JFR commands (no long JVM pause), async-profiler with perf_events (zero overhead when not sampling), and jstack for thread dumps (brings all threads to a brief safepoint, which is acceptable).

Don't use: jmap -histo:live on a struggling JVM, because it forces a full GC; plain jmap -histo skips the GC but counts garbage along with live objects, which makes the numbers misleading. jmap -clstats can slow down the JVM, and jhat was removed in JDK 9. Avoid old JVMTI agents (like HPjmeter) that require the JVM to be started with -agentpath.

Rule of thumb: If a profiling command requires you to add JVM flags and restart, test it on staging first. If it attaches to a running process and claims <5% overhead, it's likely safe for production. Always start with a 10-second sample on a single instance, verify that latency and throughput on that instance hold steady, then expand to longer durations.

Never profile every instance in a cluster simultaneously — the aggregate overhead can saturate host resources and cause a cascading failure.

Real production horror: An engineer attached async-profiler to all 20 instances of a payment service simultaneously. The CPU overhead from perf_events caused cascading timeouts. The team had to kill the profiler and restart half the cluster. Rule: one instance at a time.

I once saw an engineer attach async-profiler to a JVM that had 95% heap usage. The profiler triggered additional memory allocation and the process OOM'd within 30 seconds. Rule: never profile a JVM that's over 80% heap. The one command that's safe in any state is jstack. Even when heap is 99% full, jstack still works.

Another thing nobody tells you: JVM TI agents (including async-profiler) can cause transient performance degradation during attachment and detachment. The JVM needs to safepoint all threads to load the agent. On a 64GB heap with 200 threads, that safepoint can take 200-500ms. Schedule your profiling sessions during maintenance windows or off-peak hours.

Also watch out: async-profiler attaches through jattach, which talks to the JVM over a Unix domain socket created under /tmp. In some container environments (like those with a read-only root filesystem, or a /tmp that isn't shared with the JVM), the attach may fail. In that case, load the agent at JVM startup with -agentpath, or fall back to JFR.

Never profile a JVM over 80% heap — you'll push it into OOM. jstack is the only safe command in critical state.

One more safe practice: use async-profiler's explicit start and stop actions instead of a fixed duration, so you control exactly when sampling begins and ends and can stop immediately if the host degrades. For example: ./profiler.sh start -e cpu <pid>, then ./profiler.sh stop -f flame.html <pid>.

Critical: Know When to Walk Away
If your JVM is already in a critical state (heap near OOM, threads deadlocked, CPU pinned), do NOT attach any profiling tool. The extra allocation or bytecode instrumentation can push it over the edge. Instead, take thread dumps (jstack) or, once the instance is out of the load balancer, a heap dump (jmap -dump:live), and analyze offline. Neither needs an agent inside the JVM, though the heap dump pauses the application while it is written.
Production Insight
When the JVM is above 80% heap usage, the safest profiling is no profiling at all.
Always have a rollback plan: know how to detach a profiler quickly.
Monitor host CPU after attaching async-profiler — it can cause additional load.
Jattach may fail in read-only container filesystems — test your attach method in staging.
Use explicit start and stop actions so you can end a profiling session the moment the host degrades.
On a 64GB heap with 200 threads, agent attachment safepoint can take 200-500ms; plan accordingly.
Key Takeaway
Profiling in production is safe — with the right tools and caution.
Start small, test one instance, never profile the whole cluster at once.
When in doubt, fall back to safe commands: jcmd, jstack, and live heap dumps.
If the JVM is above 80% heap, don't profile — just dump and run.
Know your attach method — jattach may fail in containers.
Safe vs Unsafe Profiling Actions
If: JVM health is critical (>80% heap)
Use: jstack only; take jmap -dump:live after pulling the instance out of traffic (no async-profiler).
If: JVM is stable but you need a quick flame graph
Use: Attach async-profiler for 15 seconds on a canary instance.
If: Need continuous monitoring with low risk
Use: Start a JFR recording with the default template (low overhead, no agent attach).

Profiling in Containers: Docker and Kubernetes Pitfalls

Containerized Java apps introduce new profiling complications that can lead to false data or no data at all. The core issue: perf_events (used by async-profiler's CPU mode) are restricted inside containers unless the container runs with elevated privileges or specific sysctl settings.

In Docker, you need --cap-add=SYS_ADMIN or --security-opt seccomp=unconfined to allow async-profiler CPU profiling, because the default seccomp profile blocks the perf_event_open syscall. Without that, the profiler fails with "No access to perf events." The safer workaround is async-profiler's -e itimer mode, which uses a timer signal instead of perf_events: slightly less precise (no kernel frames), but it works in unprivileged containers.

In Kubernetes, the situation is trickier. Even with perf_event_open allowed, container CPU limits via CFS can cause the profiler to see inflated CPU percentages because the kernel throttles the container. Your flame graph might show a wide __schedule frame — that's the throttling, not your code. Always correlate with container CPU usage metrics from cgroups.

JFR works reliably in containers because it doesn't depend on perf_events. However, JFR recordings from inside a container reflect only the container's view of CPU and memory. If you have a shared node, the JFR data won't show other containers' resource contention. Use kubectl top or node-level Prometheus metrics to get the full picture.

Production story: A team saw a recurring "CPU spike every 5 minutes" in their Kubernetes dashboard. async-profiler showed sun.rmi.transport.tcp.TCPTransport.handleMessages at the top — it was JMX RMI heartbeat threads battling with CFS throttling. They switched to a non-blocking JMX connector and the spikes disappeared.

If async-profiler CPU mode fails in your container, the first thing to check is the container's capabilities. If you can't add SYS_ADMIN, switch to -e itimer or use JFR.

Here's the container profiling trap I keep seeing: teams deploy JFR but never look at the recordings because they don't have JDK Mission Control in their workflow. Set up automated JFR dump collection. Have a cron job copy the last hour of JFR data to object storage. When an incident happens, you have the evidence waiting for you.

Another nuance: some Kubernetes platforms (OpenShift, GKE sandbox) block perf_event_open entirely. In that case, JFR is your only option. Always test your profiling toolchain on your specific container platform before production emergencies.

Use cat /sys/fs/cgroup/cpu/cpu.stat (on cgroup v2: /sys/fs/cgroup/cpu.stat) to see throttled time. If nr_throttled > 0, CFS is impacting your CPU profile.

A common mistake: assuming that a container with 2 CPU cores has full access to both. If the CPU limit is set, CFS throttles the container when it exceeds its quota. This shows up as __schedule in flame graphs. The fix is to either increase CPU limits or reduce allocation rates to stay under the throttling threshold.
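To put a number on the throttling, you can read the same cpu.stat file from code. This is a minimal sketch that parses the `nr_periods` and `nr_throttled` counters and reports the share of CFS periods in which the container was throttled. The class name `CpuThrottleCheck` is made up, and the default path is the cgroup v1 location used in this article (cgroup v2 keeps the file at /sys/fs/cgroup/cpu.stat).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: how often was this container throttled by CFS? */
final class CpuThrottleCheck {

    /** Parses "key value" lines from a cgroup cpu.stat file into a map. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> stats = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                stats.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return stats;
    }

    /** Fraction of CFS periods in which the container was throttled (0.0 to 1.0). */
    static double throttledRatio(Map<String, Long> stats) {
        long periods = stats.getOrDefault("nr_periods", 0L);
        long throttled = stats.getOrDefault("nr_throttled", 0L);
        return periods == 0 ? 0.0 : (double) throttled / periods;
    }

    public static void main(String[] args) throws IOException {
        // Pass another path as the first argument when running on cgroup v2
        Path path = Path.of(args.length > 0 ? args[0] : "/sys/fs/cgroup/cpu/cpu.stat");
        Map<String, Long> stats = parse(Files.readAllLines(path));
        System.out.printf("throttled in %.1f%% of periods%n", 100 * throttledRatio(stats));
    }
}
```

A ratio well above zero during the profiling window is a hint that a wide __schedule frame reflects throttling rather than your code.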


container_profiling.shBASH
# Docker: run with perf_events capability
docker run --cap-add=SYS_ADMIN -p 8080:8080 my-java-app

# Or use itimer mode (no special permissions needed)
./profiler.sh -e itimer -d 30 -f flame.html <pid>

# Kubernetes: enable perf_event_open (if container runtime allows)
# Add to container securityContext:
# securityContext:
#   capabilities:
#     add: ["SYS_ADMIN"]

# JFR always works — no special container setup needed
jcmd <pid> JFR.start name=container_profile duration=60s filename=contained.jfr

# Check CPU throttling status
cat /sys/fs/cgroup/cpu/cpu.stat
Container Profiling Layers
  • JFR: works inside any container, but doesn't see other pods' resource usage.
  • async-profiler perf_events: needs extra privileges; reflects container-limited CPU time.
  • async-profiler itimer: works without privileges; timer-based, slightly less accurate but safe.
  • Node-level tools (perf top, /sys/fs/cgroup): give the host-level perspective you need for cross-container contention.
Production Insight
async-profiler's perf_events CPU mode fails in restricted containers ("No access to perf events"); always test with -e itimer first.
Kubernetes CPU throttling can inflate CPU profiles; cross-check with kubectl top pod.
JFR is container-friendly but captures only the container's perspective — not host contention.
Check cat /sys/fs/cgroup/cpu/cpu.stat for throttled time.
Automate JFR dump collection to object storage — you'll need it during incidents.
Key Takeaway
Always verify profiler permissions inside containers before production emergencies.
Use JFR for safe container profiling; fall back to async-profiler's itimer mode if perf_events is blocked.
Never trust a container CPU flame graph without checking the container's CPU throttling metrics.
When in doubt, start with JFR — it always works in containers.
Automate JFR storage: cron job dump to object storage for post-incident analysis.
Container Profiling Route
If: async-profiler fails with 'No access to perf events'
Use: -e itimer, or switch to JFR.
If: Container CPU throttling suspected
Use: Cross-check with kubectl top pod and node-level metrics.
If: Need host-level view of CPU/memory contention
Use: Node-level perf top or cgroup stats on the host.

Advanced Heap Dump Analysis with OQL

Eclipse MAT's Leak Suspects report is great for a first pass, but sometimes you need surgical precision. OQL (Object Query Language) is a SQL-like query language for heap dumps that lets you find specific objects, count instances, explore references, and even compute retained sizes programmatically.

OQL is available in Eclipse MAT and also in VisualVM (jhat, which also offered it, was removed in JDK 9). The queries in this section use MAT's dialect. In MAT, open the heap dump, then click the 'OQL' tab. Common queries:

  • SELECT * FROM java.util.HashMap: lists all HashMap instances (useful for caching leaks).
  • SELECT toString(s), s.@retainedHeapSize FROM java.lang.String s WHERE s.@retainedHeapSize > 100000: finds large strings that might be eating memory. (@usedHeapSize is only the shallow size, which excludes the backing array.)
  • SELECT * FROM io.thecodeforge.service.MyService s WHERE s.cache.@retainedHeapSize > 500000000: checks if a specific service's cache retains more than 500MB.
  • SELECT DISTINCT OBJECTS classof(t) FROM INSTANCEOF java.lang.Thread t: lists the distinct Thread subclasses present in the dump (a quick way to see which kind of thread is piling up).

Production scenario: A team noticed heap growing slowly but the Leak Suspects report gave vague results. They ran an OQL query to find all objects of a specific logger class that had accumulated millions of entries due to a missing TTL. The query result showed the exact count, and the path to GC roots showed what was retaining them, leading to the fix within minutes.

OQL also supports path expressions: SELECT OBJECTS a FROM INSTANCEOF java.lang.ref.Finalizer a — shows all finalizable objects, a notorious source of delayed memory leaks.

Important: OQL queries can be slow on large dumps. Always filter with WHERE clauses and avoid unconstrained SELECT * on huge classes.

You can also use OQL to read retained sizes without manually navigating the dominator tree. The @retainedHeapSize attribute is your friend; @usedHeapSize gives only the shallow size of the object itself.

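Here's an OQL trick that's saved me hours: SELECT * FROM java.util.HashMap$Node e WHERE e.@retainedHeapSize > 1000000. This finds individual map entries with large retained heaps: the exact objects you need to evict. Query the concrete entry class, because heap dumps do not record implemented interfaces and a pattern like INSTANCEOF java.util.Map$Entry is unlikely to match anything. Most leak investigations should start here, not with the Leak Suspects report.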

One more advanced use: MAT's OQL accepts a regular expression in place of a class name in the FROM clause. For example, to find objects whose class name ends in Cache: SELECT * FROM ".*Cache". This catches multiple cache implementations without spelling each one out.


One limit to know: MAT's OQL has no GROUP BY, ORDER BY, or aggregate functions, so a ranked list of classes by total retained heap cannot come from a query. Use the Histogram view for that: sort by Retained Heap, and run 'Calculate Precise Retained Size' on the top rows when the estimates are too coarse.

A trick: Use SELECT * FROM INSTANCEOF java.lang.ThreadLocal to find ThreadLocal instances that may hold large objects.

heap_oql_queries.txtSQL
-- Find all HashMap instances with their entry count and retained size
SELECT map.size, map.@retainedHeapSize FROM java.util.HashMap map

-- Find strings retaining more than 100KB (@usedHeapSize would be shallow size only)
SELECT toString(s), s.@retainedHeapSize FROM java.lang.String s WHERE s.@retainedHeapSize > 100000

-- Find all instances of a specific class and see which objects reference them
SELECT OBJECTS inbounds(c) FROM io.thecodeforge.cache.SessionCache c

-- List all finalizable objects (potential leak via finalize())
SELECT * FROM INSTANCEOF java.lang.ref.Finalizer

-- Ranking classes by total retained heap: MAT's OQL has no GROUP BY or sum(),
-- so use the Histogram view sorted by Retained Heap for this

-- Find large map entries (start here for leak investigation)
SELECT * FROM java.util.HashMap$Node e WHERE e.@retainedHeapSize > 1000000
OQL for Thread Leak Detection
Run SELECT DISTINCT OBJECTS classof(t) FROM INSTANCEOF java.lang.Thread t WHERE t.@retainedHeapSize > 1000000 to find the thread classes whose instances retain significant heap, a sign of thread-local accumulation.
Production Insight
OQL can pinpoint leaks that automated reports miss, especially when the leak is spread across many small objects.
Always add WHERE conditions to limit scope; unconstrained queries can take minutes on large dumps.
Use @retainedHeapSize to get the retained size directly, avoiding manual dominator tree navigation.
Start with HashMap$Node queries to find the largest individual entries.
For a ranked list of top-consuming classes, use the Histogram view; MAT's OQL cannot aggregate.
Always combine heap analysis with allocation profiling.
Key Takeaway
OQL is your scalpel for heap analysis when automated reports aren't enough.
Learn 5 essential queries: find instances, read retained size, trace references, list finalizers, and match classes by pattern.
Always filter large results: unconstrained queries are slow and produce noise.
The @retainedHeapSize attribute is faster than navigating the dominator tree manually.
Use the Histogram view to get a ranked list of top memory consumers.
When to Use OQL vs MAT Leak Suspects
If: Leak Suspects report is ambiguous or shows many suspects
Use: Run OQL to find large objects of known suspect classes (e.g., HashMap, String).
If: Suspect thread leak (many threads or thread locals)
Use: OQL to query for Thread instances with large retained heap.
If: Want to compute retained size of a specific service or component
Use: OQL with @retainedHeapSize on that class's instances.

Performance Tuning Workflow: A Step-by-Step Production Example

Now let's put everything together with a real workflow. Suppose you have a payment processing service that's showing 99th percentile latency of 2 seconds during peak hours. Here's the exact sequence:

  1. Start JFR recording on one canary instance. Use the profile template for 5 minutes. This captures GC events, allocation rates, thread CPU, and lock contention.
  2. Load the JFR dump in JDK Mission Control. Go to the 'GC Pauses' view. If you see pauses >100ms, you have a GC problem. If the allocation rate is >500MB/s, you have an allocation problem.
  3. If GC is not the dominant issue, attach async-profiler for a wall-clock sample. Look for wide frames at the top. If you see java.net.SocketInputStream.socketRead0 wide, the service is waiting on network I/O.
  4. If allocation is high, run ./profiler.sh -e alloc -d 30 -f alloc.html <pid>. The allocation flame graph will show which call sites create the most objects.
  5. After identifying the hot spot, implement the fix (e.g., cache, pool, reduce object creation).
  6. Redeploy the canary and repeat steps 1-2 with the same JFR settings. Compare the new recording with the baseline.

This loop — profile, diagnose, fix, verify — is the only reliable way to tune performance. Guessing leads to wasted sprints.
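The verify step comes down to comparing the same statistic before and after. Here is a minimal sketch, assuming you have exported request latencies in milliseconds from the baseline run and the candidate run. The class name `BeforeAfter`, the sample values, and the 100 ms minimum gain are all made up for illustration.

```java
import java.util.Arrays;

/** Hypothetical helper for the "verify" step: did p99 latency actually improve? */
final class BeforeAfter {

    /** Nearest-rank percentile (p in (0, 100]) of latency samples in milliseconds. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True when the candidate's p99 is lower than the baseline's by at least minGainMs. */
    static boolean improved(long[] baselineMs, long[] candidateMs, long minGainMs) {
        return percentile(baselineMs, 99) - percentile(candidateMs, 99) >= minGainMs;
    }

    public static void main(String[] args) {
        long[] baseline = {120, 130, 150, 900, 2000};  // made-up sample data
        long[] candidate = {110, 115, 140, 300, 450};  // made-up sample data
        System.out.println("baseline p99 = " + percentile(baseline, 99) + " ms");
        System.out.println("candidate p99 = " + percentile(candidate, 99) + " ms");
        System.out.println("keep the change? " + improved(baseline, candidate, 100));
    }
}
```

Requiring a minimum gain keeps you from shipping changes whose effect is within run-to-run noise.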

Here's a Java example of a method that's a common allocation hotspot and its fix:

```java
// Before: allocates a new String on every call
public static String before(String prefix, int id) {
    return prefix + ":" + id;
}

// String.format is not the fix: it creates even more objects.
// After: pass the parts directly into the caller's buffer,
// so no intermediate String is created
public static void after(StringBuilder out, String prefix, int id) {
    out.append(prefix).append(':').append(id);
}
```

Remember: always measure before and after. A change that looks smart on paper may not move the needle.

In practice, 80% of improvements come from the first two hot spots. Once you've addressed those, diminishing returns set in fast. Stop when you meet SLO — don't over-optimise.

One more production lesson: never trust a microbenchmark. The way your code runs in a JMH harness is completely different from how it runs under real load with GC, JIT warmup, and memory pressure. Always validate optimizations in production with the full profiling pipeline. If a change shows no improvement in the JFR comparison, revert it. Code complexity without performance gain is a net negative.

A story: One team spent a week optimizing a method that accounted for 5% of CPU; they missed the real bottleneck in JDBC pooling. Profiling first would have saved them days. Another team over-optimized by adding complex caching that caused memory pressure. They had to revert. The lesson: stop when SLO is met.

io/thecodeforge/performance/AllocationFix.javaJAVA
package io.thecodeforge.performance;

public class AllocationFix {
    // Before: allocates a new String on every call.
    // (Java 8 javac emits new StringBuilder().append()...; Java 9+ uses invokedynamic,
    // but a result String is still allocated each time.)
    public static String before(String prefix, int id) {
        return prefix + ":" + id;
    }

    // After: pass the parts directly into the caller's buffer,
    // so no intermediate String is created
    public static void after(StringBuilder out, String prefix, int id) {
        out.append(prefix).append(':').append(id);
    }

    // Not a fix: String.format looks tidy but allocates far more
    private static final String TEMPLATE = "%s:%d";
    public static String withFormat(String prefix, int id) {
        return String.format(TEMPLATE, prefix, id);
        // Caution: String.format creates a Formatter and many objects
        // Always profile to see if this is actually faster
    }
}
The Profiling Loop
  • Step 1: Start with JFR (default or profile) for 5 minutes on a canary.
  • Step 2: Analyze the recording — find the top 3 hot spots by CPU, allocation, or latency.
  • Step 3: Implement one change per iteration; never batch multiple optimizations.
  • Step 4: Redeploy the canary and profile again with the same settings.
  • Step 5: Compare before/after recordings. Did the hot spot shrink?
Production Insight
Never make more than one performance change per deployment cycle.
Always run a before-and-after profiling session with identical settings.
Most performance gains (80%) come from fixing the top 2 hot spots.
The rest is diminishing returns: stop when you meet the SLO, because further optimization adds complexity without benefit.
Never trust microbenchmarks; validate in production with real load.
Key Takeaway
Performance tuning is a four-step loop: profile, diagnose, fix, verify.
Never batch changes — you won't know what worked.
80% of improvement comes from the first two hot spots.
When latency meets SLO, stop — don't over-optimize.
Validate every change in production with before/after JFR recordings.
Tuning Iteration Decision
If: After fix, hot spot moved elsewhere
Use: The bottleneck shifted; fix the new hot spot.
If: After fix, latency improved but still above SLO
Use: Continue the loop; fix next hot spot.
If: After fix, latency within SLO
Use: Stop. Document the change and move on.

G1 GC Concurrent Mode Failure: Diagnosis and Tuning

Concurrent mode failure is G1 GC's worst-case scenario. It happens when the concurrent marking phase cannot finish before the old generation fills up. The JVM then falls back to a stop-the-world (STW) full GC, which compacts the entire heap and can pause application threads for seconds to tens of seconds.

The root cause is usually that old-generation occupancy grows faster than the concurrent marker can work. G1 starts a concurrent marking cycle once occupancy passes the threshold set by -XX:InitiatingHeapOccupancyPercent (IHOP, default 45%; on JDK 9+ this is only the starting value, because adaptive IHOP adjusts it at runtime unless you pass -XX:-G1UseAdaptiveIHOP). Marking then runs in the background while the application keeps allocating. If the application fills the remaining free regions before the cycle completes and mixed collections can reclaim space, concurrent mode failure occurs.

Diagnosis
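  • GC logs will show 'to-space exhausted' on a young or mixed pause, often a 'Concurrent Mark Abort', and then a full GC ('Pause Full (G1 Compaction Pause)' or 'Pause Full (G1 Evacuation Pause)' with unified logging; 'Full GC (Allocation Failure)' on JDK 8). The literal phrase 'concurrent mode failure' is printed by CMS, not by G1.
  • JFR recordings will show a jdk.GarbageCollection event with collector name G1Full and a pause duration in seconds.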
  • The allocation rate in the recording will be high, often >500MB/s.

Tuning knobs
  1. Reduce IHOP: Lower -XX:InitiatingHeapOccupancyPercent to start concurrent marking earlier. Values between 30% and 40% are common. This gives the marker more time to finish before the heap fills.
  2. Increase heap size: More heap means more runway before saturation. If you have 4GB heap and allocate 500MB/s, the heap fills in ~8 seconds. With 8GB, you get ~16 seconds.
  3. Increase heap region size (-XX:G1HeapRegionSize): Larger regions reduce the marking bitmap overhead and can speed up concurrent marking. For heaps >16GB, try 16MB or 32MB.
  4. Increase the number of concurrent marking threads (-XX:ConcGCThreads): Default is often (ParallelGCThreads + 2) / 4. You can increase it, but be aware it steals CPU from application threads.
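To see what the ConcGCThreads default works out to on your hardware, here is a small worked calculation. The ConcGCThreads formula is the one quoted above; the ParallelGCThreads heuristic (all CPUs up to 8, then 5/8 of each additional CPU) is my understanding of HotSpot's default and is an assumption here, so confirm the real values with java -XX:+PrintFlagsFinal -version. Inside a container, the CPU count the JVM sees follows the container's CPU limit.

```java
/** Worked arithmetic for default GC thread counts (sketch, not read from a live JVM). */
final class GcThreadDefaults {

    /** Assumed HotSpot heuristic: N threads up to 8 CPUs, then 5/8 of each extra CPU. */
    static int parallelGcThreads(int cpus) {
        return cpus <= 8 ? cpus : 8 + (cpus - 8) * 5 / 8;
    }

    /** G1's default ConcGCThreads: (ParallelGCThreads + 2) / 4, but at least 1. */
    static int concGcThreads(int parallelGcThreads) {
        return Math.max((parallelGcThreads + 2) / 4, 1);
    }

    public static void main(String[] args) {
        for (int cpus : new int[] {2, 4, 8, 16, 32}) {
            int parallel = parallelGcThreads(cpus);
            System.out.println(cpus + " CPUs -> ParallelGCThreads=" + parallel
                    + ", ConcGCThreads=" + concGcThreads(parallel));
        }
    }
}
```

On small containers the default often comes out as a single marking thread, which is one reason marking can fall behind there.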

Production example: In the incident described earlier, the service had 4GB heap with default IHOP=45 and allocation rate ~300MB/s. The concurrent marking took ~8 seconds, but the heap filled in ~10 seconds. That 2-second gap was too tight. By reducing IHOP to 30, the concurrent cycle started earlier, and by increasing heap to 8GB, the filling time doubled to ~20 seconds, giving the marker plenty of room.
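The race between the marker and the allocator can be sketched as simple arithmetic. The model below is deliberately crude: it assumes a constant net fill rate from the moment occupancy hits IHOP and ignores space reclaimed by young collections, so its numbers will not match the measured figures from the incident exactly. The class name and the 8-second marking time are assumptions taken from the example above.

```java
/** Back-of-the-envelope model: does concurrent marking finish before the heap fills? */
final class MarkingHeadroom {

    /** Seconds until the heap is full, starting at the IHOP threshold, at a constant net fill rate. */
    static double secondsToFill(double heapMb, int ihopPercent, double fillRateMbPerSec) {
        double freeAtTriggerMb = heapMb * (100 - ihopPercent) / 100.0;
        return freeAtTriggerMb / fillRateMbPerSec;
    }

    /** Positive: marking finishes with time to spare. Negative: risk of falling back to a full GC. */
    static double marginSeconds(double heapMb, int ihopPercent,
                                double fillRateMbPerSec, double markingSeconds) {
        return secondsToFill(heapMb, ihopPercent, fillRateMbPerSec) - markingSeconds;
    }

    public static void main(String[] args) {
        double rate = 300;   // MB/s, net fill rate (assumed constant)
        double marking = 8;  // seconds for one concurrent marking cycle (assumed)
        System.out.printf("4 GB, IHOP 45: margin %.1f s%n", marginSeconds(4096, 45, rate, marking));
        System.out.printf("8 GB, IHOP 30: margin %.1f s%n", marginSeconds(8192, 30, rate, marking));
    }
}
```

The useful part is the sign and size of the margin: a margin near zero under peak load means the configuration is one traffic spike away from a full GC.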

Prevention
  • Always monitor allocation rates along with GC logs. A sudden increase in allocation rate is an early warning.
  • Set up alerts on full GC events ('Pause Full' in G1 logs, or a G1Full collection in JFR).
  • Use JFR's streaming API to trigger an alert if the allocation rate exceeds a threshold (e.g., 400MB/s) for more than 10 seconds.
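The streaming alert from the last bullet could look roughly like this. It is a sketch under several assumptions: it runs inside the monitored JVM, it uses jdk.ObjectAllocationSample (JDK 16+), whose weight field is a sampled estimate of allocated bytes rather than an exact count, and it borrows the periodic jdk.CPULoad event as a one-second tick. The class name and the println alert are placeholders for your own alerting hook.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicLong;
import jdk.jfr.consumer.RecordingStream;

/** Hypothetical alert: fires when the allocation rate stays above a threshold for too long. */
final class AllocationRateAlert {
    private final double thresholdMbPerSec;
    private final int sustainSeconds;
    private int secondsAbove;

    AllocationRateAlert(double thresholdMbPerSec, int sustainSeconds) {
        this.thresholdMbPerSec = thresholdMbPerSec;
        this.sustainSeconds = sustainSeconds;
    }

    /** Feed the bytes allocated in one second; true once the rate has been high for more than sustainSeconds. */
    boolean recordSecond(long bytesAllocated) {
        double mbPerSec = bytesAllocated / (1024.0 * 1024.0);
        secondsAbove = mbPerSec > thresholdMbPerSec ? secondsAbove + 1 : 0;
        return secondsAbove > sustainSeconds;
    }

    public static void main(String[] args) {
        AllocationRateAlert alert = new AllocationRateAlert(400, 10);
        AtomicLong bytesThisSecond = new AtomicLong();
        try (RecordingStream rs = new RecordingStream()) {
            // Sum the sampled allocation weights as an estimate of bytes allocated
            rs.enable("jdk.ObjectAllocationSample");
            rs.onEvent("jdk.ObjectAllocationSample",
                    e -> bytesThisSecond.addAndGet(e.getLong("weight")));
            // Use a once-per-second periodic event as the tick that closes each bucket
            rs.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(1));
            rs.onEvent("jdk.CPULoad", e -> {
                if (alert.recordSecond(bytesThisSecond.getAndSet(0))) {
                    System.err.println("Allocation rate above 400 MB/s for more than 10 s");
                }
            });
            rs.start(); // blocks until the stream is closed
        }
    }
}
```

Keeping the threshold logic in a plain method makes it testable without JFR; in a real service you would start the stream on a background thread with startAsync() instead of blocking main.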

Trade-off: Reducing IHOP means G1 spends more CPU on concurrent marking, which can reduce application throughput by 5-10% during marking. For most services, this is an acceptable trade-off to avoid STW pauses.

Remember: if you're seeing concurrent mode failure, the first thing to check is allocation rate. If it's >500MB/s, optimise allocation before tuning GC. The tuning buys time, but reducing allocation rate is the permanent fix.

g1_concurrent_tuning.shBASH
# Check GC logs for the full-GC fallback (G1 does not print "Concurrent Mode Failure")
grep -E 'to-space exhausted|Concurrent Mark Abort|Pause Full' gc.log

# JFR recording to capture GC events
jcmd <pid> JFR.start name=g1profile duration=120s filename=g1_profile.jfr settings=profile

# Recommended flags after diagnosing CMF (add these to the java command line)
-XX:+UseG1GC
-XX:InitiatingHeapOccupancyPercent=30
-Xms8g -Xmx8g
-XX:G1HeapRegionSize=16m
-XX:+UnlockExperimentalVMOptions
-XX:G1NewSizePercent=5
-XX:MaxGCPauseMillis=100

# Inspect GC and allocation events once the recording has finished
jfr print --events GarbageCollection g1_profile.jfr
jfr print --events ObjectAllocationSample g1_profile.jfr   # event exists on JDK 16+
Allocation Rate First
Before tuning IHOP or heap size, check the allocation rate using JFR or async-profiler alloc mode. If your allocation rate exceeds 500MB/s, no amount of GC tuning will eliminate concurrent mode failure permanently — you must reduce allocation at the source.
Production Insight
Concurrent mode failure is almost always a symptom of an allocation rate that outruns the concurrent marker.
Reducing IHOP to 30 gives the marker a head start but costs 5-10% CPU during marking.
Increasing heap size is a band-aid; the real fix is reducing allocation rate.
Monitor trends: a gradual rise in post-GC heap occupancy over weeks indicates a leak, not just a tuning issue.
Test IHOP changes under peak load; a setting that works at 5k req/min may fail at 10k.
Concurrent mode failure is preventable with proactive allocation rate monitoring.
Key Takeaway
Concurrent mode failure means G1 can't keep up — reduce IHOP and/or increase heap.
Allocation rate is the root cause; fix it before relying on tuning.
Monitor GC logs and allocation rate together to catch CMF early.
Prevention: set up JFR streaming alerts on allocation rate spikes.
Reducing IHOP is a trade-off: lower pause risk for higher CPU usage during marking.
Concurrent Mode Failure Action Plan
If: Allocation rate < 500MB/s and heap not near max
Use: Reduce IHOP to 30 and increase heap by 50%.
If: Allocation rate > 500MB/s
Use: Profile allocation sources first; fix code before tuning GC.
If: After tuning, still seeing CMF
Use: Consider switching to ZGC or Shenandoah, which do almost all of their work concurrently and keep pauses in the low-millisecond range or below.

Why YourKit and JProfiler Still Matter in the async-profiler Era

You've seen the flame graphs. JFR gives you a firehose of metrics. So why would anyone drop $500 on a commercial profiler in 2025?

Because GUI profilers solve problems that sampling tools can't touch. When you're chasing a deadlock that only manifests under load, or you need to walk the allocation stack of a single request through 200 threads, clicking around a timeline beats grepping JFR dumps every time.

YourKit and JProfiler aren't 'alternatives' to async-profiler; they're complementary weapons. YourKit's CPU recording with call-tracing mode (not just sampling) catches low-frequency methods that flame graphs miss. JProfiler's heap walker lets you filter by classloader, which is the only sane way to find classloader leaks in legacy apps (PermGen on Java 7 and earlier, Metaspace since Java 8).

The trap? Running their full instrumentation in production will tank your throughput. Use them as surgical tools: attach to a staging box, profile the specific endpoint for 60 seconds, then detach. Never leave profiler agent JARs on your production classpath.

AttachYourKitRemote.javaJAVA
// io.thecodeforge — java tutorial

import com.yourkit.api.*;

public class AttachYourKitRemote {
    public static void main(String[] args) throws Exception {
        // Connect to remote JVM on port 10001
        Controller controller = new Controller();
        controller.attach("myapp.example.com", 10001);
        
        // Start CPU profiling with call tracing (not just sampling)
        controller.startCPUProfiling(ProfilingModes.CPU_TRACING, null);
        
        Thread.sleep(30_000); // profile for 30 seconds
        
        String snapshotPath = controller.captureSnapshot();
        controller.stopCPUProfiling();
        System.out.println("Snapshot saved: " + snapshotPath);
    }
}
Output
Snapshot saved: /home/appuser/.yourkit/snapshots/MyApp-2025-03-15.snapshot
Production Trap:
YourKit's default attach mode uses JVMTI agent that can pause your JVM for 200-500ms during CPU data capture. Always test with synthetic load first — or limit profiling windows to 30 seconds max.
Key Takeaway
Use commercial profilers for interactive debugging of concurrency bugs and heap walks; keep async-profiler for always-on production monitoring.

IntelliJ Profiler: When You're Too Lazy to Learn Another Tool

Let's be honest: half of you reading this are already inside IntelliJ IDEA Ultimate. The built-in profiler is not your fancy async-profiler flame graph setup, but it's zero-config and good enough for 80% of local debugging.

The secret most devs miss: IntelliJ's profiler can snapshot thread contention. Open the 'Threads' tab, click 'Snapshot Threads', and you'll see exactly which synchronized blocks are causing convoying. This is pure gold for diagnosing 'my app went from 50ms to 2s after adding a shared HashMap' questions.

It also supports CPU sampling and allocation recording without restarting. Right-click a test method → 'Profile with Async Profiler' → you get a lightweight flame graph in 10 seconds. No command-line flags.

Where it falls apart: heavy GC analysis and container profiling. IntelliJ's agent can freak out when the JVM runs with -XX:+UseContainerSupport inside Docker. The workaround? Use Docker's host networking mode for profiling sessions, or fall back to JFR for containerized workloads.

ThreadContentionSniffer.javaJAVA
// io.thecodeforge — java tutorial

public class ThreadContentionSniffer {
    private final Object lock = new Object();
    
    public void doWork() throws InterruptedException {
        synchronized (lock) {
            // Simulate a slow synchronized block
            Thread.sleep(200);
            computeExpensiveResult();
        }
    }
    
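    private double computeExpensiveResult() {
        double sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            // Accumulate the result: a bare expression is not a valid Java statement
            sum += Math.sin(i) * Math.cos(i);
        }
        return sum;
    }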
    
    public static void main(String[] args) throws Exception {
        ThreadContentionSniffer sniffer = new ThreadContentionSniffer();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < 4; i++) {
            threads[i] = new Thread(() -> {
                try {
                    sniffer.doWork();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
    }
}
Output
// IntelliJ Thread Snapshot reveals:
// Blocked threads: 3
// Owner: one of the four worker threads (whichever entered the block first)
// Blocked on: doWork(), at synchronized (lock)
Senior Shortcut:
Before diving into async-profiler, run IntelliJ's 'CPU Profiler' on your failing integration test. It often pinpoints the exact method causing latency without any CLI setup. If the flame graph looks flat (no obvious hotspot), then escalate to JFR with event-based profiling.
Key Takeaway
IntelliJ Profiler is the fastest way to identify thread contention and local CPU hotspots — use it for initial triage, not final production diagnosis.

VisualVM: The Underdog for Post-Mortem Analysis

VisualVM is the 'free tier' that refuses to die. Oracle stopped bundling it with the JDK in version 9, and it lives on as a standalone open-source project on GitHub. It's ugly. The UI looks like a Java Swing app from 2003. But it's the best tool for heap dump analysis without paying for YourKit.

Here's the workflow nobody tells you: when your app OOMs at 3 AM, dump the heap with jmap -dump:live,file=/tmp/heap.hprof <pid>. Open it in VisualVM. Click 'Summary' → 'Find' → enter class names like 'byte[]' or 'String'. If you see gigabytes of char[] arrays (byte[] on JDK 9+, where compact strings back String with byte[]), suspect retained string data such as a JSON parsing leak. If you see millions of java.util.HashMap$Node instances, suspect a custom key without consistent equals() and hashCode().

VisualVM also has a built-in sampler that works on remote JVMs via JMX. It's not safe for production (adds ~5% overhead), but it's perfect for staging environments where you can't install async-profiler.

The killer feature: 'Monitor' tab shows live GC activity, heap usage, and thread count with zero configuration. Attach it to a local JVM, watch for sawtooth patterns in heap usage — if your heap graph looks like a mountain range, you've got a survivor space promotion problem.

dumpHeapForAnalysis.javaJAVA
// io.thecodeforge — java tutorial

import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class dumpHeapForAnalysis {
    public static void dumpHeap(String filePath, boolean live) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory
            .getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(filePath, live);
        System.out.println("Heap dumped to: " + filePath);
    }
    
    public static void main(String[] args) throws Exception {
        // Trigger dump when memory reaches 80%
        Runtime runtime = Runtime.getRuntime();
        long usedMemory = runtime.totalMemory() - runtime.freeMemory();
        long maxMemory = runtime.maxMemory();
        if (usedMemory > maxMemory * 0.8) {
            dumpHeap("/tmp/critical_heap.hprof", true);
        }
    }
}
Output
Heap dumped to: /tmp/critical_heap.hprof
// VisualVM OQL query: select s.toString() from java.lang.String s where s.toString().startsWith("org/apache")
// Finds all strings in the heap whose value starts with "org/apache"
Legacy App Lifesaver:
VisualVM's 'Sampler' tab can profile remote JVMs via JMX without agent deployment. Set up a JMX port in your Docker container and connect from your local VisualVM — works through firewalls if you expose the right ports.
Key Takeaway
VisualVM is the 'swiss army knife' for heap dump analysis — master OQL queries and you can diagnose any memory leak without a commercial license.

JFR Event Streaming: Ditch the Dump Files

Stop dumping JFR recordings to disk and hoping you catch the issue. Real production profiling means reacting to events as they happen, not sifting through 20GB of historical data hours later. JFR Event Streaming lets you subscribe to specific events—like GC pauses, allocation stalls, or lock contention—and act on them in real time.

Why this matters: A 100ms GC pause every 30 seconds is background noise. A 500ms pause every 5 seconds is a pager alert. With event streaming, you can trigger an async-profiler capture the moment a metric crosses your threshold. No more blind dumps. No more guessing if the problem is still happening.
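The difference between those two cases is easier to see as a share of wall-clock time. A minimal sketch of that arithmetic follows; the class name `PauseBudget` and the 5% paging threshold are assumptions for illustration, not a standard.

```java
import java.time.Duration;

/** Worked arithmetic: what share of wall-clock time do recurring pauses consume? */
final class PauseBudget {

    /** Fraction of time lost to pauses of the given length recurring at the given interval. */
    static double overhead(Duration pause, Duration interval) {
        return (double) pause.toNanos() / interval.toNanos();
    }

    /** Hypothetical policy: page when pauses eat more than maxOverhead of wall-clock time. */
    static boolean shouldPage(Duration pause, Duration interval, double maxOverhead) {
        return overhead(pause, interval) > maxOverhead;
    }

    public static void main(String[] args) {
        System.out.printf("100 ms every 30 s: %.2f%% of time paused%n",
                100 * overhead(Duration.ofMillis(100), Duration.ofSeconds(30)));
        System.out.printf("500 ms every 5 s: %.2f%% of time paused%n",
                100 * overhead(Duration.ofMillis(500), Duration.ofSeconds(5)));
    }
}
```

The first case costs about a third of a percent of wall-clock time, the second a full tenth, which is why only the second deserves a page.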

The HOW is simple: use the jdk.jfr.consumer package, register a handler for jdk.GarbageCollection or jdk.AllocationRequiringGC, and pipe the data to your monitoring stack. Your production incident playbook just got faster.

JFREventStreaming.javaJAVA
// io.thecodeforge — java tutorial

import java.time.Duration;
import jdk.jfr.consumer.RecordingStream;

public class JFREventStreaming {
    public static void main(String[] args) {
        try (var rs = new RecordingStream()) {
            // jdk.GarbageCollection emits one event per collection, including its longest pause
            rs.enable("jdk.GarbageCollection").withThreshold(Duration.ofMillis(10));
            rs.onEvent("jdk.GarbageCollection", event -> {
                Duration pause = event.getDuration("longestPause");
                if (pause.toMillis() > 200) {
                    System.err.println("⚠️ GC pause exceeded 200ms: "
                        + pause.toMillis() + "ms");
                    // Trigger async-profiler or alert here
                }
            });
            rs.start(); // blocks until the stream is closed
        }
    }
}
Output
⚠️ GC pause exceeded 200ms: 342ms
⚠️ GC pause exceeded 200ms: 511ms
Production Trap:
Don't enable too many events in streaming mode. Filter aggressively: only subscribe to events with withThreshold() or withPeriod() to avoid burning CPU on event dispatch. Event streaming is not free.
Key Takeaway
JFR Streaming turns historical profiling into real-time observability. React, don't reconstruct.
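The threshold reasoning earlier in this section (a 100ms pause every 30 seconds is noise, a 500ms pause every 5 seconds is an alert) can be sketched as a sliding-window pause budget. This is a minimal illustration, not part of JFR: the `PauseBudget` class, the 30-second window, and the 5% budget are all hypothetical choices made for the example.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;

/** Tracks GC pauses in a sliding time window and flags when they exceed a budget. */
public class PauseBudget {
    private record Pause(long endMillis, long durationMillis) {}

    private final long windowMillis;
    private final double maxPauseFraction;
    private final Deque<Pause> pauses = new ArrayDeque<>();
    private long pausedMillisInWindow = 0;

    public PauseBudget(Duration window, double maxPauseFraction) {
        this.windowMillis = window.toMillis();
        this.maxPauseFraction = maxPauseFraction;
    }

    /** Records one pause; returns true if the window's pause fraction now exceeds the budget. */
    public synchronized boolean record(long endMillis, long durationMillis) {
        pauses.addLast(new Pause(endMillis, durationMillis));
        pausedMillisInWindow += durationMillis;
        // Evict pauses that ended at or before the start of the window
        while (!pauses.isEmpty() && pauses.peekFirst().endMillis() <= endMillis - windowMillis) {
            pausedMillisInWindow -= pauses.removeFirst().durationMillis();
        }
        return (double) pausedMillisInWindow / windowMillis > maxPauseFraction;
    }

    public static void main(String[] args) {
        // Hypothetical budget: alert when more than 5% of a 30 s window is spent paused
        PauseBudget noise = new PauseBudget(Duration.ofSeconds(30), 0.05);
        boolean noisy = false;
        for (long t = 0; t <= 120_000; t += 30_000) noisy |= noise.record(t, 100);
        System.out.println("100 ms every 30 s alerts: " + noisy);   // false

        PauseBudget alert = new PauseBudget(Duration.ofSeconds(30), 0.05);
        boolean paging = false;
        for (long t = 0; t <= 120_000; t += 5_000) paging |= alert.record(t, 500);
        System.out.println("500 ms every 5 s alerts: " + paging);   // true
    }
}
```

Inside a streaming handler, one way to feed it would be `budget.record(event.getEndTime().toEpochMilli(), event.getDuration().toMillis())`, triggering the profiler capture only when `record` returns true.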

Off-Heap Memory: The Invisible Leak

Your heap is clean. No GC pressure. No obvious leaks. Yet the container RSS keeps climbing until OOMKill. Welcome to off-heap memory leaks—the silent killer that most profilers ignore by default. DirectByteBuffers, mapped files, JNI native allocations, and Netty's buffer pools all live outside the heap.

Why you care: Heap dumps won't show these. JFR won't report them unless you enable Native Memory Tracking and the jdk.NativeMemoryUsage events (JDK 20+). And -Xmx only caps the Java heap: direct buffers are limited separately by -XX:MaxDirectMemorySize, and native allocations made through JNI are not capped by the JVM at all. The worst part? Off-heap leaks often surface hours or days after deploy, mimicking a slow heap leak.

The fix: Enable Native Memory Tracking (NMT) with -XX:NativeMemoryTracking=summary or -XX:NativeMemoryTracking=detail. Use jcmd <pid> VM.native_memory baseline followed by jcmd <pid> VM.native_memory summary.diff to compare allocations over time. For DirectByteBuffer analysis, read the 'direct' pool from java.lang.management.BufferPoolMXBean, which reports buffer count and bytes used. If you see a steady climb in NMT's 'Internal' or 'Other' categories, you've found your ghost.

OffHeapDetect.javaJAVA
// io.thecodeforge — java tutorial

import java.nio.ByteBuffer;

public class OffHeapDetect {
    // Simulate a leak: allocate direct buffers without releasing
    static final java.util.List<ByteBuffer> leak = new java.util.ArrayList<>();
    
    public static void main(String[] args) throws Exception {
        System.out.println("Allocating direct buffers...");
        while (true) {
            leak.add(ByteBuffer.allocateDirect(1024 * 1024)); // 1 MB
            System.out.print(".");
            Thread.sleep(10);
        }
    }
}
Output
Allocating direct buffers...
..........................
# JVM will eventually throw:
# Exception in thread "main" java.lang.OutOfMemoryError: Direct buffer memory
Senior Shortcut:
Run jcmd <pid> VM.native_memory baseline on deploy, then jcmd <pid> VM.native_memory summary.diff after 24 hours. If any category grew more than 10%, investigate immediately. NMT is not free (Oracle's documentation cites a 5-10% performance cost), so measure it under your own load and prefer summary mode in production.
Key Takeaway
If your heap is fine but your container is bleeding memory, the leak is off-heap. NMT is your first weapon.
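Direct buffer usage can also be read from inside the JVM without NMT, through the standard `BufferPoolMXBean`. The sketch below is illustrative: the class name `DirectBufferStats` and the helper `directMemoryUsed` are hypothetical, and it assumes a JVM that exposes a buffer pool named "direct", as HotSpot does.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

public class DirectBufferStats {
    /** Returns bytes currently held by the JVM's "direct" buffer pool, or -1 if not found. */
    static long directMemoryUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directMemoryUsed();
        // 1 MB outside the Java heap; it stays allocated while the buffer is reachable
        ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024);
        long after = directMemoryUsed();
        System.out.println("direct pool before: " + before + " bytes, after: " + after + " bytes");
        System.out.println("grew by at least the buffer size: " + (after - before >= buffer.capacity()));
    }
}
```

Exporting this value as a gauge next to heap metrics makes a climbing direct pool visible long before the container is OOMKilled.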

Wrap Up

Profiling Java applications in production means balancing insight against stability. This article covered the full spectrum, from production-safe techniques like JFR Event Streaming to off-heap memory hunting and G1 GC tuning. The key lesson is that profiling isn't a single tool or a one-time activity; it's a mindset. Start with flame graphs from async-profiler to spot bottlenecks, drill into heap dumps with OQL to find memory leaks, and then tune GC parameters iteratively. Containers add complexity: Docker and Kubernetes resource limits change the defaults the JVM picks for heap size and GC threads, so always verify the effective flags with -XX:+PrintFlagsFinal inside the container. Tools like YourKit and JProfiler still excel at deep object navigation, while IntelliJ Profiler and VisualVM serve different niches: quick IDE-based checks and post-mortems, respectively. The golden rule: never profile production with a heavy tool. async-profiler, JFR, or lightweight commercial agents are your friends. Everything else belongs in staging.

ProfileChecklist.javaJAVA
// io.thecodeforge — java tutorial
// Final profiler check before production
import java.lang.management.*;
public class ProfileChecklist {
    public static void main(String[] args) {
        ThreadMXBean threadMX = ManagementFactory.getThreadMXBean();
        System.out.println("Thread count: " + threadMX.getThreadCount());
        System.out.println("Peak: " + threadMX.getPeakThreadCount());
        long[] deadlocked = threadMX.findDeadlockedThreads();
        if (deadlocked != null && deadlocked.length > 0)
            System.out.println("⚠ DEADLOCK DETECTED");
        else
            System.out.println("✓ No deadlocks");
    }
}
Output
Thread count: 12
Peak: 15
✓ No deadlocks
Production Trap:
Don't leave profiler agents attached after diagnosis. Instrumenting agents can keep degrading performance (by 5-15% in some setups) even when idle.
Key Takeaway
Profiling is iterative: flame graph first, heap analysis second, GC tuning third, always with production-safe tools.
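The advice to verify effective flags inside the container can also be followed programmatically. The sketch below reads the values the running JVM actually settled on, assuming a HotSpot-based JVM where `com.sun.management.HotSpotDiagnosticMXBean` is available; the class name `EffectiveFlags` and the helper `flag` are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class EffectiveFlags {
    /** Reads the effective value of a HotSpot flag from inside the running JVM. */
    static String flag(String name) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // MaxHeapSize reflects container memory limits when no -Xmx is given
        long maxHeapBytes = Long.parseLong(flag("MaxHeapSize"));
        System.out.println("MaxHeapSize (MB): " + maxHeapBytes / mb);
        System.out.println("Runtime.maxMemory (MB): " + Runtime.getRuntime().maxMemory() / mb);
        // CPU limits on the container change this value, and with it the default GC thread counts
        System.out.println("CPUs visible to the JVM: " + Runtime.getRuntime().availableProcessors());
    }
}
```

Logging these three values at startup gives a record of what each pod really ran with, which is useful when a profile from one environment does not match another.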

Conclusion

Java profiling is not about finding a single silver bullet—it's about building a diagnostic workflow that adapts to your environment. You now have a toolkit: async-profiler for low-overhead CPU sampling, JFR Event Streaming for continuous telemetry, OQL for forensic heap analysis, and G1 GC tunings for latency-sensitive apps. The container ecosystem demands extra vigilance—Docker CPU throttling skews profiler results, and Kubernetes OOMKilled pods leave no core dump unless configured. Remember the Pareto principle: 80% of performance gains come from understanding 20% of your code paths. Start with flame graphs, identify the hottest methods, and only then reach for heap or GC tools. Avoid premature optimization. Profile in production only with JFR or async-profiler; save VisualVM and memory dump tools for staging or post-mortem. Finally, instrument your builds with early detection—spot regressions before they hit production. The codeforge is only as fast as its weakest thread. Profile wisely.

ProfilerSummary.javaJAVA
// io.thecodeforge — java tutorial
// Summary: choose your profiler
record Profiler(String name, boolean productionSafe, double overhead) {}
public class ProfilerSummary {
    public static void main(String[] args) {
        var p = new Profiler("async-profiler", true, 0.02);
        // overhead is stored as a fraction, so scale it to a percentage for display
        System.out.printf("✓ %s | safe=%s | overhead=%.1f%%%n",
            p.name(), p.productionSafe(), p.overhead() * 100);
    }
}
Output
✓ async-profiler | safe=true | overhead=2.0%
Final Word:
Your profiling workflow should be a loop: measure, hypothesize, verify, adjust. Never trust single-shot data.
Key Takeaway
Build a repeatable profiling pipeline—flame graphs, heap analysis, GC logs—and tune per workload, not per tool.
● Production incident · POST-MORTEM · severity: high

The 2 AM Latency Spike: G1 Concurrent Mode Failure

Symptom
Every 30 minutes during high traffic (8k req/min), response times jumped from 15ms to 3 seconds, then the service returned 503 for about 20 seconds before recovering. No errors in app logs, but GC logs showed 'Concurrent Mode Failure'.
Assumption
The team assumed a traffic spike was overwhelming the thread pool. They doubled the instance count — no improvement. The pattern persisted.
Root cause
G1 GC was configured with default -XX:InitiatingHeapOccupancyPercent=45. Under high allocation rates, the concurrent marking phase couldn't keep up, triggering a full STW (stop-the-world) compaction. The 20-second pause matched the errors exactly.
Fix
Reduced -XX:InitiatingHeapOccupancyPercent to 30 to start concurrent marking earlier, and increased heap from 4GB to 8GB to give G1 more runway. Full STW pauses dropped to zero. Latency returned to 15ms.
Key lesson
  • Never tune GC without first profiling allocation rates and pause frequencies.
  • Monitor GC logs alongside application metrics — they tell the real story.
  • Always test GC changes under production-like load; a config that works at 1k req/min may fail at 10k.
  • Concurrent mode failure is the production killer; know your IHOP setting before you need to change it.
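Why the fix worked can be shown with back-of-the-envelope arithmetic. The sketch below uses a deliberately simplified model: concurrent marking starts when occupancy reaches IHOP percent of the heap, and the heap then fills at a constant rate. The 100 MB/s fill rate is a hypothetical number for illustration, not a figure from the incident, and real G1 behavior also depends on young generation sizing and adaptive IHOP.

```java
import java.util.Locale;

public class IhopRunway {
    /** Seconds until the heap fills after marking starts, under a constant fill rate. */
    static double runwaySeconds(double heapMb, int ihopPercent, double fillRateMbPerSec) {
        double headroomMb = heapMb * (100 - ihopPercent) / 100.0;
        return headroomMb / fillRateMbPerSec;
    }

    public static void main(String[] args) {
        double fillRate = 100.0; // hypothetical MB/s; measure your own with JFR or GC logs
        System.out.printf(Locale.ROOT, "4 GB, IHOP=45: %.1f s of runway%n",
            runwaySeconds(4096, 45, fillRate));   // 22.5 s
        System.out.printf(Locale.ROOT, "8 GB, IHOP=30: %.1f s of runway%n",
            runwaySeconds(8192, 30, fillRate));   // 57.3 s
    }
}
```

Under these assumptions the change more than doubles the time concurrent marking has to finish before free regions run out, which is the "runway" the fix refers to.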
Production debug guide
When your app slows down or starts leaking memory, use this symptom-to-action routing to pick the right profiling tool and command.
8 entries
Symptom · 01
CPU usage high but no obvious hot method in logs
Fix
Attach async-profiler for a CPU profile: ./profiler.sh -e cpu -d 30 -f cpu.html <PID>
Symptom · 02
Heap grows over hours without recovery — suspect leak
Fix
Take a live heap dump: jmap -dump:live,format=b,file=heap.hprof <PID>. Analyze with Eclipse MAT or JProfiler.
Symptom · 03
Intermittent latency spikes with no CPU saturation
Fix
Collect a wall-clock profile: ./profiler.sh -e wall -d 30 -f wall.html <PID>. Look for lock contention or GC pauses.
Symptom · 04
Thread pool reports 'queue full' exceptions
Fix
Capture thread dumps every 5 seconds for 30 seconds: jstack <PID> > threaddump.txt (repeat). Look for threads in BLOCKED or WAITING state.
Symptom · 05
App crashes with OutOfMemoryError after hours of uptime
Fix
Add -XX:+HeapDumpOnOutOfMemoryError to JVM flags. The next crash generates a heap dump at the exact moment of failure.
Symptom · 06
Performance regression after deployment — no visible symptom change
Fix
Compare JFR recordings from before and after deployment using JDK Mission Control's automated analysis. Focus on GC pause times, allocation rates, and lock contention.
Symptom · 07
Application starts slow after deployment
Fix
Capture a JFR recording with startup events: -XX:StartFlightRecording=filename=startup.jfr,settings=profile. Analyze the initialization phase in JDK Mission Control.
Symptom · 08
Response time correlated with GC logs; suspect concurrent mode failure
Fix
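Inspect GC logs for 'Pause Full' entries, 'to-space exhausted', or an aborted concurrent cycle (the condition this article calls Concurrent Mode Failure). Check InitiatingHeapOccupancyPercent with jcmd <PID> VM.flags and current heap occupancy with jcmd <PID> GC.heap_info.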
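Scanning GC logs for full pauses is easy to automate. The sketch below assumes unified logging output (-Xlog:gc) where each pause line ends with its duration in milliseconds; the two sample lines are illustrative, not taken from the incident, and the exact cause text in parentheses varies by JDK version. The class name `FullGcScanner` is hypothetical.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FullGcScanner {
    // Matches e.g. "... GC(42) Pause Full (G1 Compaction Pause) 3890M->1210M(4096M) 20113.402ms"
    private static final Pattern FULL_PAUSE =
        Pattern.compile("Pause Full \\(([^)]*)\\).*?(\\d+(?:\\.\\d+)?)ms\\s*$");

    /** Returns the pause in milliseconds if the line is a full-GC pause, else -1. */
    static double fullPauseMillis(String line) {
        Matcher m = FULL_PAUSE.matcher(line);
        return m.find() ? Double.parseDouble(m.group(2)) : -1;
    }

    public static void main(String[] args) {
        // Illustrative log lines in the unified logging format
        List<String> sample = List.of(
            "[1800.101s][info][gc] GC(41) Pause Young (Normal) (G1 Evacuation Pause) 2100M->1900M(4096M) 12.310ms",
            "[1800.542s][info][gc] GC(42) Pause Full (G1 Compaction Pause) 3890M->1210M(4096M) 20113.402ms");
        for (String line : sample) {
            double ms = fullPauseMillis(line);
            if (ms >= 0) {
                System.out.println("Full GC pause: " + ms + " ms");
            }
        }
    }
}
```

Run against the real log file (for example with `Files.lines`), any non-empty result during peak traffic is a strong hint that marking is starting too late or the heap is too small.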
★ 5-Second Profiling Commands for Production Emergencies
When shit hits the fan, you don't have time to read docs. These commands work on any modern Java 11+ JVM with minimal overhead.
CPU spike, no obvious culprit
Immediate action
Grab a CPU flame graph with async-profiler
Commands
./profiler.sh -e cpu -d 15 -f flame.html <PID>
Open flame.html in browser — look for the widest frames
Fix now
If one method dominates (e.g., regex or serialization), cache or replace it
Heap climbing, likely leak
Immediate action
Trigger a live heap dump
Commands
jcmd <PID> GC.heap_dump /tmp/dump.hprof
Open in Eclipse MAT → Dominator Tree → look for biggest objects
Fix now
Identify the holding root (e.g., static cache, thread-local) and add eviction or weak references
Allocations causing excessive GC
Immediate action
Run allocation profile with async-profiler
Commands
./profiler.sh -e alloc -d 30 -f alloc.html <PID>
Sort by allocated bytes and look for short-lived objects created in hot loops
Fix now
Cut allocation on the hot path: reuse a StringBuilder instead of concatenating Strings in a loop, and use primitive arrays instead of ArrayList<Integer> to avoid boxing
Service unresponsive, threads stuck
Immediate action
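Capture two thread dumps a few seconds apart
Commands
jstack <PID> > /tmp/threads_1.txt; sleep 5; jstack <PID> > /tmp/threads_2.txt
Compare the two dumps: threads stuck on the same frame in both are blocked or deadlocked (jstack prints 'Found one Java-level deadlock' when it detects a lock cycle)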
Fix now
Kill the node and fix the lock ordering; add timeout to blocking calls
Performance degraded after container restart
Immediate action
Check whether JFR is still recording: recordings started at runtime with jcmd do not survive a container restart
Commands
jcmd <PID> JFR.check
If no recording is listed, start one with jcmd <PID> JFR.start or restart with -XX:StartFlightRecording
Fix now
Add always-on JFR startup flags to your Dockerfile or deployment template
GC pauses causing latency spikes
Immediate action
Check GC logs and JFR recording
Commands
jcmd <PID> JFR.dump name=gcrecording filename=gc_pauses.jfr
Open in JDK Mission Control → GC Pauses view
Fix now
If concurrent mode failure, reduce IHOP or increase heap size

Common mistakes to avoid

5 patterns

Optimizing without profiling first

Symptom
Weeks spent tuning the wrong code path; no improvement in latency or throughput.
Fix
Always start with JFR or async-profiler to identify the actual hot spots before making any changes.

Using jmap -histo without :live

Symptom
Histogram counts include unreachable objects awaiting collection, making analysis misleading.
Fix
Use jmap -histo:live <pid> for a histogram of reachable objects only, or jcmd <pid> GC.heap_dump for a heap dump, which excludes unreachable objects by default.

Ignoring wall-clock profiling

Symptom
CPU looks fine but latency is high; the bottleneck is lock contention or I/O.
Fix
Use async-profiler's wall-clock mode (-e wall) to capture blocked threads and correlate with thread dumps.

Tuning GC before measuring allocation rate

Symptom
After extensive tuning, GC still causes pauses because allocation rate overwhelms the collector.
Fix
Profile allocation rate with JFR or async-profiler alloc mode first. If rate >500MB/s, fix allocation before tuning GC.

Assuming one heap dump tells the full story

Symptom
Leak Suspects report points at a large object, but it's a stable cache, not a leak.
Fix
Take two heap dumps an hour apart and compare retained heap sizes of suspect classes to calculate leak rate.
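The two-dump comparison in the last fix boils down to simple arithmetic. The sketch below shows it with hypothetical numbers: a suspect class retaining 300 MB in the first dump and 420 MB in the second, taken one hour apart, with 2400 MB of heap still free. The class and method names are illustrative.

```java
public class LeakRate {
    /** Growth in bytes per hour between two retained-size measurements. */
    static double bytesPerHour(long retainedBefore, long retainedAfter, double hoursApart) {
        return (retainedAfter - retainedBefore) / hoursApart;
    }

    /** Hours until the free heap is consumed at the given growth rate; infinite if not growing. */
    static double hoursUntilFull(long freeHeapBytes, double bytesPerHour) {
        return bytesPerHour <= 0 ? Double.POSITIVE_INFINITY : freeHeapBytes / bytesPerHour;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Hypothetical retained sizes of one suspect class, read from two dumps taken 1 hour apart
        double rate = bytesPerHour(300 * mb, 420 * mb, 1.0);
        System.out.println("Leak rate: " + (long) (rate / mb) + " MB/hour");   // 120 MB/hour
        System.out.println("Hours until 2400 MB of free heap is gone: "
            + hoursUntilFull(2400 * mb, rate));                                 // 20.0
    }
}
```

A stable cache shows a rate near zero in this calculation, which is how it can be told apart from a genuine leak that a single dump would flag identically.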
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Explain the difference between JFR and async-profiler. When would you us...
Q02 · SENIOR
Your team reports a 20-second latency spike every 30 minutes during peak...
Q03 · SENIOR
How do you find the root cause of a memory leak in a Java application? D...
Q04 · SENIOR
What are the key differences between G1, ZGC, and Shenandoah? When would...
Q05 · SENIOR
What precautions do you take when profiling a production JVM?
Q01 of 05 · SENIOR

Explain the difference between JFR and async-profiler. When would you use each?

ANSWER
JFR is built into the JDK and records a wide range of events (GC, allocations, JIT, I/O), typically with less than 1% overhead. It's designed for continuous monitoring: you start it and leave it running. async-profiler is an external agent that samples CPU through perf_events and AsyncGetCallTrace, samples allocations through TLAB events, and renders the results as flame graphs. It's better for targeted, ad-hoc investigations because you can attach it to a running process and get a flame graph in seconds. Use JFR for always-on observability and async-profiler when you need a deep dive on a specific symptom. They complement each other: run JFR continuously, and attach async-profiler when you see an anomaly in JFR data.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What's the difference between CPU profiling and wall-clock profiling?
02
Can I use JFR and async-profiler simultaneously?
03
How do I know if my GC needs tuning?
04
Why does async-profiler fail in my Kubernetes container?
05
What's the quickest way to find a memory leak in production?