Senior 25 min · March 06, 2026

G1 GC Concurrent Mode Failure — Fix 20s STW Pause

At 8k req/min, G1 GC's default IHOP=45 triggered Concurrent Mode Failure, causing 20-second pauses and 503 errors.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • Java profiling measures CPU, memory, thread, and I/O usage in a live application
  • JFR (Java Flight Recorder) is built-in, low-overhead event recording for continuous monitoring
  • async-profiler combines CPU sampling, allocation profiling, and wall-clock profiling in one agent
  • Flame graphs visualize call stacks as rectangles; width = time spent, color = function type
  • Biggest mistake: optimizing without profiling first — you'll guess wrong every time
  • Performance insight: JFR targets <1% overhead; async-profiler adds ~2% during sampling
Plain-English First

Imagine your Java app is a restaurant kitchen. Orders are coming in, but food is taking forever to reach customers. Profiling is like installing cameras and timers on every chef, every station, and every oven — so you can see exactly WHO is slow, WHERE the bottleneck is, and WHY the kitchen is on fire. Without profiling, you're just guessing. With it, you walk straight to the broken fryer.

Most Java performance problems don't announce themselves. They show up as mysterious latency spikes at 2am, a heap that grows 10MB per hour until the app dies, or a thread pool that silently saturates under load while your dashboards look green. These aren't bugs in the traditional sense — they're invisible tax your code pays at scale, and they're the kind of thing that separates senior engineers from everyone else.

Profiling is the discipline of measuring your running application to find exactly where CPU time, memory, threads, and I/O are going — before you optimize anything. The cardinal sin in performance work is optimizing without data. You'll almost always guess wrong, spend a week tuning the wrong method, and make the codebase harder to read for zero gain. Good profiling tools give you a flame graph that says 'this one method accounts for 43% of your CPU' — and suddenly the path forward is obvious.

This guide gets straight to the point: which profiling tool for which symptom, how to read flame graphs and heap dumps without getting lost, the production safeguards that prevent profiling from becoming the cause of your next incident, and the one workflow that makes performance tuning repeatable. Time to get into the details.

Here's the hard truth: most teams waste weeks on performance work because they skip the baseline. You need to know what 'normal' looks like before you can spot abnormal. That means setting up JFR on day one, even if you're not debugging anything. The data you collect when everything is fine is your most valuable asset when things break.

What is Java Profiling and Performance?

Java profiling means measuring a running JVM to see where CPU time, memory, and waiting time actually go. Rather than starting with a dry definition, let's see it in action and understand why it exists.

When you run a real server, profiling gives you the answers to three hard questions: where is the CPU going, where is the memory going, and what is the application waiting on? Without this data, every optimisation is guesswork. The tools we'll cover — JFR, async-profiler, and heap dump analyzers — let you answer those questions without restarting your JVM or modifying your code. They attach to a running process and record what's happening in real time, with overhead so low you can run them in production during business hours.

Here's a concrete trap: a team spent two weeks optimising a database query that appeared slow in their test environment. After profiling the production instance, they discovered the actual bottleneck was thread contention in their object mapper — the query was fine. Profiling first would've saved them a sprint.

Think of profiling as your JVM's black box recorder — you want it running before the incident, not after. The same way you'd never debug a plane crash without the flight data recorder, you shouldn't debug a production slowdown without profiling data. That's why always-on JFR is the first tool you set up.

Here's the thing: profiling isn't just about finding hot spots. It's about building a baseline. If you don't know what normal looks like — normal allocation rate, normal GC pause distribution, normal thread states — you can't recognise abnormal when it hits. Start recording today, even if you're not debugging anything. You'll thank yourself at 3am six months from now.

One more nuance: profiling doesn't replace good logging. It complements it. Logs tell you what happened; profiling tells you why the CPU was pinned. Always correlate profiler data with your application logs and metrics dashboards.

Without profiling, every optimization is a shot in the dark — and your users pay for every miss.

Let's expand the mental model: think of your JVM as a factory floor. CPU profiling is a heat map of machine usage, memory profiling is an inventory of raw materials, and wall-clock profiling is a stopwatch for every process. The manager (you) uses all three to find the bottleneck. A common mistake is focusing only on CPU – memory leaks and lock contention are just as critical. Always collect all three pillars before making a change.

Production story: A team's CPU profile looked clean, but latency was high. They assumed network. Wall-clock profiling revealed lock contention on a shared cache. The fix was a read-write lock, slashing latency by 60%. If they'd only looked at CPU, they'd have missed it entirely.

io/thecodeforge/ForgeExample.java · JAVA
package io.thecodeforge;

public class ForgeExample {
    public static void main(String[] args) {
        String topic = "Java Profiling and Performance";
        System.out.println("Learning: " + topic);
    }
}
The Three Pillars of Profiling
  • CPU profiling shows which methods consume processor cycles — hot spots from intense computation.
  • Memory profiling tracks object allocations and heap occupancy — finds leaks and high allocation rates.
  • Wall-clock profiling measures actual elapsed time — captures blocking on locks, I/O, and GC pauses.
Production Insight
Most teams start profiling only after a crisis — by then, it's too late to see the gradual pattern.
Proactive profiling during normal load reveals the silent killers (e.g., slow allocation growth).
Rule: profile one instance per cluster for 5 minutes each day during peak hours, store recordings, compare weekly.
Also: profile after every major deployment to catch regressions before they reach users.
Don't ignore off-peak profiles: idle JVMs can still have memory leaks or thread buildup that only show outside of load.
Always include wall-clock profiling in your baseline — CPU-only profiles hide lock contention and IO waits.
Key Takeaway
Profiling is the starting point, not the reward.
Always collect data before acting — the biggest optimizations come from the data, not intuition.
Set up JFR on day one, even if you don't need it yet.
You can't fix what you don't measure: baseline profiling is your safety net.
Which Profiling Pillar to Start With?
If: High CPU usage and response time spikes
Use: Start with CPU profiling to find hot methods.
If: Gradual heap growth and eventual OOM
Use: Start with memory profiling via heap dump or allocation profile.
If: Erratic latency but low CPU
Use: Start with wall-clock profiling to find lock/IO waits.

Core Profiling Tools: JFR and async-profiler

Two tools dominate modern Java profiling: Java Flight Recorder (JFR) and async-profiler. JFR has shipped with OpenJDK since version 11 (before that it was a commercial feature of the Oracle JDK). It records fine-grained events such as GC pauses, object allocations, JIT compilations, and I/O operations, typically at around 1% overhead with the default settings. You start it with jcmd <pid> JFR.start and dump a recording file later. No JVM restart needed.

async-profiler is an open-source agent that uses a combination of perf_events (on Linux) and a custom JVMTI agent to produce CPU and allocation flame graphs. It's the go-to tool for ad-hoc profiling because you can attach it to a running process, collect a 30-second sample, and get an interactive HTML flame graph you can share with the team. Its two time-based sampling modes are CPU (samples only running threads) and wall (samples all threads, including those blocked on IO or locks); it can also profile allocations (alloc) and lock contention (lock).

The choice depends on your use case: JFR for continuous, always-on monitoring (think of it like a JVM black box); async-profiler for targeted investigations when you suspect a specific function or class.

One important nuance: JFR's profile.jfc template samples method stacks at a fixed frequency, while async-profiler uses kernel sampling via perf_events. The two approaches can give different results for very short methods. Always validate hot spots with both tools if the impact is high.

Production trap: async-profiler's CPU mode relies on perf_events, which can conflict with container CPU limits in Kubernetes. If your pod is throttled, you'll see inflated CPU percentages in the flame graph. Always correlate with host-level metrics.

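There's a nuance though: in a container where the perf_event_open syscall is blocked (typically the case unless the container is granted extra privileges such as SYS_ADMIN), async-profiler's perf_events-based CPU mode won't work. Fall back to -e itimer or JFR. We'll cover container profiling later.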

One more thing: don't sleep on JFR's event streaming API. Since JDK 14, you can subscribe to JFR events programmatically, with no manual dump step and no recording files to parse afterwards. You get a near-live stream of GC pauses, allocation samples, and lock contention shortly after they happen. It's the best way to build custom monitoring without adding agents.
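To make the streaming API concrete, here is a minimal sketch that subscribes to the jdk.GarbageCollection event and prints a line when a collection's longest pause exceeds a threshold. The class name GcPauseWatcher and the 200 ms threshold are illustrative assumptions, not values from this incident; the decision logic is split into a small pure method so it can be tested without a running recording.

```java
import java.time.Duration;
import jdk.jfr.consumer.RecordingStream;

// Hypothetical watcher class; name and threshold are assumptions for illustration.
final class GcPauseWatcher {
    static final Duration THRESHOLD = Duration.ofMillis(200);

    // Pure decision logic, kept separate from the JFR wiring.
    static boolean isSlow(Duration pause, Duration threshold) {
        return pause.compareTo(threshold) > 0;
    }

    public static void main(String[] args) {
        // RecordingStream (JDK 14+) starts an in-process recording and hands
        // events to the callback as they are flushed.
        try (RecordingStream stream = new RecordingStream()) {
            stream.enable("jdk.GarbageCollection");
            stream.onEvent("jdk.GarbageCollection", event -> {
                Duration longest = event.getDuration("longestPause");
                if (isSlow(longest, THRESHOLD)) {
                    System.out.printf("Slow GC: %s, cause=%s, longest pause=%d ms%n",
                            event.getString("name"),
                            event.getString("cause"),
                            longest.toMillis());
                }
            });
            stream.start(); // blocks the calling thread until the stream is closed
        }
    }
}
```

In a real service you would probably call startAsync() instead of start() so the stream runs on a background thread, and forward the alert to your metrics system rather than stdout.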

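Another underused feature: JFR can record network and file I/O. The jdk.SocketRead, jdk.SocketWrite, jdk.FileRead, and jdk.FileWrite events are enabled in the built-in templates, but only for operations that exceed a duration threshold, so fast calls are not recorded. Lower the threshold in a custom .jfc settings file if you need finer detail. No extra JVM flags are required on JDK 11+. This helps when the bottleneck is external.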

JFR's event streaming API is your secret weapon for real-time observability — use it to trigger alerts on allocation rate spikes without polluting your logs.

Also consider combining both tools: run JFR continuously, and when you see an anomaly in the JFR dashboard, attach async-profiler for a deep dive. This hybrid approach gives you both broad coverage and surgical precision. The cost is minimal — JFR overhead is <1% and async-profiler is temporary.

Production story: A team used async-profiler CPU mode and saw no hot methods. But JFR's allocation profile revealed a massive allocation rate from a logging framework at DEBUG level. They turned down log verbosity, saving 20% CPU — a fix that CPU profiling alone couldn't find.

profiling_commands.sh · BASH
# Start JFR recording with default template (low overhead)
jcmd <pid> JFR.start name=monitor duration=60s filename=/tmp/recording.jfr

# Dump an active recording without stopping it
jcmd <pid> JFR.dump name=monitor filename=/tmp/recording.jfr

# CPU profile with async-profiler for 30 seconds
# (async-profiler 3.x ships the launcher as asprof instead of profiler.sh)
./profiler.sh -e cpu -d 30 -f flame_graph.html <pid>

# Allocation profile (requires async-profiler's alloc mode)
./profiler.sh -e alloc -d 30 -f alloc.html <pid>

# Wall-clock profile (includes blocked threads)
./profiler.sh -e wall -d 30 -f wall.html <pid>
Tool Selection Rule
If you don't know which tool to use, start with JFR. The default template records a broad set of events at roughly 1% overhead, and you can analyze the recording later. Flame graphs are faster to read but you need to know what to look for.
Production Insight
JFR overhead is around 1% with the default template and closer to 2% with the profile template; enabling every event costs considerably more.
async-profiler's CPU mode uses perf_events on Linux — it can conflict with container CPU limits.
Rule: always test profiling overhead on a canary instance before attaching to production traffic.
JFR's event streaming API allows real-time allocation rate alerts; use it to catch leaks early, and integrate with Prometheus via jfr-exporter.
async-profiler's wall-clock mode is essential for lock contention; a CPU-only profile looks deceptively idle when threads are blocked.
Key Takeaway
Use JFR for continuous background monitoring.
Use async-profiler for targeted, ad-hoc diagnostic sessions.
The combo covers every common profiling scenario without restarting your JVM.
JFR's event streaming API is your real-time observability layer — use it.
Don't forget to test both tools in your specific environment before an emergency.
Choose Your Tool: JFR vs async-profiler
If: Continuous monitoring for historical analysis
Use: JFR with default template (24/7).
If: Ad-hoc investigation of a specific symptom
Use: async-profiler with the relevant event (cpu/alloc/wall).
If: Need both long-term recording and instant flame graphs
Use: Run JFR continuously and async-profiler on demand.
If: Need to correlate GC pauses with application latency
Use: JFR to get GC event timestamps and match with request latency.

Reading Flame Graphs: What to Look For

A flame graph is a visual representation of a stack trace sample set. The x-axis groups stack frames alphabetically (it is not a timeline), and the width of each rectangle is proportional to the number of samples that included that frame. The y-axis is the stack depth: the top is the function actually running, and below it are its callers. Color typically indicates the frame type; in async-profiler's default palette, green is Java code, yellow is C++ (JVM internals), red is native user code, and orange is kernel code.
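The "width = number of samples that included the frame" rule is easy to check by hand on folded stacks, the one-line-per-stack text format (frames separated by semicolons, followed by a sample count) that flame graph tools consume. The sketch below uses made-up frame names and a hypothetical helper class; it totals samples per frame name across the whole profile, whereas a real flame graph merges frames by position, so treat it as an approximation of what the picture shows.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any profiler's API.
final class FrameWidths {

    // For each frame name, sums the samples of every stack that contains it.
    static Map<String, Long> inclusiveSamples(List<String> foldedLines) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String line : foldedLines) {
            int split = line.lastIndexOf(' ');
            if (split < 0) {
                continue; // not a "stack count" line
            }
            long samples = Long.parseLong(line.substring(split + 1).trim());
            String[] frames = line.substring(0, split).split(";");
            // The set makes a recursive frame count once per stack.
            for (String frame : new LinkedHashSet<>(Arrays.asList(frames))) {
                totals.merge(frame, samples, Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Illustrative data: root frame first, running frame last.
        List<String> folded = Arrays.asList(
                "main;handle;parseJson 30",
                "main;handle;query 60",
                "main;log 10");
        Map<String, Long> totals = inclusiveSamples(folded);
        long all = totals.get("main"); // the root frame is in every stack
        totals.forEach((frame, n) ->
                System.out.println(frame + " = " + (100 * n / all) + "% of samples"));
    }
}
```

Here handle is wide (90 of 100 samples) only because query sits on top of it, which is exactly why a wide frame in the middle of a stack is a pointer, not a verdict.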

When reading a flame graph, start at the top edge and look for the widest frames. Those are your hot spots. A common trap is staring at a wide frame that's a low-level method like Unsafe.park(): that's not the culprit; its caller is. Always trace wide frames downward through their callers until you reach the application code that triggers them. If the graph shows many thin, tall towers, you have deep call stacks, often recursion or poorly designed frameworks. If the graph looks like a plateau (many wide frames at similar depth), you have multiple hot spots.

Another pattern: a 'mountain' shape with a single wide top — that's your one bottleneck. A 'volcano' with multiple peaks — load is spread across several paths; optimising any one may shift the bottleneck without much improvement overall.

Real-world mistake: One team saw a wide frame for java.util.HashMap.put() and assumed they needed a faster hash map. But the flame graph showed it was called from a logging framework at DEBUG level. Turning down log verbosity fixed the CPU usage in 30 seconds. The frame width lies without context.

You'll also encounter 'icicle' graphs, which are inverted: the root is at the top and the stacks grow downward. Some tools emit them by default or offer them as an option. Same reading technique, mirrored: look for the widest leaf frames, which now sit at the bottom edge of the icicle.

Here's a rule I've learned the hard way: capture your flame graphs under peak load first. A flame graph from a quiet period rarely shows the hot spots, because they only reveal themselves under pressure. Keep an off-peak profile only as a baseline to compare against.

Also note: flame graphs can be misleading for lock contention. A thread blocked on a lock doesn't appear CPU-sampled. That's why you need wall-clock mode. Always cross-reference flame graphs with thread dumps.
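A quick way to cross-check a CPU flame graph against thread states is to count threads by state, which is roughly what you would tally by hand from a jstack dump. The sketch below does this in-process with the standard ThreadMXBean; the class name is hypothetical, and running it inside the application (for example behind an admin endpoint) is an assumption about your setup.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: a thread-state census of the current JVM.
final class ThreadStateCensus {

    static Map<Thread.State, Integer> countByState() {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        // false, false: skip locked-monitor and synchronizer details to keep the dump cheap.
        ThreadInfo[] infos = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        for (ThreadInfo info : infos) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        countByState().forEach((state, n) -> System.out.println(state + ": " + n));
    }
}
```

If most worker threads show up as BLOCKED or WAITING while the CPU flame graph looks thin, that is the cue to switch to wall-clock mode.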

Short-lived methods can be invisible in CPU flame graphs — use JFR to capture them. Every hot spot you fix changes the shape of the graph; retest after each optimisation.

One advanced tip: use differential flame graphs to compare two profiling sessions. For example, compare before and after a deployment. The diff shows exactly which methods got hotter or cooler. This is invaluable for detecting performance regressions. async-profiler doesn't generate diffs natively, but you can use the FlameGraph toolkit's difffolded.pl script.
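Under the hood a differential flame graph is a per-stack subtraction of two folded profiles. The following sketch shows that arithmetic with made-up stacks and a hypothetical class name; it compares raw sample counts, so it assumes both sessions ran for the same duration at the same sampling interval (otherwise normalize the counts first).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper illustrating what a folded-stack diff computes.
final class FoldedDiff {

    // Parses "frame;frame;frame count" lines into stack -> samples.
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> samples = new HashMap<>();
        for (String line : lines) {
            int split = line.lastIndexOf(' ');
            if (split < 0) {
                continue; // skip malformed lines
            }
            long count = Long.parseLong(line.substring(split + 1).trim());
            samples.merge(line.substring(0, split), count, Long::sum);
        }
        return samples;
    }

    // Positive delta: the stack got hotter; negative: it got cooler.
    static Map<String, Long> delta(Map<String, Long> before, Map<String, Long> after) {
        Map<String, Long> result = new TreeMap<>();
        before.forEach((stack, n) -> result.merge(stack, -n, Long::sum));
        after.forEach((stack, n) -> result.merge(stack, n, Long::sum));
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> before = parse(Arrays.asList("main;handle;query 60", "main;log 10"));
        Map<String, Long> after = parse(Arrays.asList("main;handle;query 45", "main;log 40"));
        delta(before, after).forEach((stack, d) ->
                System.out.println((d > 0 ? "+" : "") + d + "  " + stack));
    }
}
```

Sorting the result by absolute delta gives you a regression shortlist even before you render the SVG.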

Another common trap: drawing conclusions from a single profiling session. Always capture several sessions over time to see the trend. One 30-second window may miss intermittent spikes.

flame_graph_reading.sh · BASH
# Collect folded (collapsed) stacks with async-profiler
./profiler.sh -e cpu -d 30 -o collapsed -f stacks.folded <pid>
# Render to SVG with the FlameGraph toolkit
# (collapsed output is already folded, so no stackcollapse step is needed)
flamegraph.pl stacks.folded > flame.svg

# Or use async-profiler's built-in HTML output (easier to share)
./profiler.sh -e cpu -d 30 -f flame.html <pid>

# Differential flame graph (requires two folded stack files)
./profiler.sh -e cpu -d 30 -o collapsed -f before.folded <pid_before>
./profiler.sh -e cpu -d 30 -o collapsed -f after.folded <pid_after>
# Frames that got hotter in after.folded are drawn red, cooler ones blue
difffolded.pl before.folded after.folded | flamegraph.pl > diff.svg
Flame Graph Shapes
  • Mountain shape: one dominant bottleneck — optimize this method.
  • Volcano shape: multiple hot spots — improving one may shift bottleneck.
  • Plateau shape: many methods consuming roughly equal time — focus on allocation or I/O.
Production Insight
Flame graphs aggregate over time — they can hide short-lived spikes.
Always correlate with latency metrics: a wide frame may be innocent if it runs during idle periods.
Biggest mistake: interpreting width as 'bad' — wide may mean nothing if the function is expected to take time (e.g., waiting on DB).
Short methods can be invisible; use JFR's method profiling for complete picture.
Differential flame graphs are the best tool for catching regressions; make them part of every post-deploy verification.
Key Takeaway
Look for the widest frames along the top edge.
Trace them downward through their callers to find the application code responsible.
Correlate flame graphs with latency percentiles, not just averages.
A mountain shape is easier to fix than a plateau.
Use differential flame graphs to spot regressions instantly.

Heap Analysis: Finding the Leak

Memory leaks in Java are almost never about unreachable objects — those get GC'd. The real leaks come from accidental retention: objects that remain reachable but are no longer needed. Common patterns include static collections (caches without eviction), thread-local variables that accumulate, or JDBC statements not closed. Profiling a leak means taking a heap dump at a point when you suspect the heap has grown, then analyzing it to find the 'dominator' objects that hold the most memory.
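As an illustration of accidental retention and one way out of it, the sketch below bounds a cache with LinkedHashMap's removeEldestEntry hook so the least recently used entry is dropped once a size limit is reached. The class name and the limit are assumptions; this is size-based LRU eviction rather than TTL eviction, and the map is not thread-safe on its own (wrap it with Collections.synchronizedMap or use a caching library for concurrent access).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded cache: evicts the least recently used entry past maxEntries.
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true gives LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after every put; returning true removes the eldest entry.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        Map<Integer, String> cache = new BoundedCache<>(100);
        for (int i = 0; i < 1_000; i++) {
            cache.put(i, "session-" + i);
        }
        // An unbounded static HashMap would hold all 1,000 entries here.
        System.out.println("entries retained: " + cache.size());
    }
}
```

The design point is that the bound lives in the data structure, so no caller can forget to evict.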

Eclipse MAT (Memory Analyzer Tool) is the de facto standard for this. You load a heap dump, run the 'Leak Suspects' report, and it highlights the biggest retained sets. The 'Dominator Tree' view shows which objects would be freed if a given root were removed. For example, if a single HashMap holds 90% of the heap with 2 million stale entries, that's the leak.

Important: take a live heap dump (jmap -dump:live) to exclude unreachable objects; note that the live option forces a full GC first, which is itself a pause. Full heap dumps include all objects and take much longer to process. Always capture a few dumps over time to see the growth rate, because one snapshot can't tell you if the growth is a leak or just a large but stable cache.

Also consider using JFR's allocation profiling to find the call sites that produce the most garbage. Sometimes the fix is not to remove the collection but to reduce its creation rate.

Production nuance: A heap dump from a process that's about to OOM might be truncated, so critical objects could be missing. Always capture a second dump after recovery to compare. Also, the Leak Suspects report is a heuristic; always verify by following the paths to GC roots from the dominator tree.

I once saw a Leak Suspects report point at a HashMap in logging, but the real culprit was a thread-local cache in the user session handler. Always cross-check with the thread overview and dominator tree.

Here's the uncomfortable truth about heap analysis: by the time you notice the leak, it's been running for days. You need to calculate the leak rate. Take a dump, wait an hour, take another. If the retained heap of a suspect class grew by 200MB, you have a leak rate of ~3.3MB/min. That number tells you how long you have before the next OOM — and whether a hotfix can wait until the next release cycle.
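The leak-rate arithmetic is worth scripting so nobody does it in their head during an incident. This is a minimal sketch with hypothetical names; the 1,000 MB starting size and the 2,000 MB of remaining headroom in the example are invented inputs, and the projection assumes the leak keeps growing linearly.

```java
// Hypothetical helper for the two-dump leak-rate calculation.
final class LeakRate {

    // Growth of a suspect's retained heap per minute between two dumps.
    static double mbPerMinute(double retainedBeforeMb, double retainedAfterMb, double minutesBetween) {
        return (retainedAfterMb - retainedBeforeMb) / minutesBetween;
    }

    // Minutes until the remaining free heap is consumed at a constant leak rate.
    static double minutesUntilFull(double freeHeapMb, double rateMbPerMinute) {
        if (rateMbPerMinute <= 0) {
            return Double.POSITIVE_INFINITY; // not growing, so no projected OOM
        }
        return freeHeapMb / rateMbPerMinute;
    }

    public static void main(String[] args) {
        // Two dumps one hour apart; the suspect grew from 1,000 MB to 1,200 MB.
        double rate = mbPerMinute(1_000, 1_200, 60);
        double minutesLeft = minutesUntilFull(2_000, rate);
        System.out.printf("leak rate: %.2f MB/min, about %.0f minutes until the heap is full%n",
                rate, minutesLeft);
    }
}
```

With those inputs the rate works out to 200 / 60, roughly 3.3 MB/min, which matches the figure in the text.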

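One more tip: use jcmd instead of jmap on JDK 11+. jcmd <pid> GC.heap_dump is the supported replacement; by default it dumps only live objects (which forces a full GC), and the -all option includes unreachable objects as well.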

Leak rate calculation is your timeline to failure — use two dumps taken an hour apart to compute it. A leak rate of 5MB/min means you have hours, not days.

For advanced users: combine heap analysis with allocation profiling. Use async-profiler's -e alloc to see which code paths create the most objects, then correlate those call sites with the objects found in the heap dump. This cross-reference is far more powerful than either technique alone.

A story: A team's leak suspect report showed a HashMap in logging, but after using OQL to find the largest entries, they discovered it was a cache in their session handling. The fix was adding TTL eviction, reducing heap growth from 10MB/min to 0.

heap_analysis.sh · BASH
# Trigger a live heap dump (forces GC first)
jmap -dump:live,format=b,file=heap_dump_$(date +%s).hprof <pid>

# Or use jcmd (preferred for JDK 11+); dumps live objects only unless -all is given
jcmd <pid> GC.heap_dump /tmp/live_dump.hprof

# Automatically dump on OutOfMemoryError (add to JVM flags)
# -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/

# Analyze with Eclipse MAT (GUI) or OQL
# MAT -> File -> Open Heap Dump -> Leak Suspects Report

# Allocation profile with async-profiler
./profiler.sh -e alloc -d 30 -f alloc.html <pid>
Don't Forget Allocation Profiling
Heap dumps show you what's alive, not where it was created. Use async-profiler's -e alloc mode alongside heap dumps to pinpoint the code that allocates the most memory.
Production Insight
A heap dump from a process that's about to crash may be truncated — important objects may be missing.
Always take a second dump a fixed interval after the first (an hour works well) to confirm the trend and compute the leak rate.
The 'Leak Suspects' report is a heuristic; verify by following the paths to GC roots from the dominator tree.
Combine heap dumps with allocation profiling for a complete picture.
Key Takeaway
Leaks hide in accidental retention, not unreachable objects.
Take live dumps with jcmd or jmap -dump:live.
Use MAT's Dominator Tree to find the one HashMap or ArrayList that holds everything.
Pair with allocation profiling to find the source.
Don't trust the Leak Suspects report blindly — verify with the dominator tree.
Calculate leak rate: two dumps, one hour apart, retained heap delta.
Choosing Heap Analysis Technique
If: Suspected leak from a static collection
Use: MAT Dominator Tree to find the collection's retained heap.
If: Suspected large objects (e.g., strings, byte arrays)
Use: OQL to find instances with @usedHeapSize > threshold.
If: Suspected thread-local accumulation
Use: MAT thread overview to examine each thread's retained heap.

GC Tuning: The Production Reality

GC tuning is the most overrated performance activity. Most applications need none — default settings with G1 (Java 9+) or Parallel (pre-9) work fine up to moderate loads. Tuning only matters when you have evidence from profiling that GC is causing latency or throughput issues. That evidence comes from JFR GC events or from explicit GC logs.

When you do need to tune, the three most impactful knobs are:
  1. Heap size (-Xms, -Xmx): Too small causes frequent GC; too large can lengthen pauses. Start by setting initial and max to the same value to avoid resizing overhead.
  2. Pause time goal (-XX:MaxGCPauseMillis for G1): G1 tries to keep pauses under this value, but may collect more often or give up throughput to meet it. A tight goal (like 10ms) shrinks the young generation and forces more frequent young GCs.
  3. InitiatingHeapOccupancyPercent (G1): When old-generation occupancy crosses this percentage of the heap, G1 starts a concurrent marking cycle. Lower it to start marking earlier, which reduces the risk of running out of space before marking finishes (the 'concurrent mode failure' that ends in a full GC). The default is 45%, and since JDK 9 G1 treats it as a starting value that adaptive IHOP adjusts at runtime; reducing it to 30% gives more time for concurrent work.
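Why a lower IHOP buys time can be shown with back-of-envelope arithmetic. The sketch below is a deliberately simplified model, not how G1 computes anything internally: it assumes the old generation grows at a constant rate and ignores the space the young generation needs. The 8 GB heap and 50 MB/s growth rate are illustrative inputs.

```java
// Hypothetical back-of-envelope model of marking headroom; not G1's own logic.
final class IhopHeadroom {

    // Seconds between crossing the IHOP threshold and the heap being full,
    // assuming a constant old-generation growth rate.
    static double secondsOfHeadroom(double heapMb, int ihopPercent, double oldGenGrowthMbPerSec) {
        double freeAtThresholdMb = heapMb * (100 - ihopPercent) / 100.0;
        return freeAtThresholdMb / oldGenGrowthMbPerSec;
    }

    public static void main(String[] args) {
        double heapMb = 8_192;     // 8 GB heap (assumed)
        double growthMbPerSec = 50; // old-gen growth rate (assumed)
        for (int ihop : new int[] {45, 30}) {
            System.out.printf("IHOP=%d%% leaves about %.0f s for concurrent marking%n",
                    ihop, secondsOfHeadroom(heapMb, ihop, growthMbPerSec));
        }
    }
}
```

If concurrent marking on your heap takes longer than the headroom this estimate gives you, lowering IHOP or reducing the promotion rate is the lever to pull.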

For ZGC and Shenandoah, the story is different: they target pauses in the low-millisecond to sub-millisecond range at the cost of some CPU overhead, because both do most of their work concurrently behind barriers that run on application threads (ZGC uses load barriers, Shenandoah uses barriers plus forwarding pointers). They shine on very large heaps (100GB+) but have higher baseline CPU usage. Profile your allocation rates first: if you're allocating 50GB/min, no GC will be happy.

One subtle trap: the -XX:MaxGCPauseMillis flag is a goal, not a guarantee. G1 will adjust the region set to try to meet it, but under high allocation pressure it may not be achievable. Monitor the pause times in your GC logs to see if the target is consistently missed.

Real example: A team set MaxGCPauseMillis=50 on a 32GB heap with 80GB/min allocation rate. G1 could not meet the target and started triggering back-to-back young GCs, causing 30% throughput loss. They had to increase the pause target to 200ms and optimise allocation rates instead.

Default G1 settings assume moderate allocation rates (~100MB/s). If you're allocating >500MB/s, you need to tune regardless of heap size. The allocation rate is the real determinant of GC pressure.
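Before arguing about allocation rate thresholds, measure what a suspect code path actually allocates. This sketch uses the HotSpot-specific com.sun.management.ThreadMXBean to read the current thread's allocated-bytes counter before and after a piece of work; the class name is hypothetical, and the approach assumes a HotSpot-based JVM where thread allocation accounting is supported and enabled (the default).

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper: measures bytes allocated by the calling thread during a task.
final class AllocationMeter {
    // Publishing results to a volatile field keeps the JIT from eliminating the allocation.
    static volatile Object sink;

    static long allocatedBytes(Runnable work) {
        com.sun.management.ThreadMXBean threads =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long id = Thread.currentThread().getId();
        long before = threads.getThreadAllocatedBytes(id);
        work.run();
        return threads.getThreadAllocatedBytes(id) - before;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long bytes = allocatedBytes(() -> {
            for (int i = 0; i < 1_000; i++) {
                sink = new byte[64 * 1024]; // stand-in for the code path under suspicion
            }
        });
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("allocated %d MB at about %.0f MB/s%n",
                bytes >> 20, (bytes / 1e6) / seconds);
    }
}
```

This only sees one thread, so use it to rank code paths; for the JVM-wide rate, rely on JFR or GC logs.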

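Here's what most guides won't tell you: -XX:G1HeapRegionSize matters more than people think. G1 picks the region size ergonomically, aiming for roughly 2048 regions with a power-of-two size between 1MB and 32MB, so a 4GB heap gets 2MB regions while a 64GB heap already gets 32MB. Any object of at least half a region is 'humongous' and is placed directly in contiguous old-generation regions, so with 2MB regions every 1MB buffer takes that slow path. If your GC logs show frequent humongous allocations, bump the region size to 16MB or 32MB. Fewer, larger regions also mean less bookkeeping and better pause predictability. I've seen this single change reduce mixed GC pauses by 40%.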

Allocation rate is the real enemy — if it's >500MB/s, GC tuning alone won't save you. Profile allocation sources and fix them before touching GC flags.

Another subtle point: the -XX:+UseStringDeduplication flag can reduce memory usage in applications with many duplicate strings. But it adds CPU overhead. Profile before enabling to ensure net gain.

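Don't forget to check your GC logs for premature promotion. If young objects are promoted to old gen too early because the survivor spaces are too small, the old generation fills faster and you'll see more mixed and full GCs. Adjust -XX:SurvivorRatio or -XX:NewRatio to give young gen more room, keeping in mind that fixing the young generation size in G1 overrides the sizing it would otherwise do to meet the pause goal.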

A story: a team was seeing 20s STW pauses. They reduced IHOP from 45 to 30 and increased heap from 4GB to 8GB. Pauses dropped to <100ms. They also fine-tuned G1HeapRegionSize from 2MB to 16MB, further reducing mixed GC pauses by 40%. The key was profiling allocation rates first.

gc_tuning.sh · BASH
# Enable GC logging (unified logging, JDK 9+)
-Xlog:gc*:file=gc.log:time,utctime,level,tags

# Tuning flags for a 16GB heap with tight pause targets
-Xms16g -Xmx16g
-XX:+UseG1GC
-XX:MaxGCPauseMillis=100
-XX:InitiatingHeapOccupancyPercent=35
-XX:+UnlockExperimentalVMOptions
-XX:G1NewSizePercent=5
-XX:G1HeapRegionSize=16m

# Monitor GC pause times from JFR
jcmd <pid> JFR.start name=gcrecord duration=120s filename=gc.jfr settings=profile
The 10% Rule
If your application spends less than 10% of its total run time in GC pauses, tuning GC will have negligible impact. Focus on reducing allocation rates instead.
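A rough way to check where you stand against that 10% line from inside the JVM is to compare accumulated collection time with uptime using the standard management beans. The class name below is hypothetical, and the figure is an approximation: for collectors that do concurrent work, the reported collection time can include time that was not a stop-the-world pause, so GC logs or JFR remain the authoritative source.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: approximate share of uptime spent in garbage collection.
final class GcOverhead {

    static double percent(long gcMillis, long uptimeMillis) {
        return uptimeMillis <= 0 ? 0.0 : 100.0 * gcMillis / uptimeMillis;
    }

    // Sums the accumulated collection time reported by every collector bean.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC overhead: %.2f%% of uptime%n", percent(totalGcMillis(), uptime));
    }
}
```

Sampling this periodically and alerting on the delta between samples is more useful than the lifetime average, which hides recent degradation.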
Production Insight
Default G1 settings work for 90% of apps — don't tune unless profiling shows GC as a top 3 hot spot.
Concurrent mode failure is the production killer; always monitor GCCause in logs.
ZGC and Shenandoah reduce pauses but steal CPU cycles — profile the trade-off before switching.
G1HeapRegionSize tuning is underused: it can reduce mixed GC pauses by 40% on large heaps, so test 16MB or 32MB.
Allocation rate >500MB/s requires code fixes, not just GC flags.
Key Takeaway
GC tuning is the last resort after CPU and allocation profiling.
Tune only the three knobs: heap size, pause time goal, and IHOP.
Always validate changes under production load, not synthetic benchmarks.
If you're allocating faster than GC can collect, fix the allocation rate in code before touching GC flags.
Monitor promotion failures and survivor space sizing; they cause hidden full GCs.
GC Tuning Decision Flow
If: GC pauses exceed latency SLO
Use: Check heap size first; if already large, reduce MaxGCPauseMillis or switch to ZGC.
If: Concurrent mode failure observed
Use: Reduce InitiatingHeapOccupancyPercent (try 30% to 35%).
If: Allocation rate >500MB/s with high GC throughput
Use: Reduce allocation rate before tuning GC: profile call sites and pool objects.

Production-Safe Profiling: Do's and Don'ts

Profiling in production requires caution. The wrong tool or command can pause your JVM for seconds or even minutes. Here's what works safely:

Do use: jcmd for JFR commands (no long JVM pause), async-profiler with perf_events (zero overhead when not sampling), and jstack for thread dumps (it brings all threads to a brief safepoint, which is acceptable).

Don't use: jmap -histo casually. Without :live it does not trigger a GC and counts garbage along with live objects, which is misleading; with :live it forces a full GC, which is a stop-the-world pause. jhat was removed in JDK 9, and jmap -clstats can slow down the JVM. Avoid attaching old JVMTI agents (like HPjmeter) that require the JVM to be started with -agentpath.

Rule of thumb: If a profiling command requires you to add JVM flags and restart, test it on staging first. If it attaches to a running process and claims <5% overhead, it's likely safe for production. Always start with a 10-second sample on a single instance, verify that latency and throughput on that instance stay flat, then expand to longer durations.

Never profile every instance in a cluster simultaneously — the aggregate overhead can saturate host resources and cause a cascading failure.

Real production horror: An engineer attached async-profiler to all 20 instances of a payment service simultaneously. The CPU overhead from perf_events caused cascading timeouts. The team had to kill the profiler and restart half the cluster. Rule: one instance at a time.

I once saw an engineer attach async-profiler to a JVM that had 95% heap usage. The profiler triggered additional memory allocation and the process OOM'd within 30 seconds. Rule: never profile a JVM that's over 80% heap. The one command that's safe in any state is jstack. Even when heap is 99% full, jstack still works.

Another thing nobody tells you: JVM TI agents (including async-profiler) can cause transient performance degradation during attachment and detachment. The JVM needs to safepoint all threads to load the agent. On a 64GB heap with 200 threads, that safepoint can take 200-500ms. Schedule your profiling sessions during maintenance windows or off-peak hours.

Also watch out: async-profiler attaches through jattach, which uses HotSpot's dynamic attach mechanism: it creates files and connects to a Unix domain socket under the JVM's temporary directory. In some container environments (like those with a read-only root filesystem), the attach may fail. In that case, load the profiler at JVM startup with -agentpath or fall back to JFR.

Never profile a JVM over 80% heap — you'll push it into OOM. jstack is the only safe command in critical state.

One more safe practice: delay sampling until the profiler is fully attached, so the initial safepoint overhead is not captured as part of the profile. For example: ./profiler.sh -e cpu -d 30 --sync -f flame.html <pid>. async-profiler's options differ between releases, so check the help output of your version for this flag before relying on it.

Critical: Know When to Walk Away
If your JVM is already in a critical state (heap near OOM, threads deadlocked, CPU pinned), do NOT attach any profiling tool. The extra allocation and attach overhead can push it over the edge. Instead, take thread dumps (jstack) and, if you can tolerate the pause (a live dump forces a full GC and stops the JVM while it writes), a heap dump (jmap -dump:live), then analyze offline.
Production Insight
When the JVM is above 80% heap usage, the safest profiling is no profiling at all.
Always have a rollback plan: know how to detach a profiler quickly.
Monitor host CPU after attaching async-profiler — it can cause additional load.
Jattach may fail in read-only container filesystems — test your attach method in staging.
Use the --sync flag to avoid capturing safepoint overhead in your profile (check that your async-profiler version supports it).
On a 64GB heap with 200 threads, agent attachment safepoint can take 200-500ms; plan accordingly.
Key Takeaway
Profiling in production is safe — with the right tools and caution.
Start small, test one instance, never profile the whole cluster at once.
When in doubt, fall back to safe commands: jcmd, jstack, and live heap dumps.
If the JVM is above 80% heap, don't profile — just dump and run.
Know your attach method — jattach may fail in containers.
Safe vs Unsafe Profiling Actions
If: JVM health is critical (>80% heap)
Use: Only jstack and jmap -dump:live (no async-profiler).
If: JVM is stable but you need a quick flame graph
Use: Attach async-profiler for 15 seconds on a canary instance.
If: You need continuous monitoring with low risk
Use: Start a JFR recording with the default template (designed for always-on use).
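
The 80% rule above can be checked from inside the JVM before anyone attaches a profiler. The following is a minimal sketch, not part of the original tooling: the class name `HeapGuard` and the three result labels are hypothetical, and it reads the heap of the JVM it runs in (a remote JVM would need JMX). Note that "used" includes garbage that has not been collected yet, so the ratio can overstate live data.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper: applies the 80% heap rule of thumb before profiling. */
final class HeapGuard {
    /** Threshold taken from the rule of thumb in the text. */
    static final double MAX_SAFE_HEAP_RATIO = 0.80;

    /** Returns used/max as a ratio, or -1 if the JVM reports no maximum. */
    static double heapRatio(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0; // MemoryUsage.getMax() is -1 when the max is undefined
        }
        return (double) usedBytes / maxBytes;
    }

    /** Maps heap occupancy to one of three illustrative recommendations. */
    static String recommend(long usedBytes, long maxBytes) {
        double ratio = heapRatio(usedBytes, maxBytes);
        if (ratio < 0) {
            return "UNKNOWN";
        }
        return ratio > MAX_SAFE_HEAP_RATIO ? "DUMP_ONLY" : "PROFILE_OK";
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.printf("heap used=%d max=%d -> %s%n",
                heap.getUsed(), heap.getMax(),
                recommend(heap.getUsed(), heap.getMax()));
    }
}
```

A check like this fits naturally into a health endpoint or a pre-attach script, so the decision does not depend on someone eyeballing a dashboard during an incident.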

Profiling in Containers: Docker and Kubernetes Pitfalls

Containerized Java apps introduce new profiling complications that can lead to false data or no data at all. The core issue: perf_events (used by async-profiler's CPU mode) are restricted inside containers unless the container runs with elevated privileges or specific sysctl settings.

In Docker, you need --cap-add=SYS_ADMIN or --security-opt seccomp=unconfined to allow async-profiler CPU profiling. Without that, the profiler will fail with "No access to perf events." The safer workaround is to use async-profiler's -e itimer mode, which uses a timer-based approach instead of perf_events: slightly less precise, but it works in containers without extra privileges.

In Kubernetes, the situation is trickier. Even with perf_event_open allowed, container CPU limits via CFS can cause the profiler to see inflated CPU percentages because the kernel throttles the container. Your flame graph might show a wide __schedule frame — that's the throttling, not your code. Always correlate with container CPU usage metrics from cgroups.

JFR works reliably in containers because it doesn't depend on perf_events. However, JFR recordings from inside a container reflect only the container's view of CPU and memory. If you have a shared node, the JFR data won't show other containers' resource contention. Use kubectl top or node-level Prometheus metrics to get the full picture.

Production story: A team saw a recurring "CPU spike every 5 minutes" in their Kubernetes dashboard. async-profiler showed sun.rmi.transport.tcp.TCPTransport.handleMessages at the top — it was JMX RMI heartbeat threads battling with CFS throttling. They switched to a non-blocking JMX connector and the spikes disappeared.

If async-profiler CPU mode fails in your container, the first thing to check is the container's capabilities. If you can't add SYS_ADMIN, switch to -e itimer or use JFR.

Here's the container profiling trap I keep seeing: teams deploy JFR but never look at the recordings because they don't have JDK Mission Control in their workflow. Set up automated JFR dump collection. Have a cron job copy the last hour of JFR data to object storage. When an incident happens, you have the evidence waiting for you.

Another nuance: some Kubernetes platforms (OpenShift, GKE sandbox) block perf_event_open entirely. In that case, JFR is your only option. Always test your profiling toolchain on your specific container platform before production emergencies.

Use cat /sys/fs/cgroup/cpu/cpu.stat to see throttled time (on cgroup v2 hosts the file is /sys/fs/cgroup/cpu.stat). If nr_throttled > 0, CFS is impacting your CPU profile.

A common mistake: assuming that a container with 2 CPU cores has full access to both. If the CPU limit is set, CFS throttles the container when it exceeds its quota. This shows up as __schedule in flame graphs. The fix is to either increase CPU limits or reduce the service's CPU demand (GC threads included) to stay under the throttling threshold.
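
To turn the cpu.stat check into a number you can compare across runs, the sketch below parses the file and reports the share of CFS periods in which the container was throttled. It is an illustration under assumptions: the class name `CpuThrottleCheck` is hypothetical, and it relies only on the `nr_periods` and `nr_throttled` keys, trying the cgroup v1 path first and the cgroup v2 path second.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: reads CFS throttling counters from cpu.stat. */
final class CpuThrottleCheck {
    /** Parses "key value" lines into a map, skipping anything malformed. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> stats = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                try {
                    stats.put(parts[0], Long.parseLong(parts[1]));
                } catch (NumberFormatException ignored) {
                    // skip lines whose value is not a plain integer
                }
            }
        }
        return stats;
    }

    /** Fraction of CFS periods in which the container was throttled. */
    static double throttledRatio(Map<String, Long> stats) {
        long periods = stats.getOrDefault("nr_periods", 0L);
        long throttled = stats.getOrDefault("nr_throttled", 0L);
        return periods == 0 ? 0.0 : (double) throttled / periods;
    }

    public static void main(String[] args) throws IOException {
        Path v1 = Path.of("/sys/fs/cgroup/cpu/cpu.stat"); // cgroup v1 layout
        Path v2 = Path.of("/sys/fs/cgroup/cpu.stat");     // cgroup v2 layout
        Path file = Files.exists(v1) ? v1 : v2;
        if (!Files.exists(file)) {
            System.out.println("no cpu.stat found (not running under a CPU cgroup?)");
            return;
        }
        double ratio = throttledRatio(parse(Files.readAllLines(file)));
        System.out.printf("throttled in %.1f%% of CFS periods%n", ratio * 100);
    }
}
```

The counters are cumulative since the cgroup was created, so sampling them before and after a profiling session and comparing the difference gives a fairer picture than a single reading.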


container_profiling.sh · BASH
# Docker: run with perf_events capability
docker run --cap-add=SYS_ADMIN -p 8080:8080 my-java-app

# Or use itimer mode (no special permissions needed)
./profiler.sh -e itimer -d 30 -f flame.html <pid>

# Kubernetes: enable perf_event_open (if container runtime allows)
# Add to container securityContext:
# securityContext:
#   capabilities:
#     add: ["SYS_ADMIN"]

# JFR always works — no special container setup needed
jcmd <pid> JFR.start name=container_profile duration=60s filename=contained.jfr

# Check CPU throttling status
cat /sys/fs/cgroup/cpu/cpu.stat
Container Profiling Layers
  • JFR: works inside any container, but doesn't see other pods' resource usage.
  • async-profiler perf_events: needs extra privileges; reflects container-limited CPU time.
  • async-profiler itimer: works without privileges; timer-based, slightly less accurate but safe.
  • Node-level tools (perf top, /sys/fs/cgroup): give the host-level perspective you need for cross-container contention.
Production Insight
async-profiler CPU mode can fail in restricted containers; test it on your platform and keep -e itimer as the fallback.
Kubernetes CPU throttling can inflate CPU profiles; cross-check with kubectl top pod.
JFR is container-friendly but captures only the container's perspective — not host contention.
Check cat /sys/fs/cgroup/cpu/cpu.stat for throttled time.
Automate JFR dump collection to object storage — you'll need it during incidents.
Key Takeaway
Always verify profiler permissions inside containers before production emergencies.
Use JFR for safe container profiling; fall back to async-profiler's itimer mode if perf_events is blocked.
Never trust a container CPU flame graph without checking the container's CPU throttling metrics.
When in doubt, start with JFR — it always works in containers.
Automate JFR storage: cron job dump to object storage for post-incident analysis.
Container Profiling Route
If: async-profiler fails with 'No access to perf events'
Use: -e itimer, or switch to JFR.
If: Container CPU throttling suspected
Use: Cross-check with kubectl top pod and node-level metrics.
If: You need a host-level view of CPU/memory contention
Use: Node-level perf top or cgroup stats on the host.

Advanced Heap Dump Analysis with OQL

Eclipse MAT's Leak Suspects report is great for a first pass, but sometimes you need surgical precision. OQL (Object Query Language) is a SQL-like query language for heap dumps that lets you find specific objects, count instances, explore references, and even compute retained sizes programmatically.

OQL is available in Eclipse MAT. (jhat also had an OQL dialect, but jhat was removed in JDK 9.) In MAT, open the heap dump, then click the 'OQL' tab. Common queries:

  • SELECT * FROM java.util.HashMap: lists all HashMap instances (useful for caching leaks).
  • SELECT toString(o), o.@retainedHeapSize FROM java.lang.String o WHERE o.@retainedHeapSize > 100000: finds large strings that might be eating memory. Use the retained size here, because a String's shallow size (@usedHeapSize) excludes its backing array.
  • SELECT * FROM io.thecodeforge.service.MyService s WHERE s.cache.@retainedHeapSize > 500000000: checks if a specific service's cache retains more than 500MB.
  • SELECT DISTINCT OBJECTS classof(r) FROM OBJECTS ( SELECT OBJECTS inbounds(t) FROM INSTANCEOF java.lang.Thread t ) r: lists the classes of objects that hold references to threads (great for thread leak detection).

Production scenario: A team noticed heap growing slowly but the Leak Suspects report gave vague results. They ran an OQL query to find all objects of a specific logger class that had accumulated millions of entries due to a missing TTL. The OQL showed the exact count and the chain of references retaining them, leading to the fix within minutes.

OQL's INSTANCEOF keyword also matches subclasses: SELECT OBJECTS a FROM INSTANCEOF java.lang.ref.Finalizer a lists the Finalizer references that track finalizable objects, a notorious source of delayed memory leaks.

Important: OQL queries can be slow on large dumps. Always filter with WHERE clauses and avoid unconstrained SELECT * on huge classes.

You can also use OQL to read retained sizes directly instead of navigating the dominator tree by hand. Note the two pseudo-fields: @usedHeapSize is the shallow size of the object itself, while @retainedHeapSize is everything that would be freed along with it. For leak hunting, @retainedHeapSize is the one you want.

Here's an OQL trick that's saved me hours: SELECT * FROM INSTANCEOF java.util.HashMap$Node e WHERE e.@retainedHeapSize > 1000000. This finds individual map entries with large retained heaps, the exact objects you need to evict. Query the concrete entry class rather than java.util.Map$Entry: HPROF dumps don't record which interfaces a class implements, so an INSTANCEOF query on the interface typically matches nothing. Most leak investigations should start here, not with the Leak Suspects report.

One more advanced use: MAT accepts a quoted regular expression in place of a class name. For example, to find objects whose class name ends in Cache: SELECT * FROM ".*Cache". This catches multiple cache implementations without spelling each one out.

Start your leak investigation with SELECT * FROM INSTANCEOF java.util.HashMap$Node e WHERE e.@retainedHeapSize > 1000000: it finds the exact entries to evict.

MAT's OQL has no GROUP BY or aggregate functions, so for a ranked list of classes by heap, use the Histogram view, sort by retained heap, and then drill into the top classes with OQL.

A trick: Use SELECT * FROM java.lang.ThreadLocal$ThreadLocalMap$Entry e WHERE e.@retainedHeapSize > 1000000 to find thread-local values that hold large objects. The values live in each thread's ThreadLocalMap, not in the ThreadLocal key itself.

heap_oql_queries.txt · SQL
// Find all HashMap instances with their entry count and retained size
SELECT map.size, map.@retainedHeapSize FROM java.util.HashMap map

// Find strings retaining more than 100KB (shallow size excludes the backing array)
SELECT toString(s), s.@retainedHeapSize FROM java.lang.String s WHERE s.@retainedHeapSize > 100000

// Find all instances of a specific class and see which objects reference them
SELECT OBJECTS inbounds(c) FROM io.thecodeforge.cache.SessionCache c

// List all Finalizer references (potential leak via finalize())
SELECT * FROM INSTANCEOF java.lang.ref.Finalizer

// MAT's OQL has no GROUP BY or aggregate functions.
// For a per-class ranking, use the Histogram view and sort by retained heap.

// Find large HashMap entries (start here for leak investigation)
SELECT * FROM INSTANCEOF java.util.HashMap$Node e WHERE e.@retainedHeapSize > 1000000
OQL for Thread Leak Detection
Run SELECT toString(t.name), t.@retainedHeapSize FROM INSTANCEOF java.lang.Thread t WHERE t.@retainedHeapSize > 1000000 to find thread objects retaining significant heap, a sign of thread-local accumulation.
Production Insight
OQL can pinpoint leaks that automated reports miss, especially when the leak is spread across many small objects.
Always add WHERE conditions to limit scope; unconstrained queries can take minutes on large dumps.
Use @retainedHeapSize to get the retained size directly, avoiding manual dominator tree navigation (@usedHeapSize is only the shallow size).
Start with map-entry queries (HashMap$Node) to find the largest individual entries.
For a ranking of top-consuming classes, use the Histogram view; MAT's OQL has no GROUP BY.
Always combine heap analysis with allocation profiling.
Key Takeaway
OQL is your scalpel for heap analysis when automated reports aren't enough.
Learn 5 essential queries: find instances, compute retained size, trace references, list finalizers, and match classes by pattern.
Always filter large results: unconstrained queries are slow and produce noise.
The @retainedHeapSize pseudo-field is faster than navigating the dominator tree manually.
Pair OQL with the Histogram view to get a ranked list of top memory consumers.
When to Use OQL vs MAT Leak Suspects
If: Leak Suspects report is ambiguous or shows many suspects
Use: Run OQL to find large objects of known suspect classes (e.g., HashMap, String).
If: You suspect a thread leak (many threads or thread locals)
Use: Query for Thread instances with large retained heap.
If: You want the retained size of a specific service or component
Use: Query that class's instances with @retainedHeapSize.

Performance Tuning Workflow: A Step-by-Step Production Example

Now let's put everything together with a real workflow. Suppose you have a payment processing service that's showing 99th percentile latency of 2 seconds during peak hours. Here's the exact sequence:

  1. Start JFR recording on one canary instance. Use the profile template for 5 minutes. This captures GC events, allocation rates, thread CPU, and lock contention.
  2. Load the JFR dump in JDK Mission Control. Go to the 'GC Pauses' view. If you see pauses >100ms, you have a GC problem. If the allocation rate is >500MB/s, you have an allocation problem.
  3. If GC is not the dominant issue, attach async-profiler for a wall-clock sample. Look for wide frames at the top. If you see java.net.SocketInputStream.socketRead0 wide, the service is waiting on network I/O.
  4. If allocation is high, run ./profiler.sh -e alloc -d 30 -f alloc.html <pid>. The allocation flame graph will show which call sites create the most objects.
  5. After identifying the hot spot, implement the fix (e.g., cache, pool, reduce object creation).
  6. Redeploy the canary and repeat steps 1-2 with the same JFR settings. Compare the new recording with the baseline.

This loop — profile, diagnose, fix, verify — is the only reliable way to tune performance. Guessing leads to wasted sprints.

Here's a Java example of a method that's a common allocation hotspot and its fix:

```java
// Before: builds a new intermediate String on every call
public static String before(String prefix, int id) {
    return prefix + ":" + id;
}

// After: String.format is not the fix; it creates even more objects.
// For a high-frequency path, pass the parts directly and append into
// a buffer the caller already owns, avoiding the intermediate String.
public static void after(StringBuilder out, String prefix, int id) {
    out.append(prefix).append(':').append(id);
}
```

Remember: always measure before and after. A change that looks smart on paper may not move the needle.

In practice, 80% of improvements come from the first two hot spots. Once you've addressed those, diminishing returns set in fast. Stop when you meet SLO — don't over-optimise.

One more production lesson: never trust a microbenchmark. The way your code runs in a JMH harness is completely different from how it runs under real load with GC, JIT warmup, and memory pressure. Always validate optimizations in production with the full profiling pipeline. If a change shows no improvement in the JFR comparison, revert it. Code complexity without performance gain is a net negative.

A story: One team spent a week optimizing a method that accounted for 5% of CPU; they missed the real bottleneck in JDBC pooling. Profiling first would have saved them days. Another team over-optimized by adding complex caching that caused memory pressure. They had to revert. The lesson: stop when SLO is met.

io/thecodeforge/performance/AllocationFix.java · JAVA
package io.thecodeforge.performance;

public class AllocationFix {
    // Before: builds a new intermediate String on every call.
    // (javac emits StringBuilder code up to JDK 8 and an invokedynamic
    // call to StringConcatFactory from JDK 9; both allocate the result.)
    public static String before(String prefix, int id) {
        return prefix + ":" + id;
    }

    // After: pass the parts directly and append into a buffer the caller
    // already owns, so no intermediate String is created.
    public static void after(StringBuilder out, String prefix, int id) {
        out.append(prefix).append(':').append(id);
    }

    // Tempting but usually worse: String.format creates a Formatter and
    // parses the template on every call. Always profile before choosing it.
    private static final String TEMPLATE = "%s:%d";
    public static String withFormat(String prefix, int id) {
        return String.format(TEMPLATE, prefix, id);
    }
}
The Profiling Loop
  • Step 1: Start with JFR (default or profile) for 5 minutes on a canary.
  • Step 2: Analyze the recording — find the top 3 hot spots by CPU, allocation, or latency.
  • Step 3: Implement one change per iteration; never batch multiple optimizations.
  • Step 4: Redeploy the canary and profile again with the same settings.
  • Step 5: Compare before/after recordings. Did the hot spot shrink?
Production Insight
Never make more than one performance change per deployment cycle.
Always run a before-and-after profiling session with identical settings.
Most performance gains (80%) come from fixing the top 2 hot spots.
The rest is diminishing returns — stop when you hit acceptable latency.
Don't over-optimise: once SLO is met, move on.
Never trust microbenchmarks — validate in production with real load.
Stop optimizing when SLO is met; over-optimization adds complexity without benefit.
Key Takeaway
Performance tuning is a three-step loop: profile, fix, verify.
Never batch changes — you won't know what worked.
80% of improvement comes from the first two hot spots.
When latency meets SLO, stop — don't over-optimize.
Validate every change in production with before/after JFR recordings.
Tuning Iteration Decision
If: After the fix, the hot spot moved elsewhere
Use: The bottleneck shifted; fix the new hot spot.
If: After the fix, latency improved but is still above SLO
Use: Continue the loop; fix the next hot spot.
If: After the fix, latency is within SLO
Use: Stop. Document the change and move on.
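
The stop/continue decision above can be made mechanical once you have latency samples from the before and after recordings. This is a minimal sketch under assumptions: the class name `SloCheck` and the three outcome labels are hypothetical, latencies are plain millisecond samples, and the 99th percentile uses the nearest-rank method. The "REVERT" outcome follows the earlier advice to revert changes that show no improvement.

```java
import java.util.Arrays;

/** Hypothetical helper: compares before/after p99 latency against an SLO. */
final class SloCheck {
    /** Nearest-rank percentile: smallest sample with at least percent% of samples at or below it. */
    static long percentile(long[] samplesMs, int percent) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percent < 1 || percent > 100) {
            throw new IllegalArgumentException("percent must be in 1..100");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (percent * sorted.length + 99) / 100; // integer ceil(percent/100 * n)
        return sorted[rank - 1];
    }

    /** STOP when the SLO is met, CONTINUE when p99 improved, REVERT otherwise. */
    static String decide(long[] beforeMs, long[] afterMs, long sloMs) {
        long before = percentile(beforeMs, 99);
        long after = percentile(afterMs, 99);
        if (after <= sloMs) {
            return "STOP";
        }
        return after < before ? "CONTINUE" : "REVERT";
    }

    public static void main(String[] args) {
        long[] before = {12, 15, 18, 2000};
        long[] after = {12, 14, 16, 180};
        System.out.println(decide(before, after, 200));
    }
}
```

Comparing the same percentile with the same sample source on both sides matters more than the exact percentile method; mixing sources makes the comparison meaningless.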

G1 GC Concurrent Mode Failure: Diagnosis and Tuning

Concurrent mode failure is G1 GC's worst-case scenario. (The name is borrowed from the older CMS collector; G1's own logs report it as 'To-space exhausted' followed by a full GC.) It happens when the concurrent marking phase cannot finish before the old generation fills up. The JVM then falls back to a stop-the-world (STW) full GC, which compacts the entire heap and can pause application threads for seconds to tens of seconds.

The root cause is usually an allocation rate that exceeds the concurrent marker's throughput. G1's concurrent marking is designed to run in the background while the application continues. It starts when heap occupancy crosses the threshold set by -XX:InitiatingHeapOccupancyPercent (IHOP, default 45%). If the application then fills the rest of the heap before the cycle completes, concurrent mode failure occurs. Note that since JDK 9 G1 adjusts this threshold at runtime by default (-XX:+G1UseAdaptiveIHOP), so the flag only sets the starting value unless you disable adaptive IHOP.

Diagnosis
  • GC logs will show a full GC, e.g. 'Pause Full (Allocation Failure)' or 'Pause Full (G1 Compaction Pause)', often preceded by 'To-space exhausted'. The literal string 'Concurrent Mode Failure' is CMS wording and does not appear in G1 logs.
  • JFR recordings will show a long pause event of type 'G1 Pause Full' with duration in seconds.
  • The allocation rate in the recording will be high, often >500MB/s.

Tuning knobs
  1. Reduce IHOP: Lower -XX:InitiatingHeapOccupancyPercent to start concurrent marking earlier. Values between 30% and 40% are common. This gives the marker more time to finish before the heap fills.
  2. Increase heap size: More heap means more runway before saturation. If you have 4GB heap and allocate 500MB/s, the heap fills in ~8 seconds. With 8GB, you get ~16 seconds.
  3. Increase heap region size (-XX:G1HeapRegionSize): Larger regions reduce the marking bitmap overhead and can speed up concurrent marking. For heaps >16GB, try 16MB or 32MB.
  4. Increase the number of concurrent marking threads (-XX:ConcGCThreads): Default is often (ParallelGCThreads + 2) / 4. You can increase it, but be aware it steals CPU from application threads.

Production example: In the incident described earlier, the service had 4GB heap with default IHOP=45 and allocation rate ~300MB/s. The concurrent marking took ~8 seconds, but the heap filled in ~10 seconds. That 2-second gap was too tight. By reducing IHOP to 30, the concurrent cycle started earlier, and by increasing heap to 8GB, the filling time doubled to ~20 seconds, giving the marker plenty of room.
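
The arithmetic behind that example can be sketched as a back-of-the-envelope headroom check. This is a deliberately simplified model, not how G1 decides anything internally: it assumes the whole allocation rate is net heap growth (ignoring what young collections reclaim) and that marking time stays constant when the heap grows. The class name `G1Headroom` and its method names are hypothetical.

```java
/** Hypothetical helper: rough headroom estimate for a G1 concurrent cycle. */
final class G1Headroom {
    /** Seconds until the heap is full, counted from the moment occupancy hits IHOP. */
    static double headroomSeconds(double heapMb, int ihopPercent, double fillRateMbPerSec) {
        double freeAtTriggerMb = heapMb * (100 - ihopPercent) / 100.0;
        return freeAtTriggerMb / fillRateMbPerSec;
    }

    /** True when marking is expected to finish before the heap fills up. */
    static boolean markingFinishesInTime(double heapMb, int ihopPercent,
                                         double fillRateMbPerSec, double markingSeconds) {
        return headroomSeconds(heapMb, ihopPercent, fillRateMbPerSec) > markingSeconds;
    }

    public static void main(String[] args) {
        // Figures from the incident: 300MB/s allocation, ~8s marking time.
        System.out.printf("4GB, IHOP=45: %.1fs headroom%n", headroomSeconds(4096, 45, 300));
        System.out.printf("8GB, IHOP=30: %.1fs headroom%n", headroomSeconds(8192, 30, 300));
    }
}
```

Under this model the original configuration leaves less headroom than the marking cycle needs, while the tuned one leaves more than double. Treat the output as a sanity check on the direction of a change, not as a prediction.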

Prevention
  • Always monitor allocation rates along with GC logs. A sudden increase in allocation rate is an early warning.
  • Set up alerts on full GC events ('Pause Full') in GC logs or JFR events.
  • Use JFR's streaming API to trigger an alert if the allocation rate exceeds a threshold (e.g., 400MB/s) for more than 10 seconds.
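
The alerting rule in the last bullet (rate above a threshold for more than 10 seconds) boils down to a small piece of state that is independent of where the samples come from. The sketch below shows only that logic; the class name `AllocationRateAlert` is hypothetical, and wiring it to JFR event streaming (for example, a handler registered on `jdk.jfr.consumer.RecordingStream`, available from JDK 14) is left as an assumption, since the right event and sampling period depend on your JDK version.

```java
/** Hypothetical helper: fires when the allocation rate stays above a threshold for too long. */
final class AllocationRateAlert {
    private final double thresholdMbPerSec;
    private final long sustainMillis;
    private long breachStartMillis = -1; // -1 means "currently at or below the threshold"

    AllocationRateAlert(double thresholdMbPerSec, long sustainMillis) {
        this.thresholdMbPerSec = thresholdMbPerSec;
        this.sustainMillis = sustainMillis;
    }

    /**
     * Feeds one rate sample. Returns true once the rate has stayed above the
     * threshold for longer than sustainMillis without dropping back.
     */
    boolean onSample(long timestampMillis, double rateMbPerSec) {
        if (rateMbPerSec <= thresholdMbPerSec) {
            breachStartMillis = -1; // any sample at or below the threshold resets the window
            return false;
        }
        if (breachStartMillis < 0) {
            breachStartMillis = timestampMillis;
        }
        return timestampMillis - breachStartMillis > sustainMillis;
    }

    public static void main(String[] args) {
        AllocationRateAlert alert = new AllocationRateAlert(400.0, 10_000L);
        double[] ratesMbPerSec = {450, 480, 500, 520, 510, 490, 470, 460, 455, 450, 450, 450};
        for (int second = 0; second < ratesMbPerSec.length; second++) {
            boolean fire = alert.onSample(second * 1000L, ratesMbPerSec[second]);
            System.out.printf("t=%ds rate=%.0fMB/s alert=%b%n", second, ratesMbPerSec[second], fire);
        }
    }
}
```

Resetting on a single sample below the threshold keeps the rule strict; if your rate signal is noisy, a rolling average in front of `onSample` is a reasonable variation.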

Trade-off: Reducing IHOP means G1 spends more CPU on concurrent marking, which can reduce application throughput by 5-10% during marking. For most services, this is an acceptable trade-off to avoid STW pauses.

Remember: if you're seeing concurrent mode failure, the first thing to check is allocation rate. If it's >500MB/s, optimise allocation before tuning GC. The tuning buys time, but reducing allocation rate is the permanent fix.

g1_concurrent_tuning.sh · BASH
# Check GC logs for the full-GC fallback (G1 does not print 'Concurrent Mode Failure')
grep -E 'To-space exhausted|Pause Full' gc.log

# JFR recording to capture GC events
jcmd <pid> JFR.start name=g1profile duration=120s filename=g1_profile.jfr settings=profile

# Recommended JVM flags after diagnosing CMF (add these to the java command line)
-XX:+UseG1GC
-XX:InitiatingHeapOccupancyPercent=30
-Xms8g -Xmx8g
-XX:G1HeapRegionSize=16m
-XX:+UnlockExperimentalVMOptions
-XX:G1NewSizePercent=5
-XX:MaxGCPauseMillis=100

# Inspect allocation events in the finished recording (the jfr tool ships with the JDK)
jfr print --events 'jdk.ObjectAllocation*' g1_profile.jfr
Allocation Rate First
Before tuning IHOP or heap size, check the allocation rate using JFR or async-profiler alloc mode. If your allocation rate exceeds 500MB/s, no amount of GC tuning will eliminate concurrent mode failure permanently — you must reduce allocation at the source.
Production Insight
Concurrent mode failure is always a symptom of an allocation rate that outruns the concurrent marker.
Reducing IHOP to 30 gives the marker a head start but costs 5-10% CPU during marking.
Increasing heap size is a band-aid; the real fix is reducing allocation rate.
Monitor allocation rate trends: a gradual increase over weeks indicates a leak, not just a tuning issue.
Test IHOP changes under peak load; a setting that works at 5k req/min may fail at 10k.
Concurrent mode failure is preventable with proactive allocation rate monitoring.
Key Takeaway
Concurrent mode failure means G1 can't keep up — reduce IHOP and/or increase heap.
Allocation rate is the root cause; fix it before relying on tuning.
Monitor GC logs and allocation rate together to catch CMF early.
Prevention: set up JFR streaming alerts on allocation rate spikes.
Reducing IHOP is a trade-off: lower pause risk for higher CPU usage during marking.
Concurrent Mode Failure Action Plan
If: Allocation rate < 500MB/s and heap not near max
Use: Reduce IHOP to 30 and increase heap by 50%.
If: Allocation rate > 500MB/s
Use: Profile allocation sources first; fix code before tuning GC.
If: After tuning, still seeing CMF
Use: Consider switching to ZGC or Shenandoah, whose pauses do not grow with heap size.
● Production incident · POST-MORTEM · severity: high

The 2 AM Latency Spike: G1 Concurrent Mode Failure

Symptom
Every 30 minutes during high traffic (8k req/min), response times jumped from 15ms to 3 seconds, then the service returned 503 for about 20 seconds before recovering. No errors in app logs, but GC logs showed 'Concurrent Mode Failure'.
Assumption
The team assumed a traffic spike was overwhelming the thread pool. They doubled the instance count — no improvement. The pattern persisted.
Root cause
G1 GC was configured with default -XX:InitiatingHeapOccupancyPercent=45. Under high allocation rates, the concurrent marking phase couldn't keep up, triggering a full STW (stop-the-world) compaction. The 20-second pause matched the errors exactly.
Fix
Reduced -XX:InitiatingHeapOccupancyPercent to 30 to start concurrent marking earlier, and increased heap from 4GB to 8GB to give G1 more runway. Full STW pauses dropped to zero. Latency returned to 15ms.
Key lesson
  • Never tune GC without first profiling allocation rates and pause frequencies.
  • Monitor GC logs alongside application metrics — they tell the real story.
  • Always test GC changes under production-like load; a config that works at 1k req/min may fail at 10k.
  • Concurrent mode failure is the production killer; know your IHOP setting before you need to change it.
Production debug guide
When your app slows down or starts leaking memory, use this symptom-to-action routing to pick the right profiling tool and command.
8 entries
Symptom · 01
CPU usage high but no obvious hot method in logs
Fix
Attach async-profiler for a CPU profile: ./profiler.sh -e cpu -d 30 -f cpu.html <PID>
Symptom · 02
Heap grows over hours without recovery — suspect leak
Fix
Take a live heap dump: jmap -dump:live,format=b,file=heap.hprof <PID>. Analyze with Eclipse MAT or JProfiler.
Symptom · 03
Intermittent latency spikes with no CPU saturation
Fix
Collect a wall-clock profile: ./profiler.sh -e wall -d 30 -f wall.html <PID>. Look for lock contention or GC pauses.
Symptom · 04
Thread pool reports 'queue full' exceptions
Fix
Capture thread dumps every 5 seconds for 30 seconds: jstack <PID> > threaddump.txt (repeat). Look for threads in BLOCKED or WAITING state.
Symptom · 05
App crashes with OutOfMemoryError after hours of uptime
Fix
Add -XX:+HeapDumpOnOutOfMemoryError to JVM flags. The next crash generates a heap dump at the exact moment of failure.
Symptom · 06
Performance regression after deployment — no visible symptom change
Fix
Compare JFR recordings from before and after deployment using JDK Mission Control's automated analysis. Focus on GC pause times, allocation rates, and lock contention.
Symptom · 07
Application starts slow after deployment
Fix
Capture a JFR recording with startup events: -XX:StartFlightRecording=filename=startup.jfr,settings=profile. Analyze the initialization phase in JDK Mission Control.
Symptom · 08
Response time correlated with GC logs; suspect concurrent mode failure
Fix
Inspect GC logs for 'Pause Full' and 'To-space exhausted' entries. Check the effective InitiatingHeapOccupancyPercent with jcmd <PID> VM.flags and capture a JFR dump that covers the pause.
★ 5-Second Profiling Commands for Production Emergencies
When shit hits the fan, you don't have time to read docs. These commands work on any modern Java 11+ JVM with minimal overhead.
CPU spike, no obvious culprit
Immediate action
Grab a CPU flame graph with async-profiler
Commands
./profiler.sh -e cpu -d 15 -f flame.html <PID>
Open flame.html in browser — look for the widest frames
Fix now
If one method dominates (e.g., regex or serialization), cache or replace it
Heap climbing, likely leak
Immediate action
Trigger a live heap dump
Commands
jcmd <PID> GC.heap_dump /tmp/dump.hprof
Open in Eclipse MAT → Dominator Tree → look for biggest objects
Fix now
Identify the holding root (e.g., static cache, thread-local) and add eviction or weak references
Allocations causing excessive GC
Immediate action
Run allocation profile with async-profiler
Commands
./profiler.sh -e alloc -d 30 -f alloc.html <PID>
Filter by allocation size and look for short-lived objects created in hot loops
Fix now
Inline object creation or pool objects: StringBuilder instead of String concat, Array instead of ArrayList
Service unresponsive, threads stuck
Immediate action
Capture a thread dump stack trace
Commands
jstack <PID> > /tmp/threads_1.txt; sleep 5; jstack <PID> > /tmp/threads_2.txt
Compare two dumps — threads that haven't moved are likely deadlocked
Fix now
Kill the node and fix the lock ordering; add timeout to blocking calls
Performance degraded after container restart
Immediate action
Check if JFR is still recording: a recording started with jcmd does not survive a container restart
Commands
jcmd <PID> JFR.check
If no recording, restart with -XX:StartFlightRecording or attach via jcmd
Fix now
Add always-on JFR startup flags to your Dockerfile or deployment template
GC pauses causing latency spikes
Immediate action
Check GC logs and JFR recording
Commands
jcmd <PID> JFR.dump name=gcrecording filename=gc_pauses.jfr
Open in JDK Mission Control → GC Pauses view
Fix now
If concurrent mode failure, reduce IHOP or increase heap size
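
For the "Service unresponsive, threads stuck" entry above, the same first look can be taken from inside the JVM when running jstack is not an option (for example, from an admin endpoint). This is a minimal sketch using the standard `java.lang.management` API; the class name `ThreadStateSummary` is hypothetical, and it only sees the JVM it runs in.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper: summarizes thread states and reports JVM-detected deadlocks. */
final class ThreadStateSummary {
    /** Counts threads per state; a spike in BLOCKED or WAITING hints at contention. */
    static Map<Thread.State, Integer> countByState(ThreadInfo[] infos) {
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : infos) {
            if (info == null) {
                continue; // defensive: a thread may have exited during the lookup
            }
            counts.merge(info.getThreadState(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println(countByState(threads.dumpAllThreads(false, false)));
        long[] deadlocked = threads.findDeadlockedThreads(); // null when no deadlock is found
        System.out.println("deadlocked threads: " + (deadlocked == null ? 0 : deadlocked.length));
    }
}
```

Unlike comparing two text dumps by eye, `findDeadlockedThreads` reports only true lock cycles, so threads that are merely slow or waiting on I/O will not show up there.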

Common mistakes to avoid

5 patterns

Optimizing without profiling first

Symptom
Weeks spent tuning the wrong code path; no improvement in latency or throughput.
Fix
Always start with JFR or async-profiler to identify the actual hot spots before making any changes.

Using jmap without the :live option

Symptom
Histograms and heap dumps include unreachable garbage, making analysis misleading and slow.
Fix
Use jmap -histo:live for a histogram, or jmap -dump:live / jcmd <pid> GC.heap_dump for a dump that excludes unreachable objects.

Ignoring wall-clock profiling

Symptom
CPU looks fine but latency is high; the bottleneck is lock contention or I/O.
Fix
Use async-profiler's wall-clock mode (-e wall) to capture blocked threads and correlate with thread dumps.

Tuning GC before measuring allocation rate

Symptom
After extensive tuning, GC still causes pauses because allocation rate overwhelms the collector.
Fix
Profile allocation rate with JFR or async-profiler alloc mode first. If rate >500MB/s, fix allocation before tuning GC.

Assuming one heap dump tells the full story

Symptom
Leak Suspects report points at a large object, but it's a stable cache, not a leak.
Fix
Take two heap dumps an hour apart and compare retained heap sizes of suspect classes to calculate leak rate.
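
The leak-rate calculation from two dumps is simple enough to script. The sketch below assumes you have already read the retained size of the suspect class from each dump (for example, from MAT); the class name `LeakRate` and both method names are hypothetical.

```java
/** Hypothetical helper: estimates leak rate from two retained-size measurements. */
final class LeakRate {
    /** Growth in bytes per hour between two measurements of the same class. */
    static double bytesPerHour(long firstBytes, long secondBytes, double hoursApart) {
        if (hoursApart <= 0) {
            throw new IllegalArgumentException("hoursApart must be positive");
        }
        return (secondBytes - firstBytes) / hoursApart;
    }

    /** Hours until the free heap is consumed at that rate; infinity if not growing. */
    static double hoursUntilFull(long freeBytes, double bytesPerHour) {
        return bytesPerHour <= 0 ? Double.POSITIVE_INFINITY : freeBytes / bytesPerHour;
    }

    public static void main(String[] args) {
        long first = 100L << 20;  // 100MB retained in the first dump
        long second = 110L << 20; // 110MB retained one hour later
        double rate = bytesPerHour(first, second, 1.0);
        System.out.printf("leak rate: %.1f MB/hour%n", rate / (1 << 20));
        System.out.printf("hours until 1000MB of free heap is gone: %.0f%n",
                hoursUntilFull(1000L << 20, rate));
    }
}
```

A flat or negative rate between the two dumps is the signal that the "suspect" is a stable cache rather than a leak.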
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Explain the difference between JFR and async-profiler. When would you us...
Q02SENIOR
Your team reports a 20-second latency spike every 30 minutes during peak...
Q03SENIOR
How do you find the root cause of a memory leak in a Java application? D...
Q04SENIOR
What are the key differences between G1, ZGC, and Shenandoah? When would...
Q05SENIOR
What precautions do you take when profiling a production JVM?
Q01 of 05SENIOR

Explain the difference between JFR and async-profiler. When would you use each?

ANSWER
JFR is built into the JDK and records a wide range of events (GC, allocations, JIT, I/O) with less than 1% overhead. It's designed for continuous monitoring: you start it and leave it running. async-profiler is an external agent that samples call stacks (via perf_events in CPU mode) and allocations (via TLAB callbacks) to produce flame graphs. It's better for targeted, ad-hoc investigations because you can attach it to a running process and get a flame graph in seconds. Use JFR for always-on observability and async-profiler when you need a deep dive on a specific symptom. They complement each other: run JFR continuously, attach async-profiler when you see an anomaly in JFR data.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What's the difference between CPU profiling and wall-clock profiling?
02
Can I use JFR and async-profiler simultaneously?
03
How do I know if my GC needs tuning?
04
Why does async-profiler fail in my Kubernetes container?
05
What's the quickest way to find a memory leak in production?