G1 GC Concurrent Mode Failure — Fix 20s STW Pause
At 8k req/min, G1 GC's default IHOP=45 triggered Concurrent Mode Failure, causing 20-second pauses and 503 errors.
- Java profiling measures CPU, memory, thread, and I/O usage in a live application
- JFR (Java Flight Recorder) is built-in, low-overhead event recording for continuous monitoring
- async-profiler combines CPU sampling, allocation profiling, and wall-clock profiling in one agent
- Flame graphs visualize call stacks as rectangles; width = time spent, color = function type
- Biggest mistake: optimizing without profiling first — you'll guess wrong every time
- Performance insight: JFR targets <1% overhead; async-profiler adds ~2% during sampling
Imagine your Java app is a restaurant kitchen. Orders are coming in, but food is taking forever to reach customers. Profiling is like installing cameras and timers on every chef, every station, and every oven — so you can see exactly WHO is slow, WHERE the bottleneck is, and WHY the kitchen is on fire. Without profiling, you're just guessing. With it, you walk straight to the broken fryer.
Most Java performance problems don't announce themselves. They show up as mysterious latency spikes at 2am, a heap that grows 10MB per hour until the app dies, or a thread pool that silently saturates under load while your dashboards look green. These aren't bugs in the traditional sense — they're invisible tax your code pays at scale, and they're the kind of thing that separates senior engineers from everyone else.
Profiling is the discipline of measuring your running application to find exactly where CPU time, memory, threads, and I/O are going — before you optimize anything. The cardinal sin in performance work is optimizing without data. You'll almost always guess wrong, spend a week tuning the wrong method, and make the codebase harder to read for zero gain. Good profiling tools give you a flame graph that says 'this one method accounts for 43% of your CPU' — and suddenly the path forward is obvious.
This guide gets straight to the point: which profiling tool for which symptom, how to read flame graphs and heap dumps without getting lost, the production safeguards that prevent profiling from becoming the cause of your next incident, and the one workflow that makes performance tuning repeatable. Time to get into the details.
Here's the hard truth: most teams waste weeks on performance work because they skip the baseline. You need to know what 'normal' looks like before you can spot abnormal. That means setting up JFR on day one, even if you're not debugging anything. The data you collect when everything is fine is your most valuable asset when things break.
What is Java Profiling and Performance?
Java profiling means measuring a running JVM to see where CPU time, memory, and waiting time actually go. Rather than starting with a dry definition, let's see it in action and understand why it exists.
When you run a real server, profiling gives you the answers to three hard questions: where is the CPU going, where is the memory going, and what is the application waiting on? Without this data, every optimisation is guesswork. The tools we'll cover — JFR, async-profiler, and heap dump analyzers — let you answer those questions without restarting your JVM or modifying your code. They attach to a running process and record what's happening in real time, with overhead so low you can run them in production during business hours.
Here's a concrete trap: a team spent two weeks optimising a database query that appeared slow in their test environment. After profiling the production instance, they discovered the actual bottleneck was thread contention in their object mapper — the query was fine. Profiling first would've saved them a sprint.
Think of profiling as your JVM's black box recorder — you want it running before the incident, not after. The same way you'd never debug a plane crash without the flight data recorder, you shouldn't debug a production slowdown without profiling data. That's why always-on JFR is the first tool you set up.
Here's the thing: profiling isn't just about finding hot spots. It's about building a baseline. If you don't know what normal looks like — normal allocation rate, normal GC pause distribution, normal thread states — you can't recognise abnormal when it hits. Start recording today, even if you're not debugging anything. You'll thank yourself at 3am six months from now.
One more nuance: profiling doesn't replace good logging. It complements it. Logs tell you what happened; profiling tells you why the CPU was pinned. Always correlate profiler data with your application logs and metrics dashboards.
Without profiling, every optimization is a shot in the dark — and your users pay for every miss.
Let's expand the mental model: think of your JVM as a factory floor. CPU profiling is a heat map of machine usage, memory profiling is an inventory of raw materials, and wall-clock profiling is a stopwatch for every process. The manager (you) uses all three to find the bottleneck. A common mistake is focusing only on CPU – memory leaks and lock contention are just as critical. Always collect all three pillars before making a change.
Production story: A team's CPU profile looked clean, but latency was high. They assumed network. Wall-clock profiling revealed lock contention on a shared cache. The fix was a read-write lock, slashing latency by 60%. If they'd only looked at CPU, they'd have missed it entirely.
- CPU profiling shows which methods consume processor cycles — hot spots from intense computation.
- Memory profiling tracks object allocations and heap occupancy — finds leaks and high allocation rates.
- Wall-clock profiling measures actual elapsed time — captures blocking on locks, I/O, and GC pauses.
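To make the difference between the CPU pillar and the wall-clock pillar concrete, here is a minimal sketch that times the same task both ways using the standard ThreadMXBean. The class name CpuVsWall and the blockFor helper are made up for this illustration, and Thread.sleep merely stands in for a thread blocked on a lock or a socket.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

final class CpuVsWall {

    /** Runs the task on the current thread and returns {wallNanos, cpuNanos}. */
    static long[] measure(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Assumption: the platform supports per-thread CPU time (true on mainstream JVMs).
        boolean cpuSupported = threads.isCurrentThreadCpuTimeSupported();
        long cpuStart = cpuSupported ? threads.getCurrentThreadCpuTime() : 0L;
        long wallStart = System.nanoTime();
        task.run();
        long wallNanos = System.nanoTime() - wallStart;
        long cpuNanos = cpuSupported ? threads.getCurrentThreadCpuTime() - cpuStart : 0L;
        return new long[] { wallNanos, cpuNanos };
    }

    /** Stand-in for a thread that is blocked on a lock or on I/O. */
    static void blockFor(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long[] t = measure(() -> blockFor(200));
        System.out.printf("wall=%d ms, cpu=%d ms%n", t[0] / 1_000_000, t[1] / 1_000_000);
    }
}
```

The blocked task burns almost no CPU time while a lot of wall-clock time passes, which is why a CPU-only profile can look clean while latency is high.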
Core Profiling Tools: JFR and async-profiler
Two tools dominate modern Java profiling: Java Flight Recorder (JFR) and async-profiler. JFR is built into the JDK since version 11 (formerly commercial-only). It records fine-grained events — GC pauses, thread allocations, JIT compilations, IO operations — with less than 1% overhead. You start it with jcmd <pid> JFR.start and dump a recording file later. No JVM restart needed.
async-profiler is an open-source agent that uses a combination of perf_events (on Linux) and a custom JVMTI agent to produce CPU and allocation flame graphs. It's the go-to tool for ad-hoc profiling because you can attach it to a running process, collect a 30-second sample, and get an interactive HTML flame graph you can share with the team. Two of its sampling modes matter most here: cpu (samples only running threads) and wall (samples all threads, including those blocked on IO or locks). It also has an allocation mode (-e alloc), covered later.
The choice depends on your use case: JFR for continuous, always-on monitoring (think of it like a JVM black box); async-profiler for targeted investigations when you suspect a specific function or class.
One important nuance: JFR's profile.jfc template samples method stacks at a fixed frequency, while async-profiler uses kernel sampling via perf_events. The two approaches can give different results for very short methods. Always validate hot spots with both tools if the impact is high.
Production trap: async-profiler's CPU mode relies on perf_events, which can conflict with container CPU limits in Kubernetes. If your pod is throttled, you'll see inflated CPU percentages in the flame graph. Always correlate with host-level metrics.
There's a nuance though: if you're in a container without SYS_ADMIN, async-profiler CPU mode won't work. Fall back to -e itimer or JFR. We'll cover container profiling later.
One more thing: don't sleep on JFR's event streaming API. Since JDK 14, you can subscribe to JFR events programmatically — no dump files, no file I/O. You get a live stream of GC pauses, allocation ticks, and lock contention as they happen. It's the best way to build custom monitoring without adding agents.
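As a rough sketch of what that subscription looks like, the snippet below uses jdk.jfr.consumer.RecordingStream (JDK 14+) to watch the jdk.GarbageCollection event and print any collection whose summed pauses exceed a limit. The class name GcPauseWatcher, the 100 ms limit, and the garbage-generating loop are assumptions made for this example, not part of any real service.

```java
import java.time.Duration;
import jdk.jfr.consumer.RecordingStream;

final class GcPauseWatcher {

    /** Pure check kept separate from the stream so it is easy to unit test. */
    static boolean isLongPause(Duration pause, Duration limit) {
        return pause.compareTo(limit) > 0;
    }

    public static void main(String[] args) throws InterruptedException {
        Duration limit = Duration.ofMillis(100); // illustrative threshold
        try (RecordingStream stream = new RecordingStream()) {
            stream.enable("jdk.GarbageCollection");
            stream.onEvent("jdk.GarbageCollection", event -> {
                // sumOfPauses is the total stop-the-world time of this collection.
                Duration pause = event.getDuration("sumOfPauses");
                if (isLongPause(pause, limit)) {
                    System.out.println("Long GC pause: " + pause.toMillis() + " ms");
                }
            });
            stream.startAsync();

            // Demo only: create some garbage so at least one GC event is emitted.
            for (int i = 0; i < 50; i++) {
                byte[] junk = new byte[8 * 1024 * 1024];
            }
            System.gc();
            Thread.sleep(2_000); // give the stream time to deliver events
        }
    }
}
```

In a real service you would replace the println with a call into your alerting or metrics client and leave the stream running for the life of the process.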
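Another often-missed feature: JFR records network and file I/O. The socket and file read/write events (jdk.SocketRead, jdk.SocketWrite, jdk.FileRead, jdk.FileWrite) are part of the standard default and profile templates, but only operations slower than the event's threshold are recorded. Lower that threshold in a custom .jfc settings file if you need to see faster calls. This helps when the bottleneck is external.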
JFR's event streaming API is your secret weapon for real-time observability — use it to trigger alerts on allocation rate spikes without polluting your logs.
Also consider combining both tools: run JFR continuously, and when you see an anomaly in the JFR dashboard, attach async-profiler for a deep dive. This hybrid approach gives you both broad coverage and surgical precision. The cost is minimal — JFR overhead is <1% and async-profiler is temporary.
Production story: A team used async-profiler CPU mode and saw no hot methods. But JFR's allocation profile revealed a massive allocation rate from a logging framework at DEBUG level. They turned down log verbosity, saving 20% CPU — a fix that CPU profiling alone couldn't find.
Reading Flame Graphs: What to Look For
A flame graph is a visual representation of a set of sampled stack traces. The x-axis is not time: sibling frames are sorted alphabetically so identical stacks merge, and the width of each rectangle is proportional to the number of samples that included that frame. The y-axis is the stack depth: the top is the function actually running, and below it are its callers. Color indicates the frame type, and the palette depends on the tool. In recent async-profiler HTML output, for example, green is Java code, yellow is C++ (JVM internals), red is native user code, and orange is kernel code.
When reading a flame graph, start at the top and look for the widest frames. Those are your hot spots. A common trap is staring at a wide frame that's a low-level method like Unsafe.park(). That's not the culprit; its caller is. Always trace wide frames downward, toward their callers, to find the application code that triggers them. If the graph shows many thin, tall towers, you have deep call stacks, often recursion or poorly designed frameworks. If the graph looks like a plateau (many wide frames at similar depth), you have multiple hot spots.
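If "width" still feels abstract, the following sketch computes it by hand from collapsed stacks, the semicolon-separated text format that async-profiler can emit with -o collapsed. The class name FrameWidths and the sample stacks are invented for the example; the point is only to show that a frame's width is the share of samples whose stack contains it, and that "self" samples belong to the topmost frame.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class FrameWidths {
    final Map<String, Long> inclusive = new HashMap<>(); // samples whose stack contains the frame
    final Map<String, Long> self = new HashMap<>();      // samples where the frame was on top
    long total;

    /** Each line looks like "root;caller;leaf 42" (root first, sample count last). */
    static FrameWidths of(List<String> collapsedLines) {
        FrameWidths widths = new FrameWidths();
        for (String raw : collapsedLines) {
            String line = raw.trim();
            int sep = line.lastIndexOf(' ');
            if (sep <= 0) {
                continue; // blank or malformed line
            }
            long samples = Long.parseLong(line.substring(sep + 1));
            String[] frames = line.substring(0, sep).split(";");
            widths.total += samples;
            // Count a frame once per stack, even if it recurses.
            Set<String> distinct = new LinkedHashSet<>(Arrays.asList(frames));
            for (String frame : distinct) {
                widths.inclusive.merge(frame, samples, Long::sum);
            }
            widths.self.merge(frames[frames.length - 1], samples, Long::sum);
        }
        return widths;
    }

    /** Width of a frame as a percentage of all samples. */
    double widthPercent(String frame) {
        return total == 0 ? 0.0 : 100.0 * inclusive.getOrDefault(frame, 0L) / total;
    }
}
```

With stacks such as "main;handle;log;HashMap.put 30", HashMap.put owns the self time, but walking toward the root shows that log and handle are the application frames responsible for it.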
Another pattern: a 'mountain' shape with a single wide top — that's your one bottleneck. A 'volcano' with multiple peaks — load is spread across several paths; optimising any one may shift the bottleneck without much improvement overall.
Real-world mistake: One team saw a wide frame for java.util.HashMap.put() and assumed they needed a faster hash map. But the flame graph showed it was called from a logging framework at DEBUG level. Turning down log verbosity fixed the CPU usage in 30 seconds. The frame width lies without context.
You'll also encounter 'icicle' graphs (inverted flame graphs), where the root is at the top and the running functions hang at the bottom. Some viewers render this way by default or as an option. Same reading technique: look for the widest leaf frames, which now sit at the bottom of the icicle.
Here's a rule I've learned the hard way: always capture the flame graph you intend to act on under peak load, and keep an off-peak capture only as a comparison baseline. A flame graph from a quiet period on its own shows you nothing useful. The hot spots only reveal themselves under pressure. Profile under load or don't profile at all.
Also note: flame graphs can be misleading for lock contention. A thread blocked on a lock doesn't appear CPU-sampled. That's why you need wall-clock mode. Always cross-reference flame graphs with thread dumps.
Short-lived methods can be invisible in CPU flame graphs — use JFR to capture them. Every hot spot you fix changes the shape of the graph; retest after each optimisation.
One advanced tip: use differential flame graphs to compare two profiling sessions. For example, compare before and after a deployment. The diff shows exactly which methods got hotter or cooler. This is invaluable for detecting performance regressions. async-profiler doesn't generate diffs natively, but you can use the FlameGraph toolkit's difffolded.pl script.
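The idea behind a differential view can be sketched in a few lines of Java: parse two collapsed-stack captures and subtract the sample counts per stack. This is not how difffolded.pl is implemented, only an illustration of the comparison; the class name CollapsedDiff is made up, and the sketch assumes both captures used the same duration and sampling interval so raw counts are comparable.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class CollapsedDiff {

    /** Parses lines of the form "frameA;frameB;frameC 123" into stack -> samples. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> counts = new HashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            int sep = line.lastIndexOf(' ');
            if (sep <= 0) {
                continue; // blank or malformed line
            }
            long samples = Long.parseLong(line.substring(sep + 1));
            counts.merge(line.substring(0, sep), samples, Long::sum);
        }
        return counts;
    }

    /** Returns after - before for every stack whose sample count changed. */
    static Map<String, Long> diff(Map<String, Long> before, Map<String, Long> after) {
        Set<String> stacks = new HashSet<>(before.keySet());
        stacks.addAll(after.keySet());
        Map<String, Long> delta = new TreeMap<>();
        for (String stack : stacks) {
            long change = after.getOrDefault(stack, 0L) - before.getOrDefault(stack, 0L);
            if (change != 0) {
                delta.put(stack, change); // positive = hotter, negative = cooler
            }
        }
        return delta;
    }
}
```

Positive deltas are the stacks that got hotter after the deployment, which is usually where a regression hunt should start.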
Another common trap: reading a flame graph from a single sample. Always take multiple samples over time to see the trend. A 30-second sample may miss intermittent spikes.
- Mountain shape: one dominant bottleneck — optimize this method.
- Volcano shape: multiple hot spots — improving one may shift bottleneck.
- Plateau shape: many methods consuming roughly equal time — focus on allocation or I/O.
Heap Analysis: Finding the Leak
Memory leaks in Java are almost never about unreachable objects — those get GC'd. The real leaks come from accidental retention: objects that remain reachable but are no longer needed. Common patterns include static collections (caches without eviction), thread-local variables that accumulate, or JDBC statements not closed. Profiling a leak means taking a heap dump at a point when you suspect the heap has grown, then analyzing it to find the 'dominator' objects that hold the most memory.
Eclipse MAT (Memory Analyzer Tool) is the de facto standard for this. You load a heap dump, run the 'Leak Suspects' report, and it highlights the biggest retained sets. The 'Dominator Tree' view shows which objects would be freed if a given root were removed. For example, if a single HashMap holds 90% of the heap with 2 million stale entries, that's the leak.
Important: take a live heap dump (jmap -dump:live) to exclude unreachable objects. Full heap dumps include all objects and take much longer to process. Always capture a few dumps over time to see the growth rate — one snapshot can't tell you if the growth is a leak or just a large but stable cache.
Also consider using JFR's allocation profiling to find the call sites that produce the most garbage. Sometimes the fix is not to remove the collection but to reduce its creation rate.
Production nuance: A heap dump from a process that's about to OOM might be truncated — critical objects could be missing. Always capture a second dump after recovery to compare. Also, the Leak Suspects report is a heuristic; always verify by examining the retaining stack traces in the dominator tree.
I once saw a Leak Suspects report point at a HashMap in logging, but the real culprit was a thread-local cache in the user session handler. Always cross-check with the thread overview and dominator tree.
Here's the uncomfortable truth about heap analysis: by the time you notice the leak, it's been running for days. You need to calculate the leak rate. Take a dump, wait an hour, take another. If the retained heap of a suspect class grew by 200MB, you have a leak rate of ~3.3MB/min. That number tells you how long you have before the next OOM — and whether a hotfix can wait until the next release cycle.
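The arithmetic is simple enough to keep in a small helper. The sketch below reproduces the 200MB-per-hour example and adds a time-to-exhaustion estimate; the class name LeakRate and the 1GB of remaining headroom are assumptions chosen for illustration.

```java
final class LeakRate {

    /** Growth in retained heap between two dumps, in MB per minute. */
    static double mbPerMinute(long retainedBytesFirst, long retainedBytesSecond, double minutesBetween) {
        if (minutesBetween <= 0) {
            throw new IllegalArgumentException("minutesBetween must be positive");
        }
        double grownMb = (retainedBytesSecond - retainedBytesFirst) / (1024.0 * 1024.0);
        return grownMb / minutesBetween;
    }

    /** Minutes until the remaining free heap is consumed at the given leak rate. */
    static double minutesUntilFull(long freeBytes, double leakMbPerMinute) {
        if (leakMbPerMinute <= 0) {
            return Double.POSITIVE_INFINITY; // not growing, so no deadline
        }
        return (freeBytes / (1024.0 * 1024.0)) / leakMbPerMinute;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        double rate = mbPerMinute(0, 200 * mb, 60);          // 200MB growth in one hour
        double minutesLeft = minutesUntilFull(1024 * mb, rate); // hypothetical 1GB of headroom
        System.out.printf("leak rate %.1f MB/min, about %.0f minutes to OOM%n", rate, minutesLeft);
    }
}
```

With 1GB of free heap and a 3.3MB/min leak, the estimate is roughly five hours, which is the number that decides whether the fix can wait for the next release.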
One more tip: on JDK 11+, use jcmd <pid> GC.heap_dump <file> instead of jmap. By default it dumps only live objects, which forces a full GC first; pass -all to include unreachable objects and skip that GC.
Leak rate calculation is your timeline to failure — use two dumps taken an hour apart to compute it. A leak rate of 5MB/min means you have hours, not days.
For advanced users: combine heap analysis with allocation profiling. Use async-profiler's -e alloc to see which code paths create the most objects, then correlate those call sites with the objects found in the heap dump. This cross-reference is far more powerful than either technique alone.
A story: A team's leak suspect report showed a HashMap in logging, but after using OQL to find the largest entries, they discovered it was a cache in their session handling. The fix was adding TTL eviction, reducing heap growth from 10MB/min to 0.
Use async-profiler's -e alloc mode alongside heap dumps to pinpoint the code that allocates the most memory.
GC Tuning: The Production Reality
GC tuning is the most overrated performance activity. Most applications need none — default settings with G1 (Java 9+) or Parallel (pre-9) work fine up to moderate loads. Tuning only matters when you have evidence from profiling that GC is causing latency or throughput issues. That evidence comes from JFR GC events or from explicit GC logs.
When you do need to tune, the three most impactful knobs are:
1. Heap size (-Xms, -Xmx): Too small causes frequent GC, too large causes long pauses. Start with matching initial and max to reduce resizing overhead.
2. Pause time goal (-XX:MaxGCPauseMillis for G1): The GC tries to keep pauses under this, but may increase frequency or reduce throughput to meet it. A tight goal (like 10ms) forces more minor GCs.
3. InitiatingHeapOccupancyPercent (G1): When old-generation occupancy reaches this percentage of the heap, G1 starts a concurrent marking cycle. Lower it to start earlier (reduces risk of concurrent mode failure). The default is 45%, and since JDK 9 G1 treats that value as the starting point for adaptive IHOP unless you turn off -XX:G1UseAdaptiveIHOP. Reducing it to 30% gives more time for concurrent work.
For ZGC and Shenandoah, the story is different: they aim for sub-millisecond pauses at the cost of some CPU overhead (ZGC uses load barriers, Shenandoah uses forwarding pointers). They shine on very large heaps (100GB+) but have higher baseline CPU usage. Profile your allocation rates first — if you're allocating 50GB/min, no GC will be happy.
One subtle trap: the -XX:MaxGCPauseMillis flag is a goal, not a guarantee. G1 will adjust the region set to try to meet it, but under high allocation pressure it may not be achievable. Monitor the pause lines in your GC log (for example with -Xlog:gc) to see if the target is consistently missed.
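One way to check whether the goal is being missed is to scan the GC log for pause lines and count how many exceed it. The sketch below assumes unified-logging output where each pause line ends with its duration in milliseconds (for example "... Pause Young (Normal) (G1 Evacuation Pause) 24M->4M(256M) 3.456ms"); the class name PauseGoalCheck is hypothetical, and the exact line format can vary between JDK versions.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PauseGoalCheck {

    // Captures the trailing "<number>ms" on lines that mention a Pause.
    private static final Pattern PAUSE_LINE =
            Pattern.compile("\\bPause\\b.*?(\\d+(?:\\.\\d+)?)ms\\s*$");

    /** Fraction of logged pauses (0.0 to 1.0) that were longer than the goal. */
    static double fractionOverGoal(List<String> logLines, double goalMillis) {
        int pauses = 0;
        int overGoal = 0;
        for (String line : logLines) {
            Matcher m = PAUSE_LINE.matcher(line);
            if (!m.find()) {
                continue; // concurrent phases and other lines are ignored
            }
            pauses++;
            if (Double.parseDouble(m.group(1)) > goalMillis) {
                overGoal++;
            }
        }
        return pauses == 0 ? 0.0 : (double) overGoal / pauses;
    }
}
```

If a large share of pauses sits above the goal for a sustained period, the target is unrealistic for the current allocation rate and should be relaxed, or the allocation rate reduced.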
Real example: A team set MaxGCPauseMillis=50 on a 32GB heap with 80GB/min allocation rate. G1 could not meet the target and started triggering back-to-back young GCs, causing 30% throughput loss. They had to increase the pause target to 200ms and optimise allocation rates instead.
Default G1 settings assume moderate allocation rates (~100MB/s). If you're allocating >500MB/s, you need to tune regardless of heap size. The allocation rate is the real determinant of GC pressure.
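Here's what most guides won't tell you: -XX:G1HeapRegionSize matters more than people think. G1 picks the region size ergonomically, aiming for roughly 2048 regions, with sizes from 1MB up to 32MB. On a 4GB heap that works out to 2MB regions, and a 2MB setting carried over to a 64GB heap would mean about 32,000 regions. That's a lot of tracking overhead. If your regions are small for your heap, bump the size to 16MB or 32MB. Fewer regions, less bookkeeping, better pause predictability. I've seen this single change reduce mixed GC pauses by 40%.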
Allocation rate is the real enemy — if it's >500MB/s, GC tuning alone won't save you. Profile allocation sources and fix them before touching GC flags.
Another subtle point: the -XX:+UseStringDeduplication flag can reduce memory usage in applications with many duplicate strings. But it adds CPU overhead. Profile before enabling to ensure net gain.
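Don't forget to check your GC logs for premature promotion. If young objects are being promoted to old gen early because survivor spaces are too small, old gen fills faster and you'll see more full GCs. Adjust -XX:SurvivorRatio or -XX:NewRatio to give young gen more room, but be aware that with G1, fixing the young generation size overrides the sizing G1 does to meet the pause time goal.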
A story: a team was seeing 20s STW pauses. They reduced IHOP from 45 to 30 and increased heap from 4GB to 8GB. Pauses dropped to <100ms. They also fine-tuned G1HeapRegionSize from 2MB to 16MB, further reducing mixed GC pauses by 40%. The key was profiling allocation rates first.
Production-Safe Profiling: Do's and Don'ts
Profiling in production requires caution. The wrong tool or command can pause your JVM for seconds or even minutes. Here's what works safely:
Do use: jcmd for JFR commands (no JVM pause), async-profiler with perf_events (zero overhead when not sampling), and jstack for thread dumps (it brings all threads to a brief safepoint, which is acceptable).
Don't use: jmap -histo carelessly. Without :live it does NOT trigger a GC and counts all objects including garbage, which is misleading when you are hunting a leak; with :live it forces a full GC, which is a stop-the-world pause on a busy JVM. jmap -clstats can slow down the JVM, and jhat was removed in JDK 9. Avoid attaching old JVMTI agents (like HPjmeter) that require the JVM to be started with -agentpath.
Rule of thumb: If a profiling command requires you to add JVM flags and restart, test it on staging first. If it attaches to a running process and claims <5% overhead, it's likely safe for production. Always start with a 10-second sample on a single instance, verify that latency and throughput on that instance hold steady, then expand to longer durations.
Never profile every instance in a cluster simultaneously — the aggregate overhead can saturate host resources and cause a cascading failure.
Real production horror: An engineer attached async-profiler to all 20 instances of a payment service simultaneously. The CPU overhead from perf_events caused cascading timeouts. The team had to kill the profiler and restart half the cluster. Rule: one instance at a time.
I once saw an engineer attach async-profiler to a JVM that had 95% heap usage. The profiler triggered additional memory allocation and the process OOM'd within 30 seconds. Rule: never profile a JVM that's over 80% heap. The one command that's safe in any state is jstack. Even when heap is 99% full, jstack still works.
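A pre-flight check for that 80% rule might look like the sketch below. It reads the heap usage of the JVM it runs in through the standard MemoryMXBean; for a remote process you would read the same numbers over JMX or from jcmd <pid> GC.heap_info instead. The class name HeapHeadroom is an assumption for this example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class HeapHeadroom {

    /** True if used/max is at or below the allowed fraction (for example 0.8). */
    static boolean safeToProfile(long usedBytes, long maxBytes, double maxUsedFraction) {
        if (maxBytes <= 0) {
            return false; // max heap undefined: be conservative and refuse
        }
        return (double) usedBytes / maxBytes <= maxUsedFraction;
    }

    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        boolean ok = safeToProfile(heap.getUsed(), heap.getMax(), 0.8);
        System.out.printf("heap used %d of %d bytes, safe to attach profiler: %b%n",
                heap.getUsed(), heap.getMax(), ok);
    }
}
```

Wiring a check like this into whatever script launches the profiler turns the rule from tribal knowledge into a guard rail.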
Another thing nobody tells you: JVM TI agents (including async-profiler) can cause transient performance degradation during attachment and detachment. The JVM needs to safepoint all threads to load the agent. On a 64GB heap with 200 threads, that safepoint can take 200-500ms. Schedule your profiling sessions during maintenance windows or off-peak hours.
Also watch out: async-profiler attaches through the JVM's attach mechanism (via jattach), which relies on a Unix domain socket and files under /tmp. In some container environments (like those with a read-only root filesystem), the attach may fail. In that case, mount a writable /tmp or fall back to JFR.
Never profile a JVM over 80% heap — you'll push it into OOM. jstack is the only safe command in critical state.
One more safe practice: bound every session with an explicit duration (-d) so the profiler stops on its own even if you lose your shell, and write the output to a file you can copy off the host. For example: ./profiler.sh -e cpu -d 30 -f flame.html <pid>.
Profiling in Containers: Docker and Kubernetes Pitfalls
Containerized Java apps introduce new profiling complications that can lead to false data or no data at all. The core issue: perf_events (used by async-profiler's CPU mode) are restricted inside containers unless the container runs with elevated privileges or specific sysctl settings.
In Docker, you need --cap-add=SYS_ADMIN or --security-opt seccomp=unconfined to allow async-profiler CPU profiling. Without that, the profiler will fail with "No access to perf events." The safer workaround is to use async-profiler's -e itimer mode, which uses a timer-based approach instead of perf_events. It is slightly less precise but works in unprivileged containers.
In Kubernetes, the situation is trickier. Even with perf_event_open allowed, container CPU limits via CFS can cause the profiler to see inflated CPU percentages because the kernel throttles the container. Your flame graph might show a wide __schedule frame — that's the throttling, not your code. Always correlate with container CPU usage metrics from cgroups.
JFR works reliably in containers because it doesn't depend on perf_events. However, JFR recordings from inside a container reflect only the container's view of CPU and memory. If you have a shared node, the JFR data won't show other containers' resource contention. Use kubectl top or node-level Prometheus metrics to get the full picture.
Production story: A team saw a recurring "CPU spike every 5 minutes" in their Kubernetes dashboard. async-profiler showed sun.rmi.transport.tcp.TCPTransport.handleMessages at the top — it was JMX RMI heartbeat threads battling with CFS throttling. They switched to a non-blocking JMX connector and the spikes disappeared.
If async-profiler CPU mode fails in your container, the first thing to check is the container's capabilities. If you can't add SYS_ADMIN, switch to -e itimer or use JFR.
Here's the container profiling trap I keep seeing: teams deploy JFR but never look at the recordings because they don't have JDK Mission Control in their workflow. Set up automated JFR dump collection. Have a cron job copy the last hour of JFR data to object storage. When an incident happens, you have the evidence waiting for you.
Another nuance: some Kubernetes platforms (OpenShift, GKE sandbox) block perf_event_open entirely. In that case, JFR is your only option. Always test your profiling toolchain on your specific container platform before production emergencies.
Use cat /sys/fs/cgroup/cpu/cpu.stat (cgroup v1; on cgroup v2 the file is /sys/fs/cgroup/cpu.stat) to see throttled time. If nr_throttled > 0, CFS is impacting your CPU profile.
A common mistake: assuming that a container with 2 CPU cores has full access to both. If the CPU limit is set, CFS throttles the container when it exceeds its quota. This shows up as __schedule in flame graphs. The fix is to either increase CPU limits or reduce CPU demand, including GC work driven by high allocation rates, to stay under the quota.
Always test profiler permissions in your container platform before production. Use cat /sys/fs/cgroup/cpu/cpu.stat to see if CFS throttling is affecting your profile.
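If you would rather have a number than eyeball the file, the following sketch parses the contents of cpu.stat and reports what fraction of CFS periods were throttled. It relies only on the nr_periods and nr_throttled keys, which appear in both cgroup v1 and v2; the class name CpuThrottle and the sample values are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class CpuThrottle {

    /** Parses "key value" lines as found in a cgroup cpu.stat file. */
    static Map<String, Long> parse(String cpuStat) {
        Map<String, Long> values = new HashMap<>();
        for (String line : cpuStat.split("\\R")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue;
            }
            try {
                values.put(parts[0], Long.parseLong(parts[1]));
            } catch (NumberFormatException ignored) {
                // skip lines whose value is not a plain integer
            }
        }
        return values;
    }

    /** Fraction of scheduler periods in which the container was throttled. */
    static double throttledFraction(String cpuStat) {
        Map<String, Long> stat = parse(cpuStat);
        long periods = stat.getOrDefault("nr_periods", 0L);
        long throttled = stat.getOrDefault("nr_throttled", 0L);
        return periods == 0 ? 0.0 : (double) throttled / periods;
    }

    public static void main(String[] args) {
        String sample = "nr_periods 200\nnr_throttled 50\nthrottled_time 1234567\n";
        System.out.printf("throttled in %.0f%% of periods%n", 100 * throttledFraction(sample));
    }
}
```

Take the reading before and after a profiling session: if the throttled fraction climbs while you sample, treat the widths in that flame graph with suspicion.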
- JFR: works inside any container, but doesn't see other pods' resource usage.
- async-profiler perf_events: needs extra privileges; reflects container-limited CPU time.
- async-profiler itimer: works without privileges; timer-based, slightly less accurate but safe.
- Node-level tools (perf top, /sys/fs/cgroup): give the host-level perspective you need for cross-container contention.
Advanced Heap Dump Analysis with OQL
Eclipse MAT's Leak Suspects report is great for a first pass, but sometimes you need surgical precision. OQL (Object Query Language) is a SQL-like query language for heap dumps that lets you find specific objects, count instances, explore references, and even compute retained sizes programmatically.
OQL is available in Eclipse MAT and also in jhat (deprecated). In MAT, open the heap dump, then click the 'OQL' tab. Common queries:
- SELECT * FROM java.util.HashMap: lists all HashMap instances (useful for caching leaks).
- SELECT toString(s), s.@retainedHeapSize FROM java.lang.String s WHERE s.@retainedHeapSize > 100000: finds large strings that might be eating memory. Filter on retained size, because a String's shallow size excludes its backing array.
- SELECT * FROM io.thecodeforge.service.MyService s WHERE s.cache.@retainedHeapSize > 500000000: checks if a specific service's cache retains more than 500MB.
- SELECT DISTINCT OBJECTS classof(t) FROM INSTANCEOF java.lang.Thread t: lists the distinct thread classes present in the dump (a starting point for thread leak detection).
Production scenario: A team noticed heap growing slowly but the Leak Suspects report gave vague results. They ran an OQL query to find all objects of a specific logger class that had accumulated millions of entries due to a missing ttl. The OQL showed the exact count and the retaining call stack, leading to the fix within minutes.
OQL's INSTANCEOF keyword includes subclasses: SELECT OBJECTS a FROM INSTANCEOF java.lang.ref.Finalizer a shows every Finalizer reference, one per object awaiting finalization, a notorious source of delayed memory leaks.
Important: OQL queries can be slow on large dumps. Always filter with WHERE clauses and avoid unconstrained SELECT * on huge classes.
You can also use OQL to read retained sizes without manually navigating the dominator tree. The @retainedHeapSize attribute is your friend; @usedHeapSize is only the shallow size of the object itself.
Here's an OQL trick that's saved me hours: SELECT * FROM INSTANCEOF java.util.Map$Entry e WHERE e.@retainedHeapSize > 1000000. This finds individual map entries with large retained heaps, the exact objects you need to evict. If the interface name matches nothing in your dump, query the concrete entry class instead, for example java.util.HashMap$Node. Most leak investigations should start here, not with the Leak Suspects report.
One more advanced use: in MAT, the FROM clause accepts a quoted regular expression as a class pattern. For example, SELECT * FROM ".*Cache" selects instances of every class whose name ends in Cache. This catches multiple cache implementations without spelling each one out. JavaScript-style filter expressions belong to the jhat and VisualVM OQL dialect, not to MAT.
Start your leak investigation with SELECT * FROM INSTANCEOF java.util.Map$Entry e WHERE e.@retainedHeapSize > 1000000: it finds the exact entries to evict.
MAT's OQL has no GROUP BY or aggregate functions, so for a ranked list of classes by total retained heap, open the Histogram view, calculate retained sizes, and sort by that column.
A trick: Use SELECT * FROM INSTANCEOF java.lang.ThreadLocal to find ThreadLocal instances that may hold large objects.
Use SELECT DISTINCT OBJECTS classof(t.target) FROM INSTANCEOF java.lang.Thread t WHERE t.@retainedHeapSize > 1000000 to find thread objects holding significant heap, a sign of thread-local accumulation.
Performance Tuning Workflow: A Step-by-Step Production Example
Now let's put everything together with a real workflow. Suppose you have a payment processing service that's showing 99th percentile latency of 2 seconds during peak hours. Here's the exact sequence:
- Start JFR recording on one canary instance. Use the profile template for 5 minutes. This captures GC events, allocation rates, thread CPU, and lock contention.
- Load the JFR dump in JDK Mission Control. Go to the 'GC Pauses' view. If you see pauses >100ms, you have a GC problem. If the allocation rate is >500MB/s, you have an allocation problem.
- If GC is not the dominant issue, attach async-profiler for a wall-clock sample. Look for wide frames at the top. If you see java.net.SocketInputStream.socketRead0 wide, the service is waiting on network I/O.
- If allocation is high, run ./profiler.sh -e alloc -d 30 -f alloc.html <pid>. The allocation flame graph will show which call sites create the most objects.
- After identifying the hot spot, implement the fix (e.g., cache, pool, reduce object creation).
- Redeploy the canary and repeat steps 1-2 with the same JFR settings. Compare the new recording with the baseline.
This loop — profile, diagnose, fix, verify — is the only reliable way to tune performance. Guessing leads to wasted sprints.
Here's a Java example of a method that's a common allocation hotspot and its fix:
```java
public final class KeyBuilder {
    // Before: every call allocates a new String (javac emits a StringBuilder
    // chain on JDK 8 and an invokedynamic concatenation on JDK 9+).
    public static String before(String prefix, int id) {
        return prefix + ":" + id;
    }

    // After: String.format would create even more objects. For a path called
    // thousands of times per second, append the parts directly into a buffer
    // the caller already owns, so no intermediate String is created.
    public static void after(StringBuilder out, String prefix, int id) {
        out.append(prefix).append(':').append(id);
    }
}
```
Remember: always measure before and after. A change that looks smart on paper may not move the needle.
In practice, 80% of improvements come from the first two hot spots. Once you've addressed those, diminishing returns set in fast. Stop when you meet SLO — don't over-optimise.
One more production lesson: never trust a microbenchmark. The way your code runs in a JMH harness is completely different from how it runs under real load with GC, JIT warmup, and memory pressure. Always validate optimizations in production with the full profiling pipeline. If a change shows no improvement in the JFR comparison, revert it. Code complexity without performance gain is a net negative.
A story: One team spent a week optimizing a method that accounted for 5% of CPU; they missed the real bottleneck in JDBC pooling. Profiling first would have saved them days. Another team over-optimized by adding complex caching that caused memory pressure. They had to revert. The lesson: stop when SLO is met.
- Step 1: Start with JFR (default or profile) for 5 minutes on a canary.
- Step 2: Analyze the recording — find the top 3 hot spots by CPU, allocation, or latency.
- Step 3: Implement one change per iteration; never batch multiple optimizations.
- Step 4: Redeploy the canary and profile again with the same settings.
- Step 5: Compare before/after recordings. Did the hot spot shrink?
G1 GC Concurrent Mode Failure: Diagnosis and Tuning
Concurrent mode failure is G1 GC's worst-case scenario. It happens when the concurrent marking phase cannot finish before the old generation fills up. The JVM then falls back to a stop-the-world (STW) full GC, which compacts the entire heap and can pause application threads for seconds to tens of seconds.
The root cause is usually an allocation rate that exceeds the concurrent marker's throughput. G1 starts a concurrent marking cycle when old-generation occupancy crosses the threshold set by -XX:InitiatingHeapOccupancyPercent (IHOP, default 45%). Marking runs in the background while the application continues to allocate. If the application fills the remaining heap before the cycle completes and the following mixed collections can reclaim space, concurrent mode failure occurs.
- GC logs will show a full GC, typically a 'Pause Full' line (for example 'Pause Full (G1 Compaction Pause)' or 'Pause Full (Allocation Failure)', depending on the JDK version), often preceded by 'To-space exhausted'. The literal phrase 'Concurrent Mode Failure' comes from the CMS collector; G1 does not print it.
- JFR recordings will show a long pause event of type 'G1 Pause Full' with duration in seconds.
- The allocation rate in the recording will be high, often >500MB/s.
Tuning knobs:
1. Reduce IHOP: Lower -XX:InitiatingHeapOccupancyPercent to start concurrent marking earlier. Values between 30% and 40% are common. This gives the marker more time to finish before the heap fills.
2. Increase heap size: More heap means more runway before saturation. If you have 4GB heap and allocate 500MB/s, the heap fills in ~8 seconds. With 8GB, you get ~16 seconds.
3. Increase heap region size (-XX:G1HeapRegionSize): Larger regions reduce the marking bitmap overhead and can speed up concurrent marking. For heaps >16GB, try 16MB or 32MB.
4. Increase the number of concurrent marking threads (-XX:ConcGCThreads): Default is often (ParallelGCThreads + 2) / 4. You can increase it, but be aware it steals CPU from application threads.
Production example: In the incident described earlier, the service had 4GB heap with default IHOP=45 and allocation rate ~300MB/s. The concurrent marking took ~8 seconds, but the heap filled in ~10 seconds. That 2-second gap was too tight. By reducing IHOP to 30, the concurrent cycle started earlier, and by increasing heap to 8GB, the filling time doubled to ~20 seconds, giving the marker plenty of room.
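The reasoning in that example boils down to one comparison: how long does the heap take to go from the IHOP threshold to full, versus how long does marking take? Below is a back-of-the-envelope model, not a description of G1's internals. The class name IhopHeadroom is made up, and the fill rate is an input you would measure from GC logs as the rate at which occupancy actually grows, which is lower than the raw allocation rate because young collections reclaim most objects.

```java
final class IhopHeadroom {

    /** Seconds between crossing the IHOP threshold and the heap being full. */
    static double secondsFromIhopToFull(double heapMb, int ihopPercent, double fillRateMbPerSec) {
        if (fillRateMbPerSec <= 0) {
            return Double.POSITIVE_INFINITY; // occupancy is not growing
        }
        double freeAtIhopMb = heapMb * (100 - ihopPercent) / 100.0;
        return freeAtIhopMb / fillRateMbPerSec;
    }

    /** True if marking finishes with at least the requested safety margin to spare. */
    static boolean markingFits(double secondsToFull, double markingSeconds, double marginSeconds) {
        return secondsToFull - markingSeconds >= marginSeconds;
    }

    public static void main(String[] args) {
        // Hypothetical figures, chosen only to show the shape of the calculation.
        double tight = secondsFromIhopToFull(4000, 45, 220); // small heap, default IHOP
        double roomy = secondsFromIhopToFull(8000, 30, 280); // bigger heap, lower IHOP
        System.out.printf("tight: %.1fs, fits: %b%n", tight, markingFits(tight, 8, 5));
        System.out.printf("roomy: %.1fs, fits: %b%n", roomy, markingFits(roomy, 8, 5));
    }
}
```

The useful habit is to write down the margin you want (here 5 seconds, an arbitrary choice) and then pick the heap size and IHOP that deliver it at your measured fill rate.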
- Always monitor allocation rates along with GC logs. A sudden increase in allocation rate is an early warning.
- Set up alerts on full GC pauses ('Pause Full' lines in GC logs, or the corresponding JFR garbage collection events).
- Use JFR's streaming API to trigger an alert if the allocation rate exceeds a threshold (e.g., 400MB/s) for more than 10 seconds.
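The "above a threshold for more than 10 seconds" part of that last alert is easy to get wrong, so here is a small sketch of just that logic. It is deliberately independent of where the rate samples come from (a JFR stream handler, a metrics poller, or anything else); the class name SustainedRateAlert and the numbers in the usage note are assumptions for illustration.

```java
final class SustainedRateAlert {
    private final double thresholdMbPerSec;
    private final long sustainMillis;
    private boolean above;          // are we currently in a streak above the threshold?
    private long streakStartMillis; // timestamp of the first sample in the streak

    SustainedRateAlert(double thresholdMbPerSec, long sustainMillis) {
        this.thresholdMbPerSec = thresholdMbPerSec;
        this.sustainMillis = sustainMillis;
    }

    /**
     * Feeds one allocation-rate sample. Returns true while the rate has stayed
     * above the threshold for longer than the configured window.
     */
    boolean onSample(long timestampMillis, double rateMbPerSec) {
        if (rateMbPerSec <= thresholdMbPerSec) {
            above = false; // any dip resets the streak
            return false;
        }
        if (!above) {
            above = true;
            streakStartMillis = timestampMillis;
        }
        return timestampMillis - streakStartMillis > sustainMillis;
    }
}
```

A caller would create new SustainedRateAlert(400.0, 10_000) and pass each sample to onSample, firing the alert the first time it returns true. Short bursts above 400MB/s reset quietly, so only a sustained climb pages anyone.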
Trade-off: Reducing IHOP means G1 spends more CPU on concurrent marking, which can reduce application throughput by 5-10% during marking. For most services, this is an acceptable trade-off to avoid STW pauses.
Remember: if you're seeing concurrent mode failure, the first thing to check is allocation rate. If it's >500MB/s, optimise allocation before tuning GC. The tuning buys time, but reducing allocation rate is the permanent fix.
The 2 AM Latency Spike: G1 Concurrent Mode Failure
- Never tune GC without first profiling allocation rates and pause frequencies.
- Monitor GC logs alongside application metrics — they tell the real story.
- Always test GC changes under production-like load; a config that works at 1k req/min may fail at 10k.
- Concurrent mode failure is the production killer; know your IHOP setting before you need to change it.
Common mistakes to avoid
- Optimizing without profiling first
- Using jmap -histo without :live. Use jmap -histo:live or jcmd <pid> GC.heap_dump for a live dump that excludes unreachable objects.
- Ignoring wall-clock profiling. Use wall-clock mode (-e wall) to capture blocked threads and correlate with thread dumps.
- Tuning GC before measuring allocation rate
- Assuming one heap dump tells the full story
Interview Questions on This Topic
Explain the difference between JFR and async-profiler. When would you use each?