Senior 6 min · March 06, 2026

JIT Deoptimization — 250x Latency from Class Loading

P99 latency jumps 250x when class loading triggers a JIT deoptimization storm.

Naren · Founder
Plain-English first. Then code. Then the interview question.
Quick Answer
  • JIT compilation converts bytecode to native machine code at runtime based on profiling data
  • Tiered compilation: Interpreter (Tier 0) → C1 (Tiers 1-3) → C2 (Tier 4)
  • Performance insight: C2-compiled code is often within 5-20% of hand-written C, but by default a method needs roughly 5,000 invocations (or 15,000 invocations and loop back-edges combined) before C2 compiles it
  • Production insight: Deoptimization storms from late class loading can cause latency spikes that look like GC pauses
  • Biggest mistake: Assuming warmup happens in seconds — real production services need 30-60 seconds of realistic traffic to hit peak throughput
Plain-English First

Imagine a chef who receives recipe cards written in a foreign language. A traditional interpreter reads each instruction one at a time, translating as they cook — slow but starts immediately. A JIT compiler is like a chef who notices they make the same dish fifty times a day, so they memorize it in their native language and execute it from muscle memory from then on. The more they cook it, the faster they get — because the work of translating happens once and the result gets reused forever.

Every time you run a Java program (or JavaScript, or Python on PyPy) and it gets faster the longer it runs, that's a Just-In-Time compiler quietly doing something remarkable: watching your code execute, figuring out which paths are traveled most, and recompiling those exact paths into highly optimized native machine code at runtime. No restart required, no ahead-of-time guessing. The JIT is one of the most sophisticated pieces of software running silently in your production systems right now.

The problem it solves is fundamental: interpreted languages are portable because they run on a virtual machine, but virtual machines are slow because they translate instructions at runtime. Ahead-of-time compilers solve speed but sacrifice runtime information — they can't know which branch your users actually take or what types your polymorphic methods actually receive. JIT compilation threads this needle by compiling adaptively, using real execution data to make optimizations no static compiler could ever make.

By the end of this article you'll understand exactly how HotSpot's tiered compilation pipeline works, what profiling data the JIT actually collects, why deoptimization exists and when it fires, how to read JIT logs to debug performance regressions, and what production patterns silently kill JIT effectiveness. You'll go from 'the JVM warms up' to 'I can explain exactly what's happening during warmup and why.'

The JIT Pipeline: From Bytecode to Native Code in Three Tiers

HotSpot JVM doesn't flip a single switch from 'interpreted' to 'compiled'. It runs a tiered system with five distinct levels, though three are conceptually important: pure interpretation (Tier 0), the C1 client compiler (Tiers 1-3), and the C2 server compiler (Tier 4).

Tier 0 is pure interpretation — the interpreter executes bytecode directly and, critically, it's also gathering profiling data: method invocation counts, branch frequencies, and receiver type profiles for virtual calls. This data is cheap to collect and priceless later.

Once a method has been invoked about 200 times (-XX:Tier3InvocationThreshold), or its invocations plus loop back-edges pass about 2,000 (-XX:Tier3CompileThreshold), C1 compiles it quickly into native code with light optimizations. C1 is fast to compile and produces code about 2-5x faster than interpreted. But it keeps profiling.

Once that same method reaches roughly 5,000 invocations (-XX:Tier4InvocationThreshold), or roughly 15,000 invocations and loop back-edges combined (-XX:Tier4CompileThreshold), C2 takes over. These are defaults, and HotSpot scales them up when the compile queue is backed up. C2 spends significantly more time compiling, using the profiling data gathered in the interpreter and in C1 code, and produces code that rivals hand-written C. The key insight is that C2 can inline virtual method calls because the profile told it 'this call site always receives a HashMap, never anything else.' It bets on that. If it's wrong, it deoptimizes.
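To make the thresholds concrete, here is a minimal sketch of the check HotSpot's tiered policy applies when deciding whether a method is due for C1 or C2. It is a simplified model, not the JVM's own code: the class and method names are made up for illustration, the constants are the default values of the corresponding -XX flags, and the load-based scaling HotSpot applies when the compile queue is busy is left out.

```java
// Simplified model of the tiered "is this method hot enough?" check.
// Assumptions: default flag values, scale factor fixed at 1.0.
final class TierTransitionModel {
    // Defaults of -XX:Tier3InvocationThreshold, -XX:Tier3MinInvocationThreshold,
    // -XX:Tier3CompileThreshold and their Tier4 counterparts.
    static final int TIER3_INVOCATION_THRESHOLD = 200;
    static final int TIER3_MIN_INVOCATION_THRESHOLD = 100;
    static final int TIER3_COMPILE_THRESHOLD = 2_000;
    static final int TIER4_INVOCATION_THRESHOLD = 5_000;
    static final int TIER4_MIN_INVOCATION_THRESHOLD = 600;
    static final int TIER4_COMPILE_THRESHOLD = 15_000;

    /** True when an interpreted method qualifies for C1 (tier 3) compilation. */
    static boolean readyForC1(int invocations, int backEdges) {
        return invocations >= TIER3_INVOCATION_THRESHOLD
            || (invocations >= TIER3_MIN_INVOCATION_THRESHOLD
                && invocations + backEdges >= TIER3_COMPILE_THRESHOLD);
    }

    /** True when a C1-profiled method qualifies for C2 (tier 4) compilation. */
    static boolean readyForC2(int invocations, int backEdges) {
        return invocations >= TIER4_INVOCATION_THRESHOLD
            || (invocations >= TIER4_MIN_INVOCATION_THRESHOLD
                && invocations + backEdges >= TIER4_COMPILE_THRESHOLD);
    }

    public static void main(String[] args) {
        System.out.println(readyForC1(150, 0));       // false: too few calls, no loops
        System.out.println(readyForC1(150, 1_900));   // true: loop iterations count too
        System.out.println(readyForC2(4_000, 0));     // false: below 5,000 invocations
        System.out.println(readyForC2(700, 14_500));  // true: 700 calls plus a hot loop
    }
}
```

The second and fourth calls show why a method with a hot loop gets compiled long before its invocation count alone would qualify.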

TieredCompilationDemo.java (Java)
import java.util.HashMap;
import java.util.Map;

public class TieredCompilationDemo {
    private static final Map<String, Integer> wordFrequency = new HashMap<>();
    static {
        wordFrequency.put("java", 42);
        wordFrequency.put("jit", 99);
        wordFrequency.put("compiler", 7);
    }

    private static int lookupFrequency(String word) {
        Integer frequency = wordFrequency.get(word);
        return (frequency != null) ? frequency : 0;
    }

    public static void main(String[] args) throws InterruptedException {
        String[] wordsToLookup = {"java", "jit", "compiler", "unknown"};
        int totalIterations = 500_000;
        int batchSize = totalIterations / 5;

        for (int batch = 0; batch < 5; batch++) {
            long startNanos = System.nanoTime();
            for (int i = 0; i < batchSize; i++) {
                String word = wordsToLookup[i % wordsToLookup.length];
                int freq = lookupFrequency(word);
                if (freq < 0) {
                    System.out.println("Negative frequency — impossible but prevents DCE");
                }
            }
            long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
            double throughput = (double) batchSize / elapsedMs * 1000;
            System.out.printf("Batch %d: %,d lookups in %d ms → %,.0f lookups/sec%n", batch + 1, batchSize, elapsedMs, throughput);
            Thread.sleep(50);
        }
    }
}
Output
Batch 1: 100,000 lookups in 48 ms → 2,083,333 lookups/sec
Batch 2: 100,000 lookups in 12 ms → 8,333,333 lookups/sec
Batch 3: 100,000 lookups in 4 ms → 25,000,000 lookups/sec
Batch 4: 100,000 lookups in 3 ms → 33,333,333 lookups/sec
Batch 5: 100,000 lookups in 3 ms → 33,333,333 lookups/sec
Pro Tip: Dead Code Elimination Will Fool Your Benchmarks
The JIT is smart enough to detect when a computation's result is never observed and delete the entire computation. Every microbenchmark that doesn't consume its output is measuring nothing. Use JMH's Blackhole.consume() or, at minimum, accumulate results into a variable you print at the end. The code above uses the 'freq < 0' trick — crude but effective for demos.
Production Insight
In production, the first request to a new service often hits only interpreted code.
If your SLA requires sub-50ms response from the first request, tiered compilation is your enemy.
Rule: always pre-warm with realistic traffic before admitting real users.
Key Takeaway
Tiered compilation is a deliberate trade-off: fast startup vs peak throughput.
C2 waits for enough profile data before committing to aggressive optimizations.
Don't disable tiered compilation unless you've measured the warmup cost.
When to Expect Each Compilation Tier
If: Method invoked fewer than about 200 times, with few loop iterations
Use: Pure interpretation, expect ~10-50x slower than native
If: Method invoked about 200 to 5,000 times
Use: C1 compiled, 2-5x faster than interpreted, but still profiling
If: Method invoked more than about 5,000 times, or invocations plus loop back-edges above about 15,000
Use: C2 compiled, near-native speed, speculative optimizations active

Speculative Optimization and Deoptimization: The JIT's Calculated Gamble

The most powerful and most misunderstood JIT technique is speculative optimization. The C2 compiler doesn't just optimize what it knows to be true — it optimizes what the profiling data suggests is almost always true, then installs a guard that triggers deoptimization if that assumption is violated.

Consider a polymorphic call site: animal.speak() where Animal is an interface. If the profile says 99.9% of calls see a Dog object, C2 inlines Dog.speak() directly at that call site, eliminating the virtual dispatch entirely. It inserts a type check guard: 'if this isn't a Dog, bail out.' When a Cat suddenly arrives, the JIT traps that guard, tosses out the compiled code for that method, and drops back to interpreter mode — this is deoptimization.
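Written out as ordinary Java, the speculation looks roughly like the sketch below. This is only an illustration of the idea: the real JVM does this in machine code, and the slow path is an uncommon trap that hands the frame back to the interpreter rather than a normal call. Animal, Dog, and Cat come from the example above; speakSpeculated and slowPathHits are hypothetical names.

```java
// Source-level picture of a speculative, guarded inline.
final class SpeculativeGuardSketch {
    interface Animal { String speak(); }

    static final class Dog implements Animal {
        @Override public String speak() { return "Woof"; }
    }

    static final class Cat implements Animal {
        @Override public String speak() { return "Meow"; }
    }

    // Stand-in for "number of times the guard failed".
    static int slowPathHits = 0;

    /** Roughly what compiled code does when the profile said "always Dog". */
    static String speakSpeculated(Animal animal) {
        if (animal.getClass() == Dog.class) { // guard: one cheap exact-type check
            return "Woof";                    // body of Dog.speak(), inlined
        }
        slowPathHits++;                       // real JVM: deoptimize here
        return animal.speak();                // real JVM: resume in the interpreter
    }

    public static void main(String[] args) {
        System.out.println(speakSpeculated(new Dog())); // fast path
        System.out.println(speakSpeculated(new Cat())); // guard fails
        System.out.println("slow path hits: " + slowPathHits);
    }
}
```

The fast path contains no virtual dispatch at all, which is where the speedup comes from; the cost is that the first Cat pays for the bet.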

Deoptimization is not catastrophic in isolation, but watch for these triggers in production: loading a new class that invalidates a 'this class has no subclasses' assumption (ClassLoading deopt), a null being seen at a previously non-null call site, or hitting a branch that was never taken during profiling. Each deopt event forces recompilation, and if they happen in a tight loop during peak traffic, you'll see latency spikes that look identical to GC pauses but won't show up in GC logs.

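You can observe deopt events in several ways: -XX:+PrintCompilation marks invalidated methods as 'made not entrant', recent JDKs log each event with its reason under -Xlog:deoptimization=debug, and JDK Flight Recorder emits a jdk.Deoptimization event. Every senior Java engineer should spend a day reading these logs in a staging environment.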

DeoptimizationTriggerDemo.java (Java)
public class DeoptimizationTriggerDemo {
    interface Greeter {
        String greet(String name);
    }
    static class FriendlyGreeter implements Greeter {
        @Override
        public String greet(String name) {
            return "Hey there, " + name + "!";
        }
    }
    static class FormalGreeter implements Greeter {
        @Override
        public String greet(String name) {
            return "Good day, " + name + ". How do you do?";
        }
    }

    private static String performGreeting(Greeter greeter, String name) {
        return greeter.greet(name);
    }

    public static void main(String[] args) throws InterruptedException {
        Greeter friendlyGreeter = new FriendlyGreeter();
        Greeter formalGreeter   = new FormalGreeter();

        System.out.println("=== Phase 1: Warming up with monomorphic call site ===");
        long sumLength = 0;
        for (int i = 0; i < 100_000; i++) {
            String result = performGreeting(friendlyGreeter, "Alice");
            sumLength += result.length();
        }
        System.out.printf("Phase 1 complete. Total chars processed: %,d%n%n", sumLength);
        Thread.sleep(200);

        System.out.println("=== Phase 2: Introducing second type — watch for deoptimization ===");
        sumLength = 0;
        for (int i = 0; i < 50_000; i++) {
            Greeter active = (i % 2 == 0) ? friendlyGreeter : formalGreeter;
            String result = performGreeting(active, "Bob");
            sumLength += result.length();
        }
        System.out.printf("Phase 2 complete. Total chars processed: %,d%n", sumLength);
        System.out.println("\nCheck the JVM's deoptimization log above (for example, run with -Xlog:deoptimization=debug).");
        System.out.println("Look for reason codes such as 'class_check' or 'bimorphic'.");
    }
}
Output
Phase 1 complete. Total chars processed: 1,700,000
Phase 2 complete. Total chars processed: 1,100,000
Watch Out: Class Loading in Production Causes Silent Deoptimization
If your app lazily loads plugin classes or deserializes new types at runtime, every previously compiled method that assumed 'this abstract class has only one implementation' will deoptimize. In microservices this often hits during the first request after a dependency is initialized. Use eager class loading in startup probes and pre-warm by replaying a representative traffic sample before marking a pod healthy.
Production Insight
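A deoptimization storm during peak traffic is easy to misdiagnose as a GC or memory problem.
CPU spikes come from recompilation; latency spikes come from execution dropping back to the interpreter.
Log deoptimizations (-XX:+PrintCompilation 'made not entrant' lines, or -Xlog:deoptimization=debug on recent JDKs) and correlate them with class loading events (-Xlog:class+load=info).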
Key Takeaway
Speculative optimization is what makes JIT fast. Deoptimization is the safety net.
The danger is not one deopt — it's a chain reaction of them.
Profile your application's class loading patterns: they're the #1 cause of silent deoptimization.
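One way to act on this is to force the classes you already know about to load and initialize before the service accepts traffic. A minimal sketch, assuming the class names can be listed up front: the EagerClassLoader helper and the example names are hypothetical, while Class.forName(String, boolean, ClassLoader) is the standard API.

```java
import java.util.List;

// Startup helper that loads and initializes a known list of classes.
final class EagerClassLoader {
    /** Loads and initializes each named class; returns how many succeeded. */
    static int preload(List<String> classNames, ClassLoader loader) {
        int loaded = 0;
        for (String name : classNames) {
            try {
                Class.forName(name, true, loader); // true = also run static initializers
                loaded++;
            } catch (ClassNotFoundException | LinkageError e) {
                // A missing optional plugin should not abort startup.
                System.err.println("Could not preload " + name + ": " + e);
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<String> names = List.of(
            "java.util.LinkedHashMap",
            "java.util.concurrent.ConcurrentSkipListMap",
            "com.example.plugins.DoesNotExist"); // hypothetical, expected to fail
        int count = preload(names, EagerClassLoader.class.getClassLoader());
        System.out.println("Preloaded " + count + " of " + names.size() + " classes");
    }
}
```

Loading classes early only settles the class-hierarchy assumptions. Call sites still need to see every implementation during warmup for their type profiles to be representative.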

What the JIT Actually Inlines — And Why Inlining Is the Master Optimization

Experienced engineers know 'inlining' is good, but few can articulate why it's the master optimization that enables all others. Here's the mechanism: when the JIT inlines a called method into its caller, the combined code body is now visible to the optimizer as a single unit. Constants propagate across the former call boundary, dead branches get eliminated, allocations can be stack-allocated (scalar replaced) instead of heap-allocated, and loop invariants can be hoisted. Without inlining, each of these is blocked by the opacity of the call.
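The effect is easiest to see as a before-and-after at source level. The sketch below pairs two methods as written with hand-simplified versions showing what the optimizer is free to produce once the callee is inlined. All names are hypothetical, and the "after" methods are written by hand to illustrate the idea; they are not actual compiler output.

```java
// Before/after view of what inlining makes possible.
final class InliningEffectSketch {
    static int scale(int value, int factor) {
        if (factor == 0) {
            return 0;               // branch depends on 'factor'
        }
        return value * factor;
    }

    /** As written: if scale() stays a call, the branch inside it must stay too. */
    static int doubled(int value) {
        return scale(value, 2);
    }

    /** After inlining: factor is the constant 2, so the zero check folds away. */
    static int doubledAfterInlining(int value) {
        return value * 2;
    }

    static final class Point {
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static int lengthSquared(Point p) {
        return p.x * p.x + p.y * p.y;
    }

    /** As written: allocates a Point on every call. */
    static int viaObject(int x, int y) {
        return lengthSquared(new Point(x, y));
    }

    /** After inlining plus escape analysis: the Point never escapes, so no allocation is needed. */
    static int viaScalars(int x, int y) {
        return x * x + y * y;
    }

    public static void main(String[] args) {
        System.out.println(doubled(21) == doubledAfterInlining(21));     // true
        System.out.println(viaObject(3, 4) == viaScalars(3, 4));         // true
    }
}
```

In both pairs the simplification is only legal because the optimizer can see the callee's body; an opaque call would block it.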

The JIT decides what to inline based on three factors: method size in bytes of bytecode (-XX:MaxInlineSize, default 35, for ordinary call sites; -XX:FreqInlineSize, default 325 on x86-64, for hot call sites), call frequency from the profile, and call chain depth. Getters, setters, and small utility methods almost always get inlined. Methods above the applicable size limit won't be, even if they're blazing hot. This is a common performance trap.

The practical consequence: your method boundaries matter for JIT performance in ways that have nothing to do with code organization. At a call site that is not yet hot, a method with 36 bytes of bytecode might not inline where a 34-byte version would. You can verify inlining decisions with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining (the unlock flag must come first). Look for 'too big' and 'hot method too big' (C2) or 'callee is too large' (C1) in the output; those are your inlining failures.

InliningThresholdDemo.java (Java)
public class InliningThresholdDemo {
    private static double calculateCircleArea(double radius) {
        return Math.PI * radius * radius;
    }

    private static double calculateCircleAreaVerbose(double radius) {
        double pi = Math.PI;
        double radiusSquared = radius * radius;
        double rawArea = pi * radiusSquared;
        double roundedArea = Math.round(rawArea * 1_000_000.0) / 1_000_000.0;
        if (Double.isNaN(roundedArea) || Double.isInfinite(roundedArea)) {
            throw new ArithmeticException("Invalid radius produced non-finite area: " + radius);
        }
        return roundedArea;
    }

    private static long benchmarkSmallMethod(int iterations) {
        double accumulator = 0.0;
        long startNanos = System.nanoTime();
        for (int i = 1; i <= iterations; i++) {
            double radius = i * 0.001;
            accumulator += calculateCircleArea(radius);
        }
        long elapsedNanos = System.nanoTime() - startNanos;
        System.out.printf("  Small method total area sum: %.2f%n", accumulator);
        return elapsedNanos;
    }

    private static long benchmarkVerboseMethod(int iterations) {
        double accumulator = 0.0;
        long startNanos = System.nanoTime();
        for (int i = 1; i <= iterations; i++) {
            double radius = i * 0.001;
            accumulator += calculateCircleAreaVerbose(radius);
        }
        long elapsedNanos = System.nanoTime() - startNanos;
        System.out.printf("  Verbose method total area sum: %.2f%n", accumulator);
        return elapsedNanos;
    }

    public static void main(String[] args) {
        int warmupIterations = 200_000;
        int benchIterations  = 2_000_000;
        System.out.println("Warming up JIT (both methods to Tier 4)...");
        benchmarkSmallMethod(warmupIterations);
        benchmarkVerboseMethod(warmupIterations);
        System.out.println("\n--- Benchmark (" + benchIterations + " iterations each) ---");
        long smallNanos   = benchmarkSmallMethod(benchIterations);
        long verboseNanos = benchmarkVerboseMethod(benchIterations);
        System.out.printf("%n  Small method (likely inlined):   %,d ms%n", smallNanos / 1_000_000);
        System.out.printf("  Verbose method (may not inline): %,d ms%n", verboseNanos / 1_000_000);
        System.out.printf("  Overhead factor: %.2fx%n", (double) verboseNanos / smallNanos);
        System.out.println("\nCheck PrintInlining output for '@ X callee is too large' to confirm.");
    }
}
Output
Small method (likely inlined): 8 ms
Verbose method (may not inline): 31 ms
Overhead factor: 3.87x
Interview Gold: Why Getters Should Be Tiny
This is the real engineering reason to keep getters and utility methods concise — it's not style, it's JIT physics. A getter that's 5 bytecodes inlines everywhere it's called. Add a null check, a log statement, and a metrics increment and it might cross MaxInlineSize. Now every call to it carries the overhead of a real method call plus it blocks all cross-boundary optimizations in the caller. Profile first, but understand why the boundary matters.
Production Insight
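At call sites that are not yet hot, the 35-byte MaxInlineSize limit means one extra null check can push a method out of inline range; hot call sites get the larger FreqInlineSize budget (325 bytes by default).
On a tight hot path, a missed inline can cost a multiple in throughput, so measure before and after.
Monitor your compiled code with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining to catch regressions.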
Key Takeaway
Inlining is the master enabler: it turns separate method calls into one optimizable unit.
Your method's bytecode size is a performance interface — keep it small and focused.
A method that doesn't inline blocks all cross-boundary optimizations (constant propagation, escape analysis).

Production JIT Gotchas: Warmup Strategies, OSR, and the Flags That Actually Matter

On-Stack Replacement (OSR) is a JIT feature you've almost certainly benefited from without knowing its name. Normally, a method is compiled and the next invocation runs the compiled version. But what about a method with a loop that runs for ten million iterations in a single call? Without OSR, you'd interpret all ten million iterations because the method never returns to get recompiled. OSR solves this by replacing the executing method frame mid-execution — the JIT compiles the method while it runs and swaps the stack frame to the compiled version at a loop back-edge. OSR-compiled code is slightly less optimal than normal JIT-compiled code because the frame layout must match the interpreter's at the replacement point, limiting some optimizations.

For microservices and serverless, warmup is an existential problem. Your JIT hasn't seen enough traffic to compile the hot paths, so your first thousand requests are slow, potentially violating SLAs. Three production strategies work: (1) Replay-based warmup: recorded traffic is replayed at startup before the instance joins the load balancer. (2) Ahead-of-time profile reuse: GraalVM's PGO (Profile-Guided Optimization) feeds profiles from a training run into the build, and recent JDKs can store method profiles in the AOT cache (JEP 515, JDK 25). Plain CDS (Class Data Sharing) speeds up class loading but does not carry JIT profiles. (3) JVM flag tuning: with tiered compilation (the default), lowering -XX:Tier3InvocationThreshold, -XX:Tier4InvocationThreshold (default 5,000), or -XX:Tier4CompileThreshold (default 15,000) compiles sooner at the cost of less profile data, which means slightly less optimal code but faster warmup. -XX:CompileThreshold only takes effect when tiered compilation is disabled.
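Strategy (1) can be sketched in a few lines. This is a minimal illustration, not a production harness: the ReplayWarmup class, the request strings, and the handler are hypothetical stand-ins for your recorded traffic and your real request handler, and the readiness flag is what a health endpoint would consult.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;

// Replays recorded requests through the handler before reporting "ready".
final class ReplayWarmup {
    private final AtomicBoolean ready = new AtomicBoolean(false);

    /** Replays every recorded request 'rounds' times, then marks the instance ready. */
    <Q, R> long warmUp(List<Q> recordedRequests, Function<Q, R> handler, int rounds) {
        long sink = 0;
        for (int round = 0; round < rounds; round++) {
            for (Q request : recordedRequests) {
                R response = handler.apply(request);
                // Consume the result so the JIT cannot discard the work as dead code.
                sink += (response == null) ? 0 : response.hashCode();
            }
        }
        ready.set(true);
        return sink;
    }

    /** What a readiness probe would check. */
    boolean isReady() {
        return ready.get();
    }

    public static void main(String[] args) {
        ReplayWarmup warmup = new ReplayWarmup();
        List<String> recorded = List.of("GET /users/1", "GET /users/2", "POST /orders");
        // String::length stands in for the real request handler.
        long checksum = warmup.warmUp(recorded, String::length, 20_000);
        System.out.println("ready=" + warmup.isReady() + " checksum=" + checksum);
    }
}
```

The round count should be chosen so the hot handlers pass the tier-4 thresholds, and the recorded requests should cover the same types and branches production traffic will hit.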

GraalVM Native Image takes the opposite trade: it compiles everything AOT using Substrate VM, eliminating warmup entirely at the cost of peak throughput (no runtime profiles) and dynamic class loading.

OsrAndWarmupDemo.java (Java)
public class OsrAndWarmupDemo {
    private static double longRunningSetup(int iterationCount) {
        double runningTotal = 0.0;
        for (int i = 1; i <= iterationCount; i++) {
            runningTotal += Math.sqrt(i) * Math.log1p(i);
            if (i == 10_000) System.out.println("  [iteration 10,000] — JIT likely compiling this method NOW via OSR");
            if (i == 50_000) System.out.println("  [iteration 50,000] — now running in OSR-compiled native code");
        }
        return runningTotal;
    }

    private static void simulateWarmupPhase() {
        System.out.println("\n=== Warmup Phase ===");
        for (int warmupRound = 0; warmupRound < 5; warmupRound++) {
            double result = longRunningSetup(20_000);
            System.out.printf("  Warmup round %d result: %.2f%n", warmupRound + 1, result);
        }
        System.out.println("Warmup complete — instance ready to serve traffic.\n");
    }

    public static void main(String[] args) {
        System.out.println("=== Phase 1: Single Long-Running Method Call (OSR Demo) ===");
        long startNanos = System.nanoTime();
        double osrResult = longRunningSetup(5_000_000);
        long osrElapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
        System.out.printf("OSR demo result: %.4f | Elapsed: %d ms%n", osrResult, osrElapsedMs);
        simulateWarmupPhase();
        System.out.println("=== Phase 2: Post-warmup benchmark (fully C2 compiled) ===");
        startNanos = System.nanoTime();
        double warmResult = longRunningSetup(5_000_000);
        long warmElapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
        System.out.printf("Warm result: %.4f | Elapsed: %d ms%n", warmResult, warmElapsedMs);
        System.out.printf("Speedup after full warmup: %.1fx%n", (double) osrElapsedMs / Math.max(warmElapsedMs, 1));
    }
}
Output
OSR demo result: approximately 1.1e11 (digits omitted) | Elapsed: 312 ms
Warm result: approximately 1.1e11 (same value as the OSR run) | Elapsed: 89 ms
Speedup after full warmup: 3.5x
Watch Out: Microbenchmarks in main() Measure OSR, Not Peak Performance
When you write a benchmark loop directly in main(), the JIT compiles it via OSR — an inherently less-optimized compilation mode. Your benchmark results look worse than production reality because OSR-compiled code has constraints normal compilations don't. Always use JMH for Java microbenchmarks. JMH drives the method into normal (non-OSR) compiled state by invoking it via a framework harness that triggers standard compilation before the measurement window opens.
Production Insight
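If your first request after a cold start takes 10x longer than steady state, it is running interpreted or OSR-compiled code.
OSR-compiled loops tend to run somewhat slower than the same loops in a normal tier-4 compilation because of the constraints at the OSR entry point.
Pre-warm with representative traffic, or reuse profiles from a training run, so hot methods are compiled normally before real requests arrive.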
Key Takeaway
OSR lets long-running loops benefit from compilation without returning from the method.
But OSR-compiled code is less optimized — avoid it in benchmarks and hot loops.
The best warmup strategy depends on your container lifetime and SLA strictness.
Choosing Warmup Strategy
If: Short-lived containers (serverless, batch < 30s)
Use: AOT compilation (GraalVM Native Image), no warmup needed
If: Long-lived services with strict P99 SLA from first second
Use: Replay-based warmup: record traffic and replay it before going live
If: Long-lived services with steady traffic that can tolerate a 30s ramp-up
Use: Default JIT with standard thresholds; let it warm naturally
If: Experiencing latency spikes due to deoptimization on type changes
Use: Warm up with every implementation the call site will see, load classes eagerly, and reuse training-run profiles where your JDK supports it

JIT Profiling Internals: What Data the JVM Collects and How It Drives Optimizations

The JIT's effectiveness depends entirely on the quality of profiling data it collects during interpretation and C1-compiled execution. The JVM tracks four primary types of profiling data: invocation counters (number of times a method is called), back-edge counters (loop iterations), branch probabilities (taken/not taken for each conditional), and type profiles for every polymorphic call site (which concrete types are seen and how often).

The type profile is stored in a structure called the MethodData Object (MDO). For each profiled call site, the MDO records up to -XX:TypeProfileWidth receiver types (default 2) together with their counts; receivers beyond that are only counted in aggregate. One recorded type makes the site monomorphic, two make it bimorphic, and anything more is megamorphic. At a megamorphic site the JIT generally gives up on inlining and emits a real virtual or interface dispatch instead.
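A toy model makes the row limit easier to picture. The sketch below is an assumption-laden simplification: CallSiteProfileModel is a hypothetical class, not a JVM data structure, and it only mimics the "two rows, then overflow" behaviour implied by the default -XX:TypeProfileWidth=2. The real compiler applies further heuristics on top of these counts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of a per-call-site receiver type profile with a fixed number of rows.
final class CallSiteProfileModel {
    static final int TYPE_PROFILE_WIDTH = 2; // default of -XX:TypeProfileWidth

    private final Map<Class<?>, Long> rows = new LinkedHashMap<>();
    private long overflowCount = 0; // receivers that did not fit into a row

    /** Records one call with the given receiver object. */
    void record(Object receiver) {
        Class<?> type = receiver.getClass();
        if (rows.containsKey(type) || rows.size() < TYPE_PROFILE_WIDTH) {
            rows.merge(type, 1L, Long::sum);
        } else {
            overflowCount++; // only counted in aggregate, type identity is lost
        }
    }

    /** Classifies the call site the way the article's terminology does. */
    String shape() {
        if (overflowCount > 0) {
            return "megamorphic";
        }
        switch (rows.size()) {
            case 0:  return "unreached";
            case 1:  return "monomorphic";
            default: return "bimorphic";
        }
    }

    public static void main(String[] args) {
        CallSiteProfileModel site = new CallSiteProfileModel();
        site.record("text");
        System.out.println(site.shape()); // monomorphic
        site.record(42);
        System.out.println(site.shape()); // bimorphic
        site.record(42L);
        System.out.println(site.shape()); // megamorphic
    }
}
```

Once the third type arrives, the model can no longer say which types it is seeing, which is why a megamorphic site gives the compiler nothing to speculate on.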

You can list the methods a running JVM has compiled, with their compilation levels, using jcmd <PID> Compiler.codelist, and dump the collected MDOs at JVM exit with -XX:+UnlockDiagnosticVMOptions -XX:+PrintMethodData. This is invaluable when debugging why a hot method isn't being optimized the way you expect. For example, if you see that a call site is megamorphic (three or more receiver types at the same site with the default profile width), no inline cache will save you: redesign the code to reduce type variance at that point.

One common production surprise: branch profiling is biased by warmup traffic. If your warmup phase uses different data distributions than production traffic, the branch probabilities recorded during profiling will be wrong, leading to mis-speculated code paths and more deoptimization when real traffic arrives.

profile_inspection.sh (Shell)
#!/usr/bin/env bash
# Usage: ./profile_inspection.sh <pid>   (pid of the target JVM)
PID="${1:?usage: $0 <pid>}"

# List compiled methods with their compile id and compilation level
jcmd "$PID" Compiler.codelist

# Check whether (and at which level) a specific method is compiled
jcmd "$PID" Compiler.codelist | grep "method_name_here"

# MethodData (type and branch profiles) is printed at JVM exit and needs a
# diagnostic flag, so it is a launch option rather than a jcmd command:
# java -XX:+UnlockDiagnosticVMOptions -XX:+PrintMethodData -jar app.jar 2>&1 | grep -E "(method|receiver type|profile)"

# Snapshot the compiled-method list every 10 seconds (Ctrl-C to stop)
while true; do
  jcmd "$PID" Compiler.codelist > "/tmp/jit_codelist_$(date +%s).txt"
  sleep 10
done
Mental Model: The JIT as a Bayesian Optimizer
  • Profiling is the data-gathering phase (interpreted run).
  • C1 is a quick experiment — compiles with cheap assumptions.
  • C2 is the confident theory — compiles based on rich profile data.
  • Deoptimization is discovering your belief was wrong — restart the cycle.
Production Insight
Warmup traffic that doesn't match production distribution corrupts your profile data.
Branch probabilities stored during warmup may be 90/10 while production is 50/50.
This mis-speculation causes deoptimization exactly when real users arrive.
Use production-replayed traffic for warmup, not synthetic data.
Key Takeaway
The JIT's power comes from profiling data — but only if that data is representative.
Type profiles are the most influential: keep polymorphic call sites monomorphic or bimorphic.
Use -XX:+PrintInlining to see why a method didn't inline, and -XX:+PrintMethodData to inspect the MDO behind that decision (both need -XX:+UnlockDiagnosticVMOptions).
Production incident · POST-MORTEM · severity: high

Deoptimization Storm After Class Loading

Symptom
P99 latency jumped from 20ms to 5,000ms during the first request after a new class was loaded. No GC pauses, no CPU saturation.
Assumption
The team assumed it was a GC tuning problem or a database connection pool exhaustion.
Root cause
The JIT had speculatively inlined a virtual method call based on a monomorphic type profile. When a second implementation of that interface was loaded, every compiled method that had inlined the single type was immediately 'made not entrant' — forcing deoptimization and recompilation of hundreds of methods at once. This caused a stampede of compilation threads and degraded performance until the new profiles were re-established.
Fix
Changed the class loading strategy to eagerly load all plugin implementations during application startup (before accepting traffic). Also lowered the tier-4 compilation thresholds to speed up recompilation after deoptimization (note that -XX:Tier4InvocationThreshold already defaults to 5,000, while -XX:Tier4CompileThreshold defaults to 15,000). Added a warmup phase that invoked all known implementations before marking the pod healthy.
Key lesson
  • Deoptimization is not a failure — it's a safety net. But a storm of them will kill your latency.
  • Eager class loading at startup prevents type-profile-based deoptimization during peak traffic.
  • Monitor deoptimization events in staging (-XX:+PrintCompilation 'made not entrant' lines, or -Xlog:deoptimization=debug on recent JDKs) to catch class-loading patterns before they hit production.
Production debug guide
Symptom → Action mapping for JIT-related production issues (4 entries)
Symptom · 01
Application gets faster after 30+ seconds of traffic
Fix
That's expected warmup. Check if the warmup plateau meets your SLA. If not, implement pre-warm with replayed traffic or AOT-compiled profiles.
Symptom · 02
Latency spikes with no GC activity
Fix
Enable -XX:+PrintCompilation and, on recent JDKs, -Xlog:deoptimization=debug. Look for 'made not entrant' lines. Correlate with class loading events or type profile changes.
Symptom · 03
Benchmark results in main() show 2-5x slower than expected
Fix
Your benchmark is suffering from OSR compilation and dead-code elimination. Rewrite it using JMH with Blackhole.consume(). Avoid writing benchmarks in main() loops.
Symptom · 04
Methods not inlining even though they're hot
Fix
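Check -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining. Look for 'too big', 'hot method too big', or 'callee is too large' messages. They mean the method's bytecode size exceeds MaxInlineSize (default 35 bytes) or, at hot call sites, FreqInlineSize (default 325 bytes). Split the method or increase the threshold after measuring.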
JIT Troubleshooting Cheat Sheet
Quick commands to diagnose JIT compilation and deoptimization in production
Suspected deoptimization storm
Immediate action
Check if class loading is happening during requests
Commands
jcmd <PID> Compiler.codelist
jcmd <PID> VM.classloader_stats
Fix now
Add -XX:+PrintCompilation and -Xlog:class+load=info (plus -Xlog:deoptimization=debug on recent JDKs) to the JVM flags and restart. Ensure class loading happens before traffic.
Methods not reaching peak throughput
Immediate action
Verify compilation is happening
Commands
jcmd <PID> Compiler.codecache
jcmd <PID> Compiler.queue
Fix now
Check for compilation queue backlog. Increase -XX:CICompilerCount if needed.
Warmup too slow for SLA
Immediate action
Check if hot methods are being compiled
Commands
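jcmd <PID> Thread.print -l (look for compilation threads)
jcmd <PID> VM.flags -all (verify TieredCompilation is enabled)
Fix now
Implement pre-warm with recorded traffic. If that is not enough, lower the tier-4 thresholds (-XX:Tier4InvocationThreshold, default 5,000; -XX:Tier4CompileThreshold, default 15,000) and measure the effect.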
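When restarting with extra flags is not an option, the standard management beans give a coarse in-process view of the same signals. The sketch below is one possible approach, not an established recipe: JitActivitySnapshot is a hypothetical helper, while ClassLoadingMXBean and CompilationMXBean are the real java.lang.management APIs. A burst of newly loaded classes followed by a jump in accumulated compile time is the pattern to watch for.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

// Point-in-time reading of class loading and JIT compilation counters.
final class JitActivitySnapshot {
    final long loadedClasses;  // total classes loaded since JVM start
    final long compileMillis;  // accumulated JIT time, or -1 if not reported

    private JitActivitySnapshot(long loadedClasses, long compileMillis) {
        this.loadedClasses = loadedClasses;
        this.compileMillis = compileMillis;
    }

    static JitActivitySnapshot take() {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean(); // null without a JIT
        long millis = (jit != null && jit.isCompilationTimeMonitoringSupported())
            ? jit.getTotalCompilationTime()
            : -1;
        return new JitActivitySnapshot(classes.getTotalLoadedClassCount(), millis);
    }

    public static void main(String[] args) {
        JitActivitySnapshot before = take();
        long sink = 0;
        for (int i = 0; i < 2_000_000; i++) {
            sink += Integer.toString(i).length(); // some work for the JIT to compile
        }
        JitActivitySnapshot after = take();
        System.out.printf("classes loaded: +%d, JIT time: +%d ms (sink=%d)%n",
            after.loadedClasses - before.loadedClasses,
            after.compileMillis - before.compileMillis,
            sink);
    }
}
```

Sampling this once per second and exporting the deltas as metrics is usually enough to line up class-loading bursts with latency spikes.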
Execution Model Comparison
| Aspect | Interpreter | JIT (C1/C2 Tiered) | AOT (GraalVM Native) |
| --- | --- | --- | --- |
| Startup latency | Instant start, slow execution | Fast start, warming over ~10k invocations | Instant start, instant peak speed |
| Peak throughput | ~10-50x slower than native | Near-native (within 5-20% of C) | Good but below JIT peak: no runtime profiles |
| Memory overhead | Low (no compiled code cache) | JIT code cache: typically 64-256 MB | Lowest: binary includes only reachable code |
| Dynamic class loading | Full support | Full support | Not supported: closed-world assumption |
| Profile-guided opts | None | Full: type profiles, branch frequencies | Partial: requires offline PGO training run |
| Deoptimization | N/A: nothing to deopt | Yes, on assumption violations | N/A: static binary, no speculative opts |
| Reflection support | Full | Full | Partial: requires config hints at build time |
| Ideal workload | Short scripts, startup-critical CLIs | Long-running services, throughput servers | Serverless, CLIs, latency-sensitive cold starts |
| Debugging/profiling | Easy | Moderate: async-profiler recommended | Hard: limited runtime introspection |

Key takeaways

1. The JIT's real power is not compilation: it's speculative optimization using runtime profiles. It inlines virtual calls that static compilers can never inline because it knows what type actually shows up 99% of the time.
2. Deoptimization is not a failure: it's a safety net that makes speculative optimization safe to deploy. The danger is silent deopt storms from late class loading or type profile changes during peak traffic.
3. Inlining is the master optimization: when a callee is inlined, constants propagate across the boundary, dead branches disappear, and heap allocations can become stack allocations. Your method's bytecode size (not line count) is what controls whether it inlines.
4. Never microbenchmark in a plain main() loop on the JVM. OSR compilation, dead-code elimination, and lack of proper warmup mean you're measuring the JIT's warm-up artifact, not your code's steady-state performance. JMH exists for a reason.
5. Type profiling is the lifeblood of JIT performance. Keep polymorphic call sites monomorphic or bimorphic. One megamorphic site in a hot path can kill inlining for the entire method chain.

Common mistakes to avoid


Writing JVM microbenchmarks in a plain main() loop

Symptom
The JIT compiles the loop via On-Stack Replacement (OSR), which is less optimized than standard compilation. Results are 2-5x slower than real peak performance.
Fix
Use JMH (Java Microbenchmark Harness). JMH drives methods into standard compiled state via repeated invocation before opening the measurement window, giving you true steady-state numbers.

Assuming 'the JVM warms up in a few seconds'

Symptom
Driving every hot path in a real application to Tier 4 (C2) commonly takes 30,000–100,000 invocations per method, well past the nominal ~15,000 trigger. At 1,000 requests/second you might need 30+ seconds of real traffic to fully warm.
Fix
Load-test with realistic traffic for at least 60 seconds before recording performance baselines. Run an explicit warmup phase at startup that replays stored traffic, and keep the Kubernetes readiness probe failing until that phase completes, so the pod only receives live traffic once it is warm.
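
A warmup gate along those lines might look like the sketch below. `WarmupGate`, `warmUp`, and `secondsToWarm` are hypothetical names for this illustration, not part of any framework, and the counter-based inputs stand in for recorded traffic. The idea is that a readiness endpoint reports `isReady()` and the flag only flips after the handler has been invoked enough times.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntUnaryOperator;

public class WarmupGate {
    private final AtomicBoolean ready = new AtomicBoolean(false);
    private int completedCalls;
    private long sink; // accumulates results so the warmup work is not dead code

    // Back-of-envelope estimate: seconds of traffic needed for a method that
    // runs once per request to reach a given invocation count.
    static double secondsToWarm(long invocationsNeeded, double requestsPerSecond) {
        return invocationsNeeded / requestsPerSecond;
    }

    // Invokes the handler once per iteration, then opens the gate.
    // In a real service the inputs would come from replayed traffic.
    void warmUp(IntUnaryOperator handler, int iterations) {
        for (int i = 0; i < iterations; i++) {
            sink += handler.applyAsInt(i);
            completedCalls++;
        }
        ready.set(true);
    }

    // A readiness endpoint would return this value.
    boolean isReady() {
        return ready.get();
    }

    int completedCalls() {
        return completedCalls;
    }

    public static void main(String[] args) {
        WarmupGate gate = new WarmupGate();
        System.out.println("ready before warmup: " + gate.isReady());
        gate.warmUp(x -> x * 31 + 7, 30_000);
        System.out.println("ready after warmup: " + gate.isReady());
        System.out.println("seconds at 1,000 rps: " + secondsToWarm(30_000, 1_000));
    }
}
```

Synthetic inputs can train the JIT on the wrong types and branches, which is why replaying realistic traffic matters more than the raw call count.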
×

Touching -XX:CompileThreshold and -XX:MaxInlineSize without measuring

Symptom
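Developers lower CompileThreshold hoping for faster warmup, but the JIT then compiles with less profile data, so its speculative inlining bets are wrong more often. The result is more deoptimizations and ultimately worse peak throughput. Note also that -XX:CompileThreshold is ignored while tiered compilation is enabled (the default); the per-tier thresholds apply instead.
Fix
Measure warmup time against peak throughput as a trade-off curve specific to your workload. Use -XX:+PrintCompilation and -XX:+PrintDeoptimization to count deopt events before and after flag changes (in PrintCompilation output, invalidated methods appear as "made not entrant"). Only tune after you have data.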
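
One low-effort way to put a number on "how much compiling happened" is the standard `CompilationMXBean` from `java.lang.management`. The sketch below samples cumulative JIT time before and after a workload. `JitTimeProbe` and `work` are hypothetical names, and the loop is a stand-in for your real code path. Some JVMs do not support compilation time monitoring, which the helper reports as -1.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

public class JitTimeProbe {
    // Cumulative JIT compilation time in milliseconds,
    // or -1 if this JVM does not report it.
    static long totalCompilationMillis() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            return -1;
        }
        return jit.getTotalCompilationTime();
    }

    // Stand-in workload: sum of squares below n.
    static long work(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    public static void main(String[] args) {
        long before = totalCompilationMillis();
        long checksum = 0;
        for (int i = 0; i < 20_000; i++) {
            checksum += work(1_000);
        }
        long after = totalCompilationMillis();

        System.out.println("checksum = " + checksum);
        if (before < 0) {
            System.out.println("compilation time monitoring not supported");
        } else {
            System.out.println("JIT time during workload (ms): " + (after - before));
        }
    }
}
```

Run the same workload with and without a flag change and compare the JIT time alongside throughput; a flag that shortens warmup but raises total compile time is a sign of recompilation churn.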
×

Ignoring polymorphism at hot call sites

Symptom
A call site that receives more than two concrete receiver types becomes megamorphic, and the JIT generally stops inlining through it. The virtual dispatch overhead plus the missed cross-boundary optimizations can cost 10-30% throughput.
Fix
Use the output of -XX:+PrintInlining (a diagnostic flag, so it also needs -XX:+UnlockDiagnosticVMOptions) to identify megamorphic call sites. Refactor to reduce type variance: use the Strategy pattern sparingly on hot paths, or handle the common types with explicit instanceof checks.
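
The instanceof approach can be sketched as follows. All class names here (`Shape`, `Square`, `Rect`, `Triangle`) are hypothetical. `totalArea` has a single virtual call site whose type profile depends entirely on what callers pass in, while `totalAreaChecked` handles the two types assumed to be common with explicit checks and leaves the virtual call for everything else. Both return the same result; any speed difference depends on your actual type mix and should be measured.

```java
import java.util.Arrays;
import java.util.List;

public class CallSiteShapes {
    interface Shape {
        double area();
    }

    static final class Square implements Shape {
        final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    static final class Rect implements Shape {
        final double w;
        final double h;
        Rect(double w, double h) { this.w = w; this.h = h; }
        public double area() { return w * h; }
    }

    static final class Triangle implements Shape {
        final double base;
        final double height;
        Triangle(double base, double height) { this.base = base; this.height = height; }
        public double area() { return 0.5 * base * height; }
    }

    // One virtual call site: s.area(). With three or more receiver types
    // flowing through it, the site is megamorphic.
    static double totalArea(List<Shape> shapes) {
        double total = 0;
        for (Shape s : shapes) {
            total += s.area();
        }
        return total;
    }

    // Common types are handled with explicit checks; rare types fall back
    // to the virtual call.
    static double totalAreaChecked(List<Shape> shapes) {
        double total = 0;
        for (Shape s : shapes) {
            if (s instanceof Square) {
                Square sq = (Square) s;
                total += sq.side * sq.side;
            } else if (s instanceof Rect) {
                Rect r = (Rect) s;
                total += r.w * r.h;
            } else {
                total += s.area();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Shape> mixed = Arrays.asList(new Square(2), new Rect(2, 3), new Triangle(4, 3));
        System.out.println(totalArea(mixed));        // prints 16.0
        System.out.println(totalAreaChecked(mixed)); // prints 16.0
    }
}
```

The trade-off is maintainability: the checked version duplicates logic that polymorphism was hiding, so reserve it for call sites that profiling shows are both hot and megamorphic.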
×

Using reflection on hot paths without warming up

Symptom
On JDKs before 18, the first reflective calls to a method go through a slow native accessor that the JIT cannot inline. Performance can be 5x slower than direct calls, and it does not approach peak until the reflection runtime generates a bytecode accessor, which only happens after repeated reflective calls.
Fix
Cache reflective lookups or, better, use MethodHandles (java.lang.invoke), which the JIT can inline like normal method calls when the handle is stored in a static final field. Warm up the reflection paths during application startup.
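
A minimal sketch of the cached-lookup pattern, using `String.length()` as a stand-in target. The class name `ReflectionVsHandle` and its two helper methods are hypothetical. The lookup happens once in a static initializer, and the `static final` field lets the JIT treat the handle as a constant.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.reflect.Method;

public class ReflectionVsHandle {
    // Looked up once and cached. static final matters: it lets the JIT
    // treat the handle as a constant and inline through it.
    private static final MethodHandle LENGTH_HANDLE;
    private static final Method LENGTH_METHOD;

    static {
        try {
            LENGTH_HANDLE = MethodHandles.lookup()
                    .findVirtual(String.class, "length", MethodType.methodType(int.class));
            LENGTH_METHOD = String.class.getMethod("length");
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // invokeExact requires the call site to match the handle's type exactly:
    // here (String) -> int.
    static int viaHandle(String s) {
        try {
            return (int) LENGTH_HANDLE.invokeExact(s);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
    }

    // Classic reflection with a cached Method: boxes the result and
    // takes Object arguments.
    static int viaReflection(String s) {
        try {
            return (Integer) LENGTH_METHOD.invoke(s);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(viaHandle("deopt"));      // prints 5
        System.out.println(viaReflection("deopt"));  // prints 5
    }
}
```

Storing the handle in an instance field or a map gives up most of the benefit, because the JIT can no longer assume the target is constant.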
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
Walk me through exactly what happens inside the JVM the first time a met...
Q02 · SENIOR
What is deoptimization, when does it trigger in a production JVM, and ho...
Q03 · SENIOR
You're asked to benchmark two string concatenation approaches — using '+...
Q04 · SENIOR
What is escape analysis and how does it interact with JIT inlining? Can ...
Q01 of 04 · SENIOR

Walk me through exactly what happens inside the JVM the first time a method is called, the 2,000th time, and the 15,000th time — specifically what the JIT does at each threshold and why tiered compilation exists instead of going straight to C2.

ANSWER
First call: pure interpretation in Tier 0. The interpreter profiles invocation count, back-edge count, branch probabilities, and receiver types. Around 2,000 combined invocations and loop back-edges (Tier3CompileThreshold; a loop-free method can qualify after only 200 calls via Tier3InvocationThreshold), C1 compiles the method with light optimizations: constant folding, simple inlining, and dead code elimination. C1-compiled code is 2-5x faster than interpreted code but still carries profiling instrumentation. At ~15,000 (Tier4CompileThreshold, or 5,000 plain invocations via Tier4InvocationThreshold), C2 takes over. C2 spends more time compiling (up to hundreds of milliseconds for large methods) because it uses the profiling data collected during C1 execution to make aggressive speculative optimizations: virtual call inlining, escape analysis, loop unrolling, and more. Tiered compilation exists because going straight to C2 would waste CPU compiling methods that might be executed only a few times, and the profile data collected during C1 makes C2's optimizations far more effective. These figures are HotSpot defaults, and the JVM adjusts them under compiler queue load.
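
The threshold logic can be modeled in a few lines. This is a rough sketch, not HotSpot's implementation: it assumes the default flag values listed in the constants (check yours with -XX:+PrintFlagsFinal), treats counters as cumulative, and ignores OSR, load-based scaling, and compile queue delays. Under those assumptions, the 2,000 and 15,000 figures correspond to the CompileThreshold flags, which count invocations plus loop back-edges, while the InvocationThreshold flags (200 and 5,000) cover methods that are called often but contain no hot loops. `TierModel` and its methods are hypothetical names.

```java
public class TierModel {
    // Assumed HotSpot defaults; verify with -XX:+PrintFlagsFinal on your JVM.
    static final int TIER3_INVOCATION_THRESHOLD = 200;
    static final int TIER3_MIN_INVOCATION_THRESHOLD = 100;
    static final int TIER3_COMPILE_THRESHOLD = 2_000;
    static final int TIER4_INVOCATION_THRESHOLD = 5_000;
    static final int TIER4_MIN_INVOCATION_THRESHOLD = 600;
    static final int TIER4_COMPILE_THRESHOLD = 15_000;

    // A method qualifies if it is called often enough on its own, or if it
    // is called a minimum number of times and calls plus loop back-edges
    // together exceed the compile threshold.
    static boolean crosses(long invocations, long backEdges,
                           int invocationThreshold, int minInvocationThreshold,
                           int compileThreshold) {
        return invocations > invocationThreshold
                || (invocations > minInvocationThreshold
                    && invocations + backEdges > compileThreshold);
    }

    // Simplified tier for a method with the given cumulative counters.
    static int tierFor(long invocations, long backEdges) {
        if (crosses(invocations, backEdges, TIER4_INVOCATION_THRESHOLD,
                TIER4_MIN_INVOCATION_THRESHOLD, TIER4_COMPILE_THRESHOLD)) {
            return 4;
        }
        if (crosses(invocations, backEdges, TIER3_INVOCATION_THRESHOLD,
                TIER3_MIN_INVOCATION_THRESHOLD, TIER3_COMPILE_THRESHOLD)) {
            return 3;
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(tierFor(1, 0));        // prints 0
        System.out.println(tierFor(2_000, 0));    // prints 3
        System.out.println(tierFor(15_000, 0));   // prints 4
        System.out.println(tierFor(700, 14_500)); // prints 4
    }
}
```

The last call shows why loop-heavy methods reach C2 with far fewer invocations than the headline number suggests: back-edges count toward the compile threshold.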
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
Why does my Java application get faster after running for a while?
02
What's the difference between JIT compilation and AOT compilation?
03
Does the JIT compiler work differently for JavaScript than for Java?
04
What is the 'code cache' and how do I know if it's full?
05
Should I disable tiered compilation with -XX:-TieredCompilation?