Advanced 10 min · March 06, 2026

Just-In-Time Compilation

JIT Deoptimization — 250x Latency from Class Loading

Q: Why does my Java application get faster after running for a while?

This is JIT compilation kicking in. The JVM starts by interpreting your bytecode while collecting profiling data about which methods run most and what types they receive. Once a method crosses invocation thresholds (~2,000 for C1, ~15,000 for C2), the JIT compiles it to native machine code using those profiles for aggressive optimization. The process typically plateaus after 30-60 seconds of realistic traffic.

Q: What's the difference between JIT compilation and AOT compilation?

JIT compiles code at runtime using actual execution profiles, enabling speculative optimizations like virtual call inlining that no static compiler can safely perform. AOT (like GraalVM Native Image) compiles everything to a native binary before execution, giving instant startup and no warmup cost but losing the ability to optimize based on actual runtime behavior. For long-running throughput servers, JIT typically wins on peak performance. For CLIs and serverless, AOT wins on startup latency.

Q: Does the JIT compiler work differently for JavaScript than for Java?

The high-level strategy is similar (profile hot paths and compile them to native code), but the challenges differ dramatically. JavaScript is dynamically typed, so the JIT must profile the shapes of objects and deoptimize aggressively when shapes change. V8's Ignition interpreter feeds TurboFan with type feedback just as HotSpot's interpreter feeds C2. The key difference is that Java's static type system gives the JIT much stronger guarantees from the start, while JavaScript JITs must be far more defensive about deoptimization.

Q: What is the 'code cache' and how do I know if it's full?

The code cache is the memory region where the JVM stores compiled native code. If it fills up, the JVM stops compiling new methods, so newly hot code stays in the interpreter, and it may flush cold compiled methods to free space. Monitor it with jcmd <pid> Compiler.codecache on a running process, or with -XX:+PrintCodeCache, which prints usage when the JVM exits. The default reserved size is typically 240 MB when tiered compilation is enabled. If you see the 'CodeCache is full. Compiler has been disabled.' warning, increase -XX:ReservedCodeCacheSize. A full code cache severely degrades throughput.
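If you'd rather watch the code cache from inside the process (for a metrics endpoint, say), the standard memory-pool MXBeans expose it. The sketch below is a minimal example under one assumption: that the JVM names its code-cache pools "CodeCache" or "CodeHeap '...'", as HotSpot builds typically do. The class name CodeCacheProbe is made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

class CodeCacheProbe {
    /** Sums used bytes across memory pools that look like code-cache pools. */
    static long usedBytes() {
        long used = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Assumption: HotSpot reports either one "CodeCache" pool or
            // several segmented "CodeHeap '...'" pools.
            if (name.equals("CodeCache") || name.startsWith("CodeHeap")) {
                MemoryUsage usage = pool.getUsage(); // may be null for an invalid pool
                if (usage != null) {
                    used += usage.getUsed();
                }
            }
        }
        return used;
    }

    public static void main(String[] args) {
        System.out.printf("Code cache used: %,d bytes%n", usedBytes());
    }
}
```

Comparing this number against your -XX:ReservedCodeCacheSize setting over time gives an early warning well before the compiler shuts off.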

Q: Should I disable tiered compilation with -XX:-TieredCompilation?

Almost never. With tiered compilation disabled, a server JVM skips C1: methods stay in the interpreter until they reach -XX:CompileThreshold (10,000 invocations by default) and are then compiled directly by C2. That lengthens warmup, because hot code runs interpreted for much longer before any native code exists. It may improve peak throughput slightly for trivial benchmarks but hurts real applications. Only disable tiered compilation if you've measured and confirmed that the trade pays off for your specific workload.

P99 latency jumps 250x when class loading causes JIT deoptimization storm.

Naren Founder & Principal Engineer

20+ years shipping production systems from the metal up. Lessons pulled from things that broke in production.

July 19, 2026

last updated

Before you start⏱ 30 min

✓Deep production experience
✓Understanding of internals and trade-offs
✓Experience debugging complex systems

⚡Quick Answer

JIT compilation converts bytecode to native machine code at runtime based on profiling data
Tiered compilation: Interpreter (Tier 0) → C1 (Tiers 1-3) → C2 (Tier 4)
Performance insight: C2-compiled code is within 5-20% of hand-written C, but a method needs ~15K invocations to trigger C2
Production insight: Deoptimization storms from late class loading can cause latency spikes that look like GC pauses
Biggest mistake: Assuming warmup is instantaneous. Real production services need 30-60 seconds of realistic traffic to hit peak throughput

✦ Definition~90s read

What is Just-In-Time Compilation?

Just-in-time (JIT) compilation is the runtime engine that converts JVM bytecode into native machine code during program execution, trading upfront compilation cost for massive throughput gains on hot code paths. Unlike static compilation (e.g., C++ with GCC), JIT observes actual runtime behavior — branch frequencies, type profiles, monomorphic call sites — and uses that data to make aggressive, speculative optimizations that no ahead-of-time compiler can match.

The JVM's C1 (client) and C2 (server) compilers operate in a tiered pipeline: bytecode starts interpreted, graduates to C1 for quick profiling, and finally C2 applies heavy optimizations like inlining, lock coarsening, and loop unrolling. This is why a warmed-up JVM can outperform naively compiled C++ on certain workloads — the JIT adapts to your actual data.

The dark side is deoptimization: when the JIT's speculative assumptions (e.g., 'this virtual call always targets class A') break at runtime — say, a new class loads that implements the interface — the JVM must roll back to an earlier safe state, often at the next safepoint. This 'bailout' can spike latency by 250x or more because the JIT discards optimized native code and falls back to interpreted execution or a lower-tier compilation.

In production, this manifests as mysterious p99 spikes after a classloader operation (e.g., dynamic proxy generation, Spring context refresh, or Groovy script loading). The JIT's own profiling data, including method invocation counters, back-edge counts for OSR (on-stack replacement), and receiver type profiles, drives these decisions, but it's a black box unless you instrument with -XX:+PrintCompilation and -XX:+TraceDeoptimization (a diagnostic flag on recent JDKs, so it must be preceded by -XX:+UnlockDiagnosticVMOptions).

You should care about JIT internals when tuning latency-sensitive services (trading, ad serving, real-time APIs). Warmup strategies, such as lowering the tiered compile thresholds or using a profile-persisting feature like Azul's ReadyNow, can get known hot paths compiled sooner. But the real gotcha is class loading: every time you load a new class that implements an interface or overrides a method, you risk invalidating inlined call sites.

Tools like JITWatch or the JFR event jdk.Deoptimization let you see these events in production. If you're running a polyglot JVM (Groovy, JRuby, Scala) or heavy reflection, expect deoptimization storms. The alternative is GraalVM's native image (AOT compilation), which eliminates JIT entirely but sacrifices adaptive optimization — a tradeoff you make when startup time or peak latency predictability matters more than raw throughput.
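The jdk.Deoptimization event mentioned above can also be captured programmatically with the jdk.jfr API, which is handy in a staging test. The following is a rough sketch, not a production recipe: the class name DeoptRecorder and the tiny type-flipping workload are invented for illustration, and whether any events actually show up depends on your JDK version and on what the JIT decided to compile during the run.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

class DeoptRecorder {
    /** Runs the workload under a JFR recording and returns the deoptimization events seen. */
    static List<RecordedEvent> record(Runnable workload) {
        try (Recording recording = new Recording()) {
            recording.enable("jdk.Deoptimization");
            recording.start();
            workload.run();
            recording.stop();
            Path file = Files.createTempFile("deopt", ".jfr");
            try {
                recording.dump(file);
                return RecordingFile.readAllEvents(file).stream()
                        .filter(e -> e.getEventType().getName().equals("jdk.Deoptimization"))
                        .collect(Collectors.toList());
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical workload: warm a call site with one type, then introduce a second.
    interface Op { int apply(int x); }
    static final class AddOne implements Op { public int apply(int x) { return x + 1; } }
    static final class AddTwo implements Op { public int apply(int x) { return x + 2; } }

    static long sink; // keeps the loop results observable

    static int call(Op op, int x) { return op.apply(x); }

    static void typeFlipWorkload() {
        Op one = new AddOne();
        Op two = new AddTwo();
        long sum = 0;
        for (int i = 0; i < 200_000; i++) sum += call(one, i);
        for (int i = 0; i < 200_000; i++) sum += call((i & 1) == 0 ? one : two, i);
        sink = sum;
    }

    public static void main(String[] args) {
        List<RecordedEvent> events = record(DeoptRecorder::typeFlipWorkload);
        System.out.println("jdk.Deoptimization events captured: " + events.size());
        events.stream().limit(3).forEach(System.out::println);
    }
}
```

Printing a few events in full is a quick way to see which fields your JDK records before you build dashboards around them.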

Plain-English First

Imagine a chef who receives recipe cards written in a foreign language. A traditional interpreter reads each instruction one at a time, translating as they cook — slow but starts immediately. A JIT compiler is like a chef who notices they make the same dish fifty times a day, so they memorize it in their native language and execute it from muscle memory from then on. The more they cook it, the faster they get — because the work of translating happens once and the result gets reused forever.

Every time you run a Java or JavaScript program and it magically gets faster the longer it runs, that's a Just-In-Time compiler quietly doing something remarkable: watching your code execute, figuring out which paths are traveled most, and recompiling those exact paths into hyper-optimized native machine code at runtime. No restart required, no ahead-of-time guessing. The JIT is one of the most sophisticated pieces of software running silently in your production systems right now.

The problem it solves is fundamental: interpreted languages are portable because they run on a virtual machine, but virtual machines are slow because they translate instructions at runtime. Ahead-of-time compilers solve speed but sacrifice runtime information — they can't know which branch your users actually take or what types your polymorphic methods actually receive. JIT compilation threads this needle by compiling adaptively, using real execution data to make optimizations no static compiler could ever make.

By the end of this article you'll understand exactly how HotSpot's tiered compilation pipeline works, what profiling data the JIT actually collects, why deoptimization exists and when it fires, how to read JIT logs to debug performance regressions, and what production patterns silently kill JIT effectiveness. You'll go from 'the JVM warms up' to 'I can explain exactly what's happening during warmup and why.'

JIT Compilation: The Hot Path Optimizer That Can Also Burn You

Just-in-time (JIT) compilation is a runtime technique that converts bytecode into native machine code during program execution, targeting only frequently executed code paths. The core mechanic: the JVM profiles method invocations and loop iterations, then compiles the hot spots, typically methods invoked thousands of times (roughly 2,000 for C1 and 15,000 for C2 under HotSpot's defaults), into optimized native code. This gives Java near-native performance while preserving portability.

In practice, the JIT compiler uses tiered compilation: first interpreting bytecode, then compiling with C1 (quick, minimal optimization), and finally with C2 (aggressive, profile-guided optimization). The key property: compilation happens asynchronously on a background thread, so your application keeps running during compilation. But deoptimization — reverting to interpreted code — can happen when assumptions made during compilation are invalidated, such as when a new class is loaded that changes the class hierarchy.

Use JIT when you need both portability and performance — essentially any server-side Java application. It matters because the difference between interpreted and JIT-compiled code can be 10-100x on hot paths. However, the cost of deoptimization can spike latency by 250x in pathological cases, making it critical to understand what triggers recompilation.

⚠ Deoptimization Is Not a Bug

JIT deoptimization is a correctness mechanism, not a performance bug — but it can amplify latency when triggered by unexpected class loading in production.

📊 Production Insight

Teams using dynamic class loading (e.g., plugin systems, scripting engines) often see latency spikes after a new plugin is deployed.

The symptom: P99 latency jumps from 2ms to 500ms for several seconds as the JIT recompiles all affected methods.

Rule of thumb: warm up your JIT with representative class loads before serving traffic, and use -XX:CompileCommand to control how critical methods are compiled or inlined.

🎯 Key Takeaway

JIT compiles only hot paths — cold code runs interpreted, so benchmark after warmup.

Deoptimization can cause 250x latency spikes when class hierarchy assumptions break.

Profile-guided optimization means performance is workload-dependent — test with production traffic patterns.

thecodeforge.io

Just In Time Compilation

The JIT Pipeline: From Bytecode to Native Code in Three Tiers

HotSpot JVM doesn't flip a single switch from 'interpreted' to 'compiled'. It runs a tiered system with five distinct levels, though three are conceptually important: pure interpretation (Tier 0), the C1 client compiler (Tiers 1-3), and the C2 server compiler (Tier 4).

Tier 0 is pure interpretation — the interpreter executes bytecode directly and, critically, it's also gathering profiling data: method invocation counts, branch frequencies, and receiver type profiles for virtual calls. This data is cheap to collect and priceless later.

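Once a method has run roughly 2,000 times (see -XX:Tier3CompileThreshold, default 2,000, which counts invocations plus loop back-edges; a loop-free method can qualify earlier via -XX:Tier3InvocationThreshold, default 200), C1 compiles it quickly into native code with light optimizations. C1 is fast to compile and produces code about 2-5x faster than interpreted. But it keeps profiling.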

Once that same method hits roughly 15,000 invocations or its loop back-edges accumulate enough, C2 takes over. C2 spends significantly more time compiling — using the profiling data C1 collected — and produces code that rivals hand-written C. The key insight is that C2 can inline virtual method calls because the profile told it 'this call site always receives a HashMap, never anything else.' It bets on that. If it's wrong, it deoptimizes.
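Threshold defaults differ between JDK versions and can be overridden on the command line, so it's worth checking what your own JVM is actually using. Here is a small sketch that reads the tiered flags through HotSpotDiagnosticMXBean. It assumes a HotSpot-based JVM; the class name TierThresholds and the "n/a" fallback are illustrative choices, not anything the JVM defines.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

class TierThresholds {
    /** Returns the current value of a HotSpot flag, or "n/a" if this JVM doesn't have it. */
    static String flag(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            return "n/a"; // not a HotSpot-style JVM
        }
        try {
            return bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException e) {
            return "n/a"; // flag does not exist in this build
        }
    }

    public static void main(String[] args) {
        String[] names = {
            "TieredCompilation",
            "Tier3InvocationThreshold",
            "Tier3CompileThreshold",
            "Tier4InvocationThreshold",
            "Tier4CompileThreshold"
        };
        for (String name : names) {
            System.out.printf("%-26s = %s%n", name, flag(name));
        }
    }
}
```

Running it with and without your production flags is a cheap way to confirm that a tuning change took effect.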

TieredCompilationDemo.java (Java)

import java.util.HashMap;
import java.util.Map;

public class TieredCompilationDemo {
    private static final Map<String, Integer> wordFrequency = new HashMap<>();
    static {
        wordFrequency.put("java", 42);
        wordFrequency.put("jit", 99);
        wordFrequency.put("compiler", 7);
    }

    private static int lookupFrequency(String word) {
        Integer frequency = wordFrequency.get(word);
        return (frequency != null) ? frequency : 0;
    }

    public static void main(String[] args) throws InterruptedException {
        String[] wordsToLookup = {"java", "jit", "compiler", "unknown"};
        int totalIterations = 500_000;
        int batchSize = totalIterations / 5;

        for (int batch = 0; batch < 5; batch++) {
            long startNanos = System.nanoTime();
            for (int i = 0; i < batchSize; i++) {
                String word = wordsToLookup[i % wordsToLookup.length];
                int freq = lookupFrequency(word);
                if (freq < 0) {
                    System.out.println("Negative frequency — impossible but prevents DCE");
                }
            }
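            long elapsedNanos = System.nanoTime() - startNanos;
            long elapsedMs = elapsedNanos / 1_000_000;
            // Compute throughput from nanoseconds so a sub-millisecond batch cannot divide by zero.
            double throughput = batchSize * 1_000_000_000.0 / Math.max(elapsedNanos, 1);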
            System.out.printf("Batch %d: %,d lookups in %d ms → %,.0f lookups/sec%n", batch + 1, batchSize, elapsedMs, throughput);
            Thread.sleep(50);
        }
    }
}

Output

Batch 1: 100,000 lookups in 48 ms → 2,083,333 lookups/sec

Batch 2: 100,000 lookups in 12 ms → 8,333,333 lookups/sec

Batch 3: 100,000 lookups in 4 ms → 25,000,000 lookups/sec

Batch 4: 100,000 lookups in 3 ms → 33,333,333 lookups/sec

Batch 5: 100,000 lookups in 3 ms → 33,333,333 lookups/sec

💡Pro Tip: Dead Code Elimination Will Fool Your Benchmarks

The JIT is smart enough to detect when a computation's result is never observed and delete the entire computation. Every microbenchmark that doesn't consume its output is measuring nothing. Use JMH's Blackhole.consume() or, at minimum, accumulate results into a variable you print at the end. The code above uses the 'freq < 0' trick — crude but effective for demos.

📊 Production Insight

In production, the first request to a new service often hits only interpreted code.

If your SLA requires sub-50ms response from the first request, cold interpreted code is your enemy.

Rule: always pre-warm with realistic traffic before admitting real users.

🎯 Key Takeaway

Tiered compilation is a deliberate trade-off: fast startup vs peak throughput.

C2 waits for enough profile data before committing to aggressive optimizations.

Don't disable tiered compilation unless you've measured the warmup cost.

When to Expect Each Compilation Tier

If: Method called fewer than 2,000 times → Use: Pure interpretation, expect ~10-50x slower than native

If: Method called 2,000–15,000 times → Use: C1 compiled, 2-5x faster than interpreted, but still profiling

If: Method called >15,000 times or loop back-edge high → Use: C2 compiled, near-native speed, speculative optimizations active

Speculative Optimization and Deoptimization: The JIT's Calculated Gamble

The most powerful and most misunderstood JIT technique is speculative optimization. The C2 compiler doesn't just optimize what it knows to be true — it optimizes what the profiling data suggests is almost always true, then installs a guard that triggers deoptimization if that assumption is violated.

Consider a polymorphic call site: animal.speak() where Animal is an interface. If the profile says 99.9% of calls see a Dog object, C2 inlines Dog.speak() directly at that call site, eliminating the virtual dispatch entirely. It inserts a type check guard: 'if this isn't a Dog, bail out.' When a Cat suddenly arrives, the JIT traps that guard, tosses out the compiled code for that method, and drops back to interpreter mode — this is deoptimization.
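To make the guard concrete, here is a hand-written Java analogy of what the speculated code behaves like. This is only an illustration: the real guard and fallback live in generated machine code, not in Java source, and the names GuardedInlineSketch, speakSpeculated, and slowPathHits are invented for this sketch.

```java
class GuardedInlineSketch {
    interface Animal { String speak(); }
    static final class Dog implements Animal { public String speak() { return "Woof"; } }
    static final class Cat implements Animal { public String speak() { return "Meow"; } }

    // Counts how often the guard failed; stands in for a deoptimization event.
    static int slowPathHits = 0;

    /** What the source says: an ordinary interface call. */
    static String speakPlain(Animal animal) {
        return animal.speak();
    }

    /** Roughly how the compiled code behaves after the profile saw only Dog. */
    static String speakSpeculated(Animal animal) {
        if (animal.getClass() == Dog.class) { // type guard
            return "Woof";                    // body of Dog.speak(), inlined
        }
        slowPathHits++;                       // guard failed: the real JVM would deoptimize here
        return animal.speak();                // fall back to full dispatch
    }

    public static void main(String[] args) {
        System.out.println(speakSpeculated(new Dog()));
        System.out.println(speakSpeculated(new Cat()));
        System.out.println("Guard failures: " + slowPathHits);
    }
}
```

The fast path costs one comparison instead of a dispatch, which is why the bet pays off as long as the guard almost never fails.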

Deoptimization is not catastrophic in isolation, but watch for these triggers in production: loading a new class that invalidates a 'this class has no subclasses' assumption (ClassLoading deopt), a null being seen at a previously non-null call site, or hitting a branch that was never taken during profiling. Each deopt event forces recompilation, and if they happen in a tight loop during peak traffic, you'll see latency spikes that look identical to GC pauses but won't show up in GC logs.

You can observe deopt events with -XX:+UnlockDiagnosticVMOptions -XX:+TraceDeoptimization or, on JDK 17 and later, -Xlog:deoptimization=debug. Every senior Java engineer should spend a day reading these logs in a staging environment.

DeoptimizationTriggerDemo.java (Java)

public class DeoptimizationTriggerDemo {
    interface Greeter {
        String greet(String name);
    }
    static class FriendlyGreeter implements Greeter {
        @Override
        public String greet(String name) {
            return "Hey there, " + name + "!";
        }
    }
    static class FormalGreeter implements Greeter {
        @Override
        public String greet(String name) {
            return "Good day, " + name + ". How do you do?";
        }
    }

    private static String performGreeting(Greeter greeter, String name) {
        return greeter.greet(name);
    }

    public static void main(String[] args) throws InterruptedException {
        Greeter friendlyGreeter = new FriendlyGreeter();
        Greeter formalGreeter   = new FormalGreeter();

        System.out.println("=== Phase 1: Warming up with monomorphic call site ===");
        long sumLength = 0;
        for (int i = 0; i < 100_000; i++) {
            String result = performGreeting(friendlyGreeter, "Alice");
            sumLength += result.length();
        }
        System.out.printf("Phase 1 complete. Total chars processed: %,d%n%n", sumLength);
        Thread.sleep(200);

        System.out.println("=== Phase 2: Introducing second type — watch for deoptimization ===");
        sumLength = 0;
        for (int i = 0; i < 50_000; i++) {
            Greeter active = (i % 2 == 0) ? friendlyGreeter : formalGreeter;
            String result = performGreeting(active, "Bob");
            sumLength += result.length();
        }
        System.out.printf("Phase 2 complete. Total chars processed: %,d%n", sumLength);
        System.out.println("\nCheck your console above for deoptimization log output.");
        System.out.println("Look for reason codes such as 'class_check' or 'bimorphic'.");
    }
}

Output

Phase 1 complete. Total chars processed: 1,700,000

Phase 2 complete. Total chars processed: 1,100,000

⚠ Watch Out: Class Loading in Production Causes Silent Deoptimization

If your app lazily loads plugin classes or deserializes new types at runtime, every previously compiled method that assumed 'this abstract class has only one implementation' will deoptimize. In microservices this often hits during the first request after a dependency is initialized. Use eager class loading in startup probes and pre-warm by replaying a representative traffic sample before marking a pod healthy.
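One way to do the eager loading suggested here is a small preloader that runs before the readiness probe reports healthy. This is a minimal sketch: the class name ClassPreloader is made up, and the list of class names is something you would have to supply yourself (for example, from a class-loading log captured in staging).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ClassPreloader {
    /**
     * Loads and initializes each named class up front.
     * Returns the names that could not be loaded so the caller can log them.
     */
    static List<String> preload(List<String> classNames, ClassLoader loader) {
        List<String> missing = new ArrayList<>();
        for (String name : classNames) {
            try {
                Class.forName(name, true, loader); // true = run static initializers now
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // "com.example.plugins.ReportPlugin" is a hypothetical name used for illustration.
        List<String> names = Arrays.asList(
                "java.util.concurrent.ConcurrentHashMap",
                "com.example.plugins.ReportPlugin");
        System.out.println("Could not preload: "
                + preload(names, ClassPreloader.class.getClassLoader()));
    }
}
```

Loading the classes early means any hierarchy-based assumptions are invalidated before the JIT has compiled much, rather than under peak traffic.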

📊 Production Insight

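A deoptimization storm during peak traffic can look like a GC problem, even though the GC logs stay clean.

CPU spikes come from recompilation; latency spikes come from execution dropping back to the interpreter.

Enable deoptimization logging (for example -XX:+UnlockDiagnosticVMOptions -XX:+TraceDeoptimization) and correlate the events with class loading.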

🎯 Key Takeaway

Speculative optimization is what makes JIT fast. Deoptimization is the safety net.

The danger is not one deopt — it's a chain reaction of them.

Profile your application's class loading patterns: they're the #1 cause of silent deoptimization.

What the JIT Actually Inlines — And Why Inlining Is the Master Optimization

Experienced engineers know 'inlining' is good, but few can articulate why it's the master optimization that enables all others. Here's the mechanism: when the JIT inlines a called method into its caller, the combined code body is now visible to the optimizer as a single unit. Constants propagate across the former call boundary, dead branches get eliminated, allocations can be stack-allocated (scalar replaced) instead of heap-allocated, and loop invariants can be hoisted. Without inlining, each of these is blocked by the opacity of the call.
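A tiny hand-worked example shows why the call boundary matters. Below, asWritten passes a constant flag into a helper; afterInlining is what the optimizer can effectively reduce it to once the helper's body is visible, because the constant propagates and the rounding branch becomes dead. This is a manual illustration of the effect, not actual JIT output, and the names are invented.

```java
class InliningEffectSketch {
    /** Helper with a branch that depends on a flag argument. */
    static double area(double radius, boolean round) {
        double raw = Math.PI * radius * radius;
        if (round) {
            return Math.rint(raw);
        }
        return raw;
    }

    /** As written: the constant 'false' is hidden behind a call boundary. */
    static double asWritten(double radius) {
        return area(radius, false);
    }

    /**
     * After inlining: 'round' is known to be false, so the branch
     * and the Math.rint call are eliminated.
     */
    static double afterInlining(double radius) {
        return Math.PI * radius * radius;
    }

    public static void main(String[] args) {
        System.out.println(asWritten(2.0) == afterInlining(2.0));
    }
}
```

Without inlining, the compiled area method has to keep the branch because some other caller might pass true.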

The JIT decides what to inline based on three factors: method size (bytecode size, controlled by -XX:MaxInlineSize, default 35 bytes, and -XX:FreqInlineSize, default 325 bytes for hot call sites), call frequency from the profile, and call chain depth. Getters, setters, and small utility methods almost always get inlined. Methods that exceed the applicable size limit won't, even if they're blazing hot. This is a common performance trap.

The practical consequence: your method boundaries matter for JIT performance in ways that have nothing to do with code organization. At a call site that isn't hot enough to earn the FreqInlineSize budget, a method with 36 bytes of bytecode might not inline where a 34-byte version would. You can verify inlining decisions with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining (the unlock flag must come first). Look for messages such as 'callee is too large', 'too big', or 'hot method too big'. Those are your inlining failures.

InliningThresholdDemo.java (Java)

public class InliningThresholdDemo {
    private static double calculateCircleArea(double radius) {
        return Math.PI * radius * radius;
    }

    private static double calculateCircleAreaVerbose(double radius) {
        double pi = Math.PI;
        double radiusSquared = radius * radius;
        double rawArea = pi * radiusSquared;
        double roundedArea = Math.round(rawArea * 1_000_000.0) / 1_000_000.0;
        if (Double.isNaN(roundedArea) || Double.isInfinite(roundedArea)) {
            throw new ArithmeticException("Invalid radius produced non-finite area: " + radius);
        }
        return roundedArea;
    }

    private static long benchmarkSmallMethod(int iterations) {
        double accumulator = 0.0;
        long startNanos = System.nanoTime();
        for (int i = 1; i <= iterations; i++) {
            double radius = i * 0.001;
            accumulator += calculateCircleArea(radius);
        }
        long elapsedNanos = System.nanoTime() - startNanos;
        System.out.printf("  Small method total area sum: %.2f%n", accumulator);
        return elapsedNanos;
    }

    private static long benchmarkVerboseMethod(int iterations) {
        double accumulator = 0.0;
        long startNanos = System.nanoTime();
        for (int i = 1; i <= iterations; i++) {
            double radius = i * 0.001;
            accumulator += calculateCircleAreaVerbose(radius);
        }
        long elapsedNanos = System.nanoTime() - startNanos;
        System.out.printf("  Verbose method total area sum: %.2f%n", accumulator);
        return elapsedNanos;
    }

    public static void main(String[] args) {
        int warmupIterations = 200_000;
        int benchIterations  = 2_000_000;
        System.out.println("Warming up JIT (both methods to Tier 4)...");
        benchmarkSmallMethod(warmupIterations);
        benchmarkVerboseMethod(warmupIterations);
        System.out.println("\n--- Benchmark (" + benchIterations + " iterations each) ---");
        long smallNanos   = benchmarkSmallMethod(benchIterations);
        long verboseNanos = benchmarkVerboseMethod(benchIterations);
        System.out.printf("%n  Small method (likely inlined):   %,d ms%n", smallNanos / 1_000_000);
        System.out.printf("  Verbose method (may not inline): %,d ms%n", verboseNanos / 1_000_000);
        System.out.printf("  Overhead factor: %.2fx%n", (double) verboseNanos / smallNanos);
        System.out.println("\nCheck PrintInlining output for '@ X callee is too large' to confirm.");
    }
}

Output

Small method (likely inlined): 8 ms

Verbose method (may not inline): 31 ms

Overhead factor: 3.87x

🔥Interview Gold: Why Getters Should Be Tiny

This is the real engineering reason to keep getters and utility methods concise — it's not style, it's JIT physics. A getter that's 5 bytecodes inlines everywhere it's called. Add a null check, a log statement, and a metrics increment and it might cross MaxInlineSize. Now every call to it carries the overhead of a real method call plus it blocks all cross-boundary optimizations in the caller. Profile first, but understand why the boundary matters.
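A common way to keep a hot accessor small is to move the rare, bulky path into its own method. The sketch below shows the shape of that refactoring; the class HotColdSplit and its methods are illustrative names, and whether it changes inlining in your code is something to confirm with the inlining log rather than assume.

```java
class HotColdSplit {
    private final int[] values;

    HotColdSplit(int[] values) {
        this.values = values;
    }

    /** Hot path: a range test and an array load, so the bytecode stays tiny. */
    int get(int index) {
        if (index < 0 || index >= values.length) {
            return reportBadIndex(index); // rare path lives in a separate method
        }
        return values[index];
    }

    /** Cold path: the string building stays out of the hot method's bytecode. */
    private int reportBadIndex(int index) {
        throw new IndexOutOfBoundsException(
                "index " + index + " is outside 0.." + (values.length - 1));
    }

    public static void main(String[] args) {
        HotColdSplit data = new HotColdSplit(new int[] {10, 20, 30});
        System.out.println(data.get(1));
    }
}
```

The error message costs the same when it fires, but it no longer counts against the size of the method every caller wants inlined.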

📊 Production Insight

With a 35-byte MaxInlineSize, a single extra null check can push a method out of inline range at call sites that aren't hot enough for the larger FreqInlineSize budget.

That can be the difference between a hot path running at 50M ops/sec vs 10M ops/sec.

Monitor inlining decisions with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining to catch regressions.

🎯 Key Takeaway

Inlining is the master enabler: it turns separate method calls into one optimizable unit.

Your method's bytecode size is a performance interface — keep it small and focused.

A method that doesn't inline blocks all cross-boundary optimizations (constant propagation, escape analysis).

Production JIT Gotchas: Warmup Strategies, OSR, and the Flags That Actually Matter

On-Stack Replacement (OSR) is a JIT feature you've almost certainly benefited from without knowing its name. Normally, a method is compiled and the next invocation runs the compiled version. But what about a method with a loop that runs for ten million iterations in a single call? Without OSR, you'd interpret all ten million iterations because the method never returns to get recompiled. OSR solves this by replacing the executing method frame mid-execution — the JIT compiles the method while it runs and swaps the stack frame to the compiled version at a loop back-edge. OSR-compiled code is slightly less optimal than normal JIT-compiled code because the frame layout must match the interpreter's at the replacement point, limiting some optimizations.

For microservices and serverless, warmup is an existential problem. Your JIT hasn't seen enough traffic to compile the hot paths, so your first thousand requests are slow, potentially violating SLAs. Three production strategies work: (1) Replay-based warmup using recorded traffic replayed at startup before the instance joins the load balancer. (2) Ahead-of-time profile injection using GraalVM's PGO (Profile-Guided Optimization), which serializes profiles from a training run; CDS (Class Data Sharing) complements it by cutting class-loading time, although a classic CDS archive does not carry JIT profiles. (3) JVM flag tuning: lowering the tiered thresholds (for example -XX:Tier3InvocationThreshold and -XX:Tier4InvocationThreshold, whose defaults are 200 and 5,000) makes compilation start sooner at the cost of compiling with less profile data, which means slightly less optimal code but faster warmup. Note that -XX:CompileThreshold only takes effect when tiered compilation is disabled.
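Replay-based warmup can be as simple as pushing recorded requests through the real handler a few times and only then flipping a readiness flag. Here is a minimal sketch of that idea. The class ReplayWarmup, the generic handler, and the number of rounds are all assumptions for illustration; a real service would wire isReady() into its health endpoint and pick the round count by measuring.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;

class ReplayWarmup {
    private final AtomicBoolean ready = new AtomicBoolean(false);

    /**
     * Replays recorded requests through the real handler, then marks the instance ready.
     * Returns a checksum of the responses so the work cannot be optimized away.
     */
    <Q, R> long warmUp(List<Q> recorded, Function<Q, R> handler, int rounds) {
        long sink = 0;
        for (int round = 0; round < rounds; round++) {
            for (Q request : recorded) {
                R response = handler.apply(request);
                sink += (response == null) ? 0 : response.hashCode();
            }
        }
        ready.set(true);
        return sink;
    }

    /** A readiness probe would return this value. */
    boolean isReady() {
        return ready.get();
    }

    public static void main(String[] args) {
        ReplayWarmup warmup = new ReplayWarmup();
        // Hypothetical recorded requests and handler, for demonstration only.
        List<String> recorded = Arrays.asList("GET /users/1", "GET /users/2", "GET /orders/9");
        warmup.warmUp(recorded, (String request) -> request.toUpperCase(), 5_000);
        System.out.println("Ready: " + warmup.isReady());
    }
}
```

The important design choice is that the replay goes through the same handler code and the same request mix as production, so the profiles the JIT collects match what it will see later.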

GraalVM Native Image takes the opposite trade: it compiles everything AOT using Substrate VM, eliminating warmup entirely at the cost of peak throughput (no runtime profiles) and dynamic class loading.

OsrAndWarmupDemo.java (Java)

public class OsrAndWarmupDemo {
    private static double longRunningSetup(int iterationCount) {
        double runningTotal = 0.0;
        for (int i = 1; i <= iterationCount; i++) {
            runningTotal += Math.sqrt(i) * Math.log1p(i);
            if (i == 10_000) System.out.println("  [iteration 10,000] — JIT likely compiling this method NOW via OSR");
            if (i == 50_000) System.out.println("  [iteration 50,000] — now running in OSR-compiled native code");
        }
        return runningTotal;
    }

    private static void simulateWarmupPhase() {
        System.out.println("\n=== Warmup Phase ===");
        for (int warmupRound = 0; warmupRound < 5; warmupRound++) {
            double result = longRunningSetup(20_000);
            System.out.printf("  Warmup round %d result: %.2f%n", warmupRound + 1, result);
        }
        System.out.println("Warmup complete — instance ready to serve traffic.\n");
    }

    public static void main(String[] args) {
        System.out.println("=== Phase 1: Single Long-Running Method Call (OSR Demo) ===");
        long startNanos = System.nanoTime();
        double osrResult = longRunningSetup(5_000_000);
        long osrElapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
        System.out.printf("OSR demo result: %.4f | Elapsed: %d ms%n", osrResult, osrElapsedMs);
        simulateWarmupPhase();
        System.out.println("=== Phase 2: Post-warmup benchmark (fully C2 compiled) ===");
        startNanos = System.nanoTime();
        double warmResult = longRunningSetup(5_000_000);
        long warmElapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
        System.out.printf("Warm result: %.4f | Elapsed: %d ms%n", warmResult, warmElapsedMs);
        System.out.printf("Speedup after full warmup: %.1fx%n", (double) osrElapsedMs / Math.max(warmElapsedMs, 1));
    }
}

Output (abridged; timings vary by machine, and the sum prints as a value near 1.1 × 10^11)

OSR demo result: <sum> | Elapsed: 312 ms

Warm result: <sum> | Elapsed: 89 ms

Speedup after full warmup: 3.5x

⚠ Watch Out: Microbenchmarks in main() Measure OSR, Not Peak Performance

When you write a benchmark loop directly in main(), the JIT compiles it via OSR — an inherently less-optimized compilation mode. Your benchmark results look worse than production reality because OSR-compiled code has constraints normal compilations don't. Always use JMH for Java microbenchmarks. JMH drives the method into normal (non-OSR) compiled state by invoking it via a framework harness that triggers standard compilation before the measurement window opens.

📊 Production Insight

If your first request after a cold start takes 10x longer than steady state, you're running interpreted or OSR-compiled code on the request path.

OSR-compiled loops can be ~20% slower than normally compiled tier-4 code because of frame layout constraints.

Pre-warm with training data or use AOT profiles to skip OSR entirely.

🎯 Key Takeaway

OSR lets long-running loops benefit from compilation without returning from the method.

But OSR-compiled code is less optimized — avoid it in benchmarks and hot loops.

The best warmup strategy depends on your container lifetime and SLA strictness.

Choosing Warmup Strategy

If: Short-lived containers (serverless, batch < 30s) → Use: AOT compilation (GraalVM Native Image), no warmup needed

If: Long-lived services with strict P99 SLA from first second → Use: Replay-based warmup: record traffic and replay before going live

If: Long-lived services with steady traffic, can tolerate 30s ramp-up → Use: Default JIT with standard thresholds, let it warm naturally

If: Experiencing latency spikes due to deoptimization on type changes → Use: Lower Tier4InvocationThreshold below its 5,000 default and enable AOT profile caching

JIT Profiling Internals: What Data the JVM Collects and How It Drives Optimizations

The JIT's effectiveness depends entirely on the quality of profiling data it collects during interpretation and C1-compiled execution. The JVM tracks four primary types of profiling data: invocation counters (number of times a method is called), back-edge counters (loop iterations), branch probabilities (taken/not taken for each conditional), and type profiles for every polymorphic call site (which concrete types are seen and how often).

The type profile is stored in a structure called the MethodData Object (MDO). For each call site, the MDO records up to -XX:TypeProfileWidth receiver types (default 2), which is enough to recognize monomorphic and bimorphic sites. Once more types show up than the profile has slots for, the extra receivers are only counted, not identified, so the site is treated as megamorphic: the JIT gives up on speculative inlining there and emits a regular virtual or interface dispatch instead.

You can list the methods a running JVM has compiled with jcmd <PID> Compiler.codelist, and dump the collected profiles with -XX:+UnlockDiagnosticVMOptions -XX:+PrintMethodData. This is invaluable when debugging why a hot method isn't being optimized the way you expect. For example, if a call site turns out to be megamorphic (three or more receiver types at the same site), no inline cache will save you. Redesign the code to reduce type variance at that point.
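One redesign that sometimes helps is splitting a megamorphic call site into several sites that each see a single type. The sketch below shows the idea with explicit type tests. It is only an illustration with invented names (CallSiteSplitting and the shape classes), and it trades readability for a profile the JIT can act on, so measure before adopting it.

```java
class CallSiteSplitting {
    interface Shape { double area(); }

    static final class Circle implements Shape {
        final double radius;
        Circle(double radius) { this.radius = radius; }
        public double area() { return Math.PI * radius * radius; }
    }

    static final class Square implements Shape {
        final double side;
        Square(double side) { this.side = side; }
        public double area() { return side * side; }
    }

    static final class Triangle implements Shape {
        final double base, height;
        Triangle(double base, double height) { this.base = base; this.height = height; }
        public double area() { return 0.5 * base * height; }
    }

    /** One call site that sees three receiver types: megamorphic. */
    static double totalSingleSite(Shape[] shapes) {
        double total = 0;
        for (Shape shape : shapes) {
            total += shape.area();
        }
        return total;
    }

    /** Each branch has its own call site, and each of those sees exactly one type. */
    static double totalSplitSites(Shape[] shapes) {
        double total = 0;
        for (Shape shape : shapes) {
            if (shape instanceof Circle) {
                total += ((Circle) shape).area();
            } else if (shape instanceof Square) {
                total += ((Square) shape).area();
            } else {
                total += shape.area(); // everything else still goes through dispatch
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Shape[] shapes = { new Circle(1.0), new Square(2.0), new Triangle(3.0, 4.0) };
        System.out.println(totalSingleSite(shapes) == totalSplitSites(shapes));
    }
}
```

Because Circle and Square are final, the casted calls have a single possible target, which makes them straightforward inlining candidates.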

One common production surprise: branch profiling is biased by warmup traffic. If your warmup phase uses different data distributions than production traffic, the branch probabilities recorded during profiling will be wrong, leading to mis-speculated code paths and more deoptimization when real traffic arrives.

profile_inspection.shSHELL

# List compiled methods (compile ID, tier, state, method name)
jcmd <PID> Compiler.codelist

# Print the MethodData for profiled methods (verbose); the unlock flag must come first
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintMethodData -jar app.jar 2>&1 | grep -E "(method|receiver type|profile)"

# Snapshot the compiled-method list every 10 seconds
while true; do
  jcmd <PID> Compiler.codelist > /tmp/jit_codelist_$(date +%s).txt
  sleep 10
done

# Check whether a specific method is compiled, and at which tier
jcmd <PID> Compiler.codelist | grep "method_name_here"
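The monomorphic / bimorphic / megamorphic distinction described above can be sketched as a toy profile with a fixed number of rows. This is a minimal model, not HotSpot's MethodData layout: the class name `TypeProfileSketch` and its methods are hypothetical, and `PROFILE_WIDTH` simply mirrors the TypeProfileWidth default of 2.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of a per-call-site receiver type profile (hypothetical, not HotSpot code).
final class TypeProfileSketch {
    static final int PROFILE_WIDTH = 2; // assumption: mirrors -XX:TypeProfileWidth=2

    private final Map<Class<?>, Long> rows = new LinkedHashMap<>();
    private long overflow; // calls whose receiver type did not fit into a row

    void record(Object receiver) {
        Class<?> type = receiver.getClass();
        if (rows.containsKey(type) || rows.size() < PROFILE_WIDTH) {
            rows.merge(type, 1L, Long::sum); // known type, or a free row is available
        } else {
            overflow++;                      // profile is full: only count the miss
        }
    }

    String classify() {
        if (overflow > 0) return "megamorphic";
        switch (rows.size()) {
            case 0:  return "unreached";
            case 1:  return "monomorphic";
            default: return "bimorphic";
        }
    }

    public static void main(String[] args) {
        TypeProfileSketch site = new TypeProfileSketch();
        site.record("a");                    // String
        System.out.println(site.classify()); // monomorphic
        site.record(1);                      // Integer
        System.out.println(site.classify()); // bimorphic
        site.record(2.0);                    // Double: no row left
        System.out.println(site.classify()); // megamorphic
    }
}
```

Once the third type arrives the site stays megamorphic in this model, which is why reducing type variance at the call site matters more than any later tuning.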

Mental Model

Mental Model: The JIT as a Bayesian Optimizer

The JIT collects data like a scientist testing hypotheses — it forms beliefs about the program's runtime behavior and optimizes based on those beliefs.

Profiling is the data-gathering phase (interpreted run).
C1 is a quick experiment — compiles with cheap assumptions.
C2 is the confident theory — compiles based on rich profile data.
Deoptimization is discovering your belief was wrong — restart the cycle.

📊 Production Insight

Warmup traffic that doesn't match production distribution corrupts your profile data.

Branch probabilities stored during warmup may be 90/10 while production is 50/50.

This mis-speculation causes deoptimization exactly when real users arrive.

Use production-replayed traffic for warmup, not synthetic data.
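To make the 90/10 versus 50/50 mismatch concrete, here is a small sketch of a branch profile fed by two traffic mixes. Everything in it (the `BranchProfileSketch` class, the `simulate` helper, the sample counts) is an illustrative assumption; a real JVM stores these counters per bytecode in the MDO.

```java
// Toy branch profile: counts taken / not-taken outcomes for one conditional.
final class BranchProfileSketch {
    private long taken;
    private long notTaken;

    void record(boolean branchTaken) {
        if (branchTaken) taken++; else notTaken++;
    }

    double takenProbability() {
        long total = taken + notTaken;
        return total == 0 ? 0.0 : (double) taken / total;
    }

    // Feeds `samples` outcomes in which `takenPerTen` out of every 10 take the branch.
    static BranchProfileSketch simulate(int samples, int takenPerTen) {
        BranchProfileSketch profile = new BranchProfileSketch();
        for (int i = 0; i < samples; i++) {
            profile.record(i % 10 < takenPerTen);
        }
        return profile;
    }

    public static void main(String[] args) {
        BranchProfileSketch warmup = simulate(10_000, 9);     // synthetic warmup: 90/10
        BranchProfileSketch production = simulate(10_000, 5); // real traffic: 50/50
        double skew = Math.abs(warmup.takenProbability() - production.takenProbability());
        System.out.println("warmup taken probability:     " + warmup.takenProbability());
        System.out.println("production taken probability: " + production.takenProbability());
        System.out.println("skew the compiled code was built on: " + skew);
    }
}
```

A large skew means the compiled code laid out its hot and cold paths for traffic that never arrives, which is the mis-speculation described above.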

🎯 Key Takeaway

The JIT's power comes from profiling data — but only if that data is representative.

Type profiles are the most influential: keep polymorphic call sites monomorphic or bimorphic.

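Use -XX:+PrintMethodData and -XX:+PrintInlining (both behind -XX:+UnlockDiagnosticVMOptions) to inspect MDO data and inlining decisions: together they tell you why a method didn't inline.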

Why Java's JIT Compiler Is Not Optional — And Why It Took Over From Pure Interpreters

Every Java developer knows bytecode is platform-independent. That's the party trick. The dirty secret? Pure interpretation of that bytecode is catastrophically slow. Early JVMs proved that — a naive interpreter could be 10-100x slower than compiled C code. Not acceptable for anything beyond a calculator app.

The JIT compiler exists to bridge that gap without sacrificing portability. It watches which methods are hot — called frequently enough to justify the compilation cost — and then converts their bytecode into native machine code. Once compiled, the CPU executes that native code directly. No interpreter overhead. No repeated translation. Just raw speed.

Here's the critical distinction most tutorials gloss over: JIT compilation is not free. The compilation itself burns CPU cycles and memory. That's why the JVM is patient. It waits, collects profiling data, and only compiles when the payoff is real. This cost-benefit analysis is what separates a well-tuned JIT from a garbage one. Misconfigure it, and you'll spend more time compiling than executing.

InterpretedVsCompiled.pyPYTHON

# io.thecodeforge - cs-fundamentals tutorial

import time

def interpreted_hot_path(iterations):
    # Simulates bytecode interpretation overhead
    result = 0
    for i in range(iterations):
        # Each iteration 'interprets' the addition
        result += i * 3 - 2
    return result

# Simulate JIT: compile once, run many times
compiled_func = lambda n: sum(i * 3 - 2 for i in range(n))

# Warmup for JIT simulation
for _ in range(1000):
    compiled_func(100)

iterations = 5000000

start = time.perf_counter()
total = interpreted_hot_path(iterations)
print(f"Interpreted: {time.perf_counter() - start:.4f}s, result={total}")

start = time.perf_counter()
total = compiled_func(iterations)
print(f"Compiled:    {time.perf_counter() - start:.4f}s, result={total}")

Output (timings are illustrative and vary by machine)

Interpreted: 1.2345s, result=37499982500000

Compiled:    0.0891s, result=37499982500000

🔥Senior Shortcut:

If your Java app feels sluggish on cold start, don't blame the JIT. Blame the interpreter. The JIT only kicks in after a method passes its invocation threshold (roughly 2,000 for C1 and 15,000 for C2 with tiered compilation in HotSpot; the often-quoted 10,000 is the legacy non-tiered CompileThreshold). That's a feature, not a bug.

🎯 Key Takeaway

JIT compilation is a tax you pay for portability. The goal is to make the tax worth it by compiling only the code that matters.
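The cost-benefit argument above reduces to simple arithmetic: compiling pays off once the time saved per call, multiplied by the number of calls, exceeds the one-off compile cost. The sketch below works that out; the class name `JitBreakEvenSketch` and all the timings are hypothetical inputs, not measurements from any JVM.

```java
// Back-of-envelope model of the compile-or-not decision (hypothetical numbers).
final class JitBreakEvenSketch {
    /**
     * Calls needed before compilation pays for itself:
     * compileCost / (interpretedCost - compiledCost), rounded up.
     * Returns Long.MAX_VALUE when the compiled code is not faster.
     */
    static long breakEvenCalls(long compileCostNanos,
                               long interpretedNanosPerCall,
                               long compiledNanosPerCall) {
        long savedPerCall = interpretedNanosPerCall - compiledNanosPerCall;
        if (savedPerCall <= 0) {
            return Long.MAX_VALUE; // compiling would never pay off
        }
        return (compileCostNanos + savedPerCall - 1) / savedPerCall; // ceiling division
    }

    public static void main(String[] args) {
        // Assumed: 2 ms to compile, 500 ns per interpreted call, 50 ns per compiled call.
        System.out.println(breakEvenCalls(2_000_000, 500, 50)); // 4445
    }
}
```

A method called fewer times than the break-even count is cheaper to leave interpreted, which is the reasoning behind invocation thresholds.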

How the JIT Compiler Works: Profiling, Queuing, and the Compiler Threads You Didn't Know Existed

The JIT compiler doesn't just randomly compile methods. It's a disciplined, feedback-driven system. Here's the actual flow: when a method is invoked, the interpreter runs it and the JVM's profiling subsystem starts collecting metrics — invocation counts, loop back-edge counts, branch taken/not-taken ratios. These counters live in the method's metadata, not some separate log file.

Once the counters cross the relevant thresholds (the Tier3* and Tier4* flags under tiered compilation; the older CompileThreshold only applies when tiered compilation is off), the method gets queued for compilation. But here's the gotcha: compilation happens on dedicated compiler threads, not the application threads. HotSpot has separate queues and threads for C1 (client, quick compile) and C2 (server, aggressive optimization). If all compiler threads are busy, new requests wait in the queue, and those methods stay interpreted (or at their current tier) until a thread frees up.

This queue depth is a silent performance killer. I've seen production incidents where a burst of traffic flooded the compiler queue, causing interpreted execution to spike and response times to crater. The fix? Tune -XX:CICompilerCount and the tier thresholds (for example -XX:Tier3InvocationThreshold and -XX:Tier4InvocationThreshold). More threads for big workloads, lower thresholds for latency-sensitive apps. The defaults are sized from the CPU count, not from your workload.

The compilation itself is asynchronous. The application thread continues interpreting until the compiled code is ready. Then the JVM either switches a long-running loop over mid-execution via on-stack replacement (OSR), or simply uses the native version from the next invocation. That's why your first thousand requests are slow: they're being interpreted while the JIT is warming up.

CompilerQueueSimulation.pyPYTHON

# io.thecodeforge - cs-fundamentals tutorial

import time
import threading
from queue import Queue

# Simulate JIT compiler queue with limited workers
def compile_method(method_name, complexity):
    time.sleep(0.01 * complexity)  # Simulate compilation time
    print(f"[JIT] Compiled {method_name}")

compiler_queue = Queue()
max_workers = 2  # Simulate -XX:CICompilerCount

def compiler_worker():
    while True:
        method, complexity = compiler_queue.get()
        compile_method(method, complexity)
        compiler_queue.task_done()

# Start compiler threads
workers = []
for _ in range(max_workers):
    t = threading.Thread(target=compiler_worker, daemon=True)
    t.start()
    workers.append(t)

# Simulate methods being called and queued
hot_methods = [("processOrder", 5), ("validatePayment", 3), ("calculateTax", 4)]

print("Application starts — interpreting methods...")
for method, complexity in hot_methods:
    compiler_queue.put((method, complexity))
    print(f"[Queue] {method} enqueued (depth={compiler_queue.qsize()})")

# Wait for compilation to complete
compiler_queue.join()
print("All methods compiled — application now running native code.")

Output (the printed queue depths can vary with thread scheduling)

Application starts — interpreting methods...

[Queue] processOrder enqueued (depth=1)

[Queue] validatePayment enqueued (depth=2)

[Queue] calculateTax enqueued (depth=3)

[JIT] Compiled validatePayment

[JIT] Compiled processOrder

[JIT] Compiled calculateTax

All methods compiled — application now running native code.

⚠ Production Trap:

If the compile queue stays deep (check jcmd <PID> Compiler.queue), your app is spending too long in the interpreter. Monitor -XX:+PrintCompilation output. A large number of methods marked 'made not entrant' (or 'made zombie' on older JDKs) means you're thrashing the compiler, likely from too many dynamically generated classes.

🎯 Key Takeaway

The JIT compiler is a background worker with a queue. If the queue grows, your app slows. Tune compiler threads and thresholds based on your workload, not defaults.
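The "count, enqueue, keep running, swap in later" flow can be sketched in a few lines of Java. This is a toy model under stated assumptions: `AsyncCompileSketch`, its threshold, and the single `toy-compiler` thread are all hypothetical, and the "compiled" version is just another lambda standing in for native code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.IntUnaryOperator;

// Toy model: a method runs its "interpreted" version until a background
// "compiler" thread installs a replacement. Not how HotSpot is implemented.
final class AsyncCompileSketch {
    private final int threshold;
    private int invocations; // touched by the single caller thread only
    private volatile IntUnaryOperator code = x -> x * 3 - 2; // "interpreted" version
    private volatile boolean compiled;
    private volatile CompletableFuture<Void> pending = CompletableFuture.completedFuture(null);
    private final ExecutorService compilerThread = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "toy-compiler");
        t.setDaemon(true); // do not keep the JVM alive
        return t;
    });

    AsyncCompileSketch(int threshold) {
        this.threshold = threshold;
    }

    int call(int x) {
        if (++invocations == threshold) {
            // Enqueue and carry on: the caller never waits for the compiler.
            pending = CompletableFuture.runAsync(() -> {
                code = v -> v * 3 - 2; // stands in for the native version
                compiled = true;
            }, compilerThread);
        }
        return code.applyAsInt(x); // whichever version is installed right now
    }

    boolean isCompiled() { return compiled; }

    void awaitCompiler() { pending.join(); }

    public static void main(String[] args) {
        AsyncCompileSketch method = new AsyncCompileSketch(1_000);
        long sum = 0;
        for (int i = 0; i < 5_000; i++) {
            sum += method.call(i);
        }
        method.awaitCompiler();
        System.out.println("compiled=" + method.isCompiled() + " sum=" + sum);
    }
}
```

The caller's results are identical before and after the swap; only the speed of the installed version would differ in a real runtime.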

Tiered Compilation: V8 TurboFan, JVM C1/C2

Tiered compilation is a strategy used by modern JIT compilers to balance startup performance and peak throughput. Instead of compiling all code at the highest optimization level immediately, execution begins in an interpreter or a simple compiler, and hot methods are progressively recompiled with more aggressive optimizations.

The Java Virtual Machine (JVM) implements tiered compilation with two primary compilers: C1 (client) and C2 (server). C1 performs lightweight optimizations and compiles quickly, reducing warmup time. C2, on the other hand, applies extensive optimizations like inlining, loop unrolling, and escape analysis, but takes longer to compile. The JVM monitors method invocation counts and loop back-edge counts to decide when to upgrade from C1 to C2.

Similarly, V8, the JavaScript engine in Chrome and Node.js, uses a tiered system: it starts with an interpreter (Ignition) and then compiles hot functions with the baseline compiler (Sparkplug) and finally with the optimizing compiler (TurboFan). TurboFan performs speculative optimizations based on type feedback and can deoptimize if assumptions are violated. For example, a JavaScript function that always receives integers might be compiled assuming integer arithmetic, but if a string is passed, TurboFan deoptimizes to the interpreter.

This tiered approach ensures that code runs quickly during startup while still achieving near-native performance for long-running applications. In production, understanding tiered compilation helps in tuning warmup strategies, such as using -XX:TieredStopAtLevel to force a specific compilation level for testing.

tiered_compilation_example.javaJAVA

public class TieredExample {
    public static void main(String[] args) {
        long start = System.nanoTime();
        for (int i = 0; i < 100000; i++) {
            compute(i);
        }
        long end = System.nanoTime();
        System.out.println("Time: " + (end - start) / 1e6 + " ms");
    }

    static int compute(int x) {
        return x * 2 + 1;
    }
}

🔥Tiered Compilation in Action

📊 Production Insight

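In production, monitor compilation logs (-XX:+PrintCompilation shows the tier of each compile) to ensure hot methods reach C2. If they stall at C1, check for a full code cache first, then consider adjusting the Tier4 triggers (-XX:Tier4InvocationThreshold, -XX:Tier4CompileThreshold). Note that -XX:CompileThreshold is ignored while tiered compilation is enabled.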

🎯 Key Takeaway

Tiered compilation uses multiple compiler levels to optimize startup time and peak performance, with C1 for quick compilation and C2 for aggressive optimizations.
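How the counters turn into a tier change can be sketched as a predicate over invocation and back-edge counts. The version below is a simplified model: it ignores HotSpot's load-based scaling factor, the separate OSR back-edge thresholds, and tiers 1 and 2. The constants are the flag defaults as I understand them for current HotSpot builds (check yours with -XX:+PrintFlagsFinal); the class and method names are hypothetical. The ~2,000 and ~15,000 figures quoted in this guide correspond to the combined invocation-plus-back-edge thresholds here.

```java
// Simplified model of HotSpot's tiered thresholds (illustrative, not the real policy code).
final class TierDecisionSketch {
    // Assumed defaults of the corresponding -XX:Tier3* / -XX:Tier4* flags.
    static final int TIER3_INVOCATION = 200, TIER3_MIN_INVOCATION = 100, TIER3_COMPILE = 2_000;
    static final int TIER4_INVOCATION = 5_000, TIER4_MIN_INVOCATION = 600, TIER4_COMPILE = 15_000;

    // A method qualifies if it is called often enough on its own, or if it is
    // called a minimum number of times and calls plus loop iterations are high.
    static boolean crosses(long invocations, long backEdges, int inv, int minInv, int compile) {
        return invocations > inv || (invocations > minInv && invocations + backEdges > compile);
    }

    /** Returns 0 (interpreter), 3 (C1 with profiling) or 4 (C2). */
    static int nextTier(int currentTier, long invocations, long backEdges) {
        if (currentTier == 0
                && crosses(invocations, backEdges, TIER3_INVOCATION, TIER3_MIN_INVOCATION, TIER3_COMPILE)) {
            return 3;
        }
        if (currentTier == 3
                && crosses(invocations, backEdges, TIER4_INVOCATION, TIER4_MIN_INVOCATION, TIER4_COMPILE)) {
            return 4;
        }
        return currentTier;
    }

    public static void main(String[] args) {
        System.out.println(nextTier(0, 150, 0));      // 0: not hot yet
        System.out.println(nextTier(0, 150, 1_900));  // 3: few calls, but a hot loop
        System.out.println(nextTier(3, 4_000, 0));    // 3: still below the C2 trigger
        System.out.println(nextTier(3, 700, 14_500)); // 4: calls + back-edges exceed 15,000
    }
}
```

The two-part condition is why a rarely called method with a very hot loop still gets compiled: loop iterations count toward the combined threshold.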

AOT vs JIT vs Interpreter: Performance Comparison

Ahead-of-time (AOT) compilation, just-in-time (JIT) compilation, and interpretation represent three execution strategies with distinct trade-offs. Interpreters execute source code or bytecode directly without translation, offering fast startup and low memory overhead but poor peak performance. JIT compilers translate code at runtime, initially interpreting and then compiling hot paths to native code, achieving high peak performance at the cost of warmup time and memory for compiled code. AOT compilers pre-compile code to native binaries before execution, eliminating warmup entirely and providing consistent performance, but they lack runtime profiling and can miss platform-specific optimizations.

For example, a simple loop that sums integers: in an interpreter, each iteration incurs dispatch overhead; in a JIT, the loop is compiled to efficient machine code after a few iterations; in AOT, the loop is already native but may not be as optimized as JIT's adaptive techniques.

Consider a Java application: startup with the interpreter is immediate, but throughput is low. With JIT (C2), after warmup, throughput can be 10-100x higher. AOT (e.g., GraalVM native image) starts instantly and has consistent performance, but peak throughput may be 20-30% lower than JIT due to lack of profile-guided optimizations. For JavaScript, V8's JIT can optimize hot functions based on type feedback, while AOT compilation (e.g., with WebAssembly) sacrifices that adaptability.

In practice, the choice depends on use case: microservices with short-lived processes benefit from AOT's fast startup; long-running servers benefit from JIT's peak performance; scripting and prototyping favor interpreters. Modern runtimes often combine all three: e.g., JVM uses interpreter + C1 + C2, and GraalVM offers AOT as an alternative.

performance_comparison.pyPYTHON

# Baseline only: a plain CPython loop stands in for the interpreter case
import time

def compute(n):
    total = 0
    for i in range(n):
        total += i
    return total

# Interpreter-like (no optimization)
start = time.time()
result = compute(10**7)
print(f"Time: {time.time() - start:.2f}s")

⚠ Warmup Matters

📊 Production Insight

For latency-sensitive services, consider AOT to avoid JIT warmup pauses. For throughput-critical batch jobs, JIT's adaptive optimizations can yield significant gains after warmup.

🎯 Key Takeaway

AOT offers fast startup and consistent performance, JIT provides highest peak throughput with warmup, and interpreters prioritize low latency startup over execution speed.
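The "dispatch overhead per iteration" point can be illustrated by running the same summing loop two ways: through a tiny opcode-dispatch loop and as a direct Java loop. This is only a sketch: the opcodes and the `DispatchOverheadSketch` class are invented for illustration, and because the JVM will JIT-compile both versions, the timings show the cost of the dispatch structure rather than a true interpreter-versus-native gap.

```java
// Toy "bytecode" interpreter summing 0..n-1, next to the same loop written directly.
final class DispatchOverheadSketch {
    static final int LOAD_I = 0, ADD_TOTAL = 1, INC_I = 2, JUMP_IF_LESS = 3, HALT = 4;
    static final int[] PROGRAM = {LOAD_I, ADD_TOTAL, INC_I, JUMP_IF_LESS, HALT};

    static long interpret(int n) {
        long total = 0, acc = 0;
        int i = 0, pc = 0;
        while (true) {
            switch (PROGRAM[pc]) { // dispatch cost is paid for every instruction
                case LOAD_I:       acc = i;                     pc++; break;
                case ADD_TOTAL:    total += acc;                pc++; break;
                case INC_I:        i++;                         pc++; break;
                case JUMP_IF_LESS: pc = (i < n) ? 0 : pc + 1;         break;
                case HALT:         return total;
                default: throw new IllegalStateException("bad opcode at " + pc);
            }
        }
    }

    static long direct(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += i; // no dispatch: the loop body is the work itself
        }
        return total;
    }

    public static void main(String[] args) {
        int n = 10_000_000;
        long t0 = System.nanoTime();
        long a = interpret(n);
        long t1 = System.nanoTime();
        long b = direct(n);
        long t2 = System.nanoTime();
        System.out.println("dispatch loop: " + (t1 - t0) / 1_000_000 + " ms, result=" + a);
        System.out.println("direct loop:   " + (t2 - t1) / 1_000_000 + " ms, result=" + b);
    }
}
```

Both paths return the same sum; the dispatch version simply executes several bookkeeping steps for each unit of real work.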

Deoptimization: How JITs Recover from Incorrect Assumptions

Deoptimization is a mechanism that allows a JIT compiler to revert from optimized native code back to a less optimized state (often the interpreter) when the assumptions made during compilation are invalidated. This is crucial for speculative optimizations, where the compiler assumes certain runtime conditions (e.g., a variable is always an integer, a method is monomorphic, or a class is never subclassed). If those assumptions break, continuing in the compiled code could produce incorrect results, so the JVM or V8 must deoptimize at a safe execution point.

For example, in Java, the JIT may inline a virtual method call assuming only one implementation exists. If a new subclass is loaded later, the compiled code is invalidated, and the execution transfers to the interpreter at a deoptimization point. This process involves reconstructing the interpreter state from the optimized code's state, which can be expensive. In V8, TurboFan uses type feedback to optimize JavaScript functions. If a function is called with a new type, TurboFan deoptimizes and recompiles with updated feedback.

Deoptimization can cause significant latency spikes, up to 250x in extreme cases, when class loading triggers deoptimization across many methods. Execution continues in the interpreter while the affected methods are re-profiled and recompiled in the background, so the slowdown lasts until the hot paths are compiled again. In production, monitoring deoptimization events (e.g., with -Xlog:deoptimization=debug on recent JDKs, or the JFR jdk.Deoptimization event) helps identify problematic assumptions. For instance, if a hot method frequently deoptimizes due to class loading, consider using -XX:CompileCommand to exclude it from compilation or restructure code to reduce polymorphism. Understanding deoptimization is key to writing JIT-friendly code: avoid megamorphic call sites, use final classes/methods, and minimize dynamic class loading in hot paths.

deoptimization_example.javaJAVA

public class DeoptExample {
    interface Shape { int area(); }
    static class Circle implements Shape {
        int r;
        Circle(int r) { this.r = r; }
        public int area() { return (int)(Math.PI * r * r); }
    }
    static class Square implements Shape {
        int s;
        Square(int s) { this.s = s; }
        public int area() { return s * s; }
    }
    public static void main(String[] args) {
        Shape shape = new Circle(5);
        long sum = 0;
        for (int i = 0; i < 100000; i++) {
            sum += shape.area(); // monomorphic call site
        }
        // Later, introduce a new type
        shape = new Square(4);
        for (int i = 0; i < 100000; i++) {
            sum += shape.area(); // now bimorphic: the single-type speculation fails, may deoptimize
        }
        System.out.println(sum);
    }
}

💡Avoid Megamorphic Call Sites

📊 Production Insight

Monitor deoptimization events in production. If frequent, consider using -XX:CompileCommand='dontinline,<method>' to prevent problematic inlining, or restructure code to reduce polymorphism.

🎯 Key Takeaway

Deoptimization recovers from incorrect speculative optimizations by reverting to the interpreter, but it can cause severe latency spikes, especially during class loading.
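What a speculative fast path with a guard looks like can be modeled by hand. The sketch below is an analogy, not generated JIT code: `GuardedCallSketch`, its `speculationValid` flag, and the deopt counter are hypothetical, and the Shape types mirror the example above.

```java
// Hand-written analogy of a speculatively inlined call site with a type guard.
final class GuardedCallSketch {
    interface Shape { int area(); }

    static final class Circle implements Shape {
        final int r;
        Circle(int r) { this.r = r; }
        public int area() { return (int) (Math.PI * r * r); }
    }

    static final class Square implements Shape {
        final int s;
        Square(int s) { this.s = s; }
        public int area() { return s * s; }
    }

    private boolean speculationValid = true; // "compiled code assumes Circle only"
    private int deoptCount;

    int area(Shape shape) {
        if (speculationValid) {
            if (shape.getClass() == Circle.class) {  // the guard
                Circle c = (Circle) shape;
                return (int) (Math.PI * c.r * c.r);  // "inlined" body, no virtual dispatch
            }
            // Guard failed: throw away the speculation and fall back from now on.
            speculationValid = false;
            deoptCount++;
        }
        return shape.area(); // generic path: ordinary virtual call
    }

    int deoptCount() { return deoptCount; }

    public static void main(String[] args) {
        GuardedCallSketch site = new GuardedCallSketch();
        System.out.println(site.area(new Circle(5))); // 78, fast path
        System.out.println(site.area(new Square(4))); // 16, guard fails once
        System.out.println(site.area(new Circle(5))); // 78, generic path now
        System.out.println("deopts=" + site.deoptCount()); // deopts=1
    }
}
```

The answers stay correct throughout; what changes after the guard fails is that every later call takes the slower generic path until the runtime recompiles with the new profile.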

● Production incident · POST-MORTEM · severity: high

Deoptimization Storm After Class Loading

Symptom

P99 latency jumped from 20ms to 5,000ms during the first request after a new class was loaded. No GC pauses, no CPU saturation.

Assumption

The team assumed it was a GC tuning problem or a database connection pool exhaustion.

Root cause

The JIT had speculatively inlined a virtual method call based on a monomorphic type profile. When a second implementation of that interface was loaded, every compiled method that had inlined the single type was immediately 'made not entrant' — forcing deoptimization and recompilation of hundreds of methods at once. This caused a stampede of compilation threads and degraded performance until the new profiles were re-established.

Fix

Changed the class loading strategy to eagerly load all plugin implementations during application startup (before accepting traffic). Also lowered Tier4CompileThreshold from its default of 15,000 to 5,000 to speed up recompilation after deoptimization (Tier4InvocationThreshold already defaults to 5,000). Added a warmup phase that invoked all known implementations before marking the pod healthy.

Key lesson

Deoptimization is not a failure — it's a safety net. But a storm of them will kill your latency.
Eager class loading at startup prevents type-profile-based deoptimization during peak traffic.
Monitor deoptimization events in staging (-Xlog:deoptimization=debug on recent JDKs, or 'made not entrant' lines from -XX:+PrintCompilation) to catch class-loading patterns before they hit production.
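The eager-loading part of the fix can be sketched with Class.forName. This is a minimal illustration, not the team's actual code: `EagerClassLoadingSketch`, `preload`, and the plugin class names are hypothetical, and a real service would take the list from configuration or ServiceLoader.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: load and initialize every known implementation before taking traffic.
final class EagerClassLoadingSketch {
    /** Loads and initializes each named class; returns the names that failed. */
    static List<String> preload(List<String> classNames, ClassLoader loader) {
        List<String> failed = new ArrayList<>();
        for (String name : classNames) {
            try {
                Class.forName(name, true, loader); // true = run static initializers now
            } catch (ClassNotFoundException | LinkageError e) {
                failed.add(name); // report instead of failing later, under load
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Hypothetical plugin list: one class that exists, one that does not.
        List<String> plugins = List.of("java.util.ArrayList", "com.example.MissingPlugin");
        List<String> failed = preload(plugins, EagerClassLoadingSketch.class.getClassLoader());
        System.out.println("failed to preload: " + failed); // [com.example.MissingPlugin]
    }
}
```

Loading alone does not build type profiles, so this belongs before a warmup phase that actually invokes each implementation, as in the fix above.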

Production debug guide

Symptom → Action mapping for JIT-related production issues (4 entries)

Symptom · 01

Application gets faster after 30+ seconds of traffic

→

Fix

That's expected warmup. Check if the warmup plateau meets your SLA. If not, implement pre-warm with replayed traffic or AOT-compiled profiles.

Symptom · 02

Latency spikes with no GC activity

→

Fix

Enable -XX:+PrintCompilation, plus deoptimization logging (-Xlog:deoptimization=debug on recent JDKs, or the JFR jdk.Deoptimization event). Look for 'made not entrant' lines. Correlate with class loading events or type profile changes.

Symptom · 03

Benchmark results in main() show 2-5x slower than expected

→

Fix

Your benchmark is suffering from OSR compilation and dead-code elimination. Rewrite it using JMH with Blackhole.consume(). Avoid writing benchmarks in main() loops.

Symptom · 04

Methods not inlining even though they're hot

→

Fix

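Check -XX:+PrintInlining (it needs -XX:+UnlockDiagnosticVMOptions). Look for 'callee is too large' or 'hot method too big' messages. A hot callee is rejected when its bytecode size exceeds FreqInlineSize (default 325 bytes on most platforms); the much smaller MaxInlineSize (default 35 bytes) applies to call sites that are not hot. Split the method or increase the threshold after measuring.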

★ JIT Troubleshooting Cheat Sheet

Quick commands to diagnose JIT compilation and deoptimization in production

Suspected deoptimization storm

Immediate action

Check if class loading is happening during requests

Commands

jcmd <PID> Compiler.codelist

jcmd <PID> Compiler.queue

Fix now

Add -XX:+PrintCompilation and -Xlog:deoptimization=debug (recent JDKs) to the JVM flags and restart. Ensure class loading happens before traffic.

Methods not reaching peak throughput

Warmup too slow for SLA

Execution Model Comparison

Aspect	Interpreter	JIT (C1/C2 Tiered)	AOT (GraalVM Native)
Startup latency	Instant start, slow execution	Fast start, warming over ~10k invocations	Instant start, instant peak speed
Peak throughput	~10-50x slower than native	Near-native (within 5-20% of C)	Good but below JIT peak — no runtime profiles
Memory overhead	Low (no compiled code cache)	JIT code cache: typically 64-256 MB	Lowest — binary includes only reachable code
Dynamic class loading	Full support	Full support	Not supported — closed-world assumption
Profile-guided opts	None	Full — type profiles, branch frequencies	Partial — requires offline PGO training run
Deoptimization	N/A — nothing to deopt	Yes — on assumption violations	N/A — static binary, no speculative opts
Reflection support	Full	Full	Partial — requires config hints at build time
Ideal workload	Short scripts, startup-critical CLIs	Long-running services, throughput servers	Serverless, CLIs, latency-sensitive cold starts
Debugging/profiling	Easy	Moderate — async-profiler recommended	Hard — limited runtime introspection

⚙ Quick Reference

10 commands from this guide

File	Command / Code	Purpose
TieredCompilationDemo.java	public class TieredCompilationDemo {	The JIT Pipeline
DeoptimizationTriggerDemo.java	public class DeoptimizationTriggerDemo {	Speculative Optimization and Deoptimization
InliningThresholdDemo.java	public class InliningThresholdDemo {	What the JIT Actually Inlines
OsrAndWarmupDemo.java	public class OsrAndWarmupDemo {	Production JIT Gotchas
profile_inspection.sh	jcmd <PID> Compiler.codelist	JIT Profiling Internals
InterpretedVsCompiled.py	def interpreted_hot_path(iterations):	Why Java's JIT Compiler Is Not Optional
CompilerQueueSimulation.py	from queue import Queue	How the JIT Compiler Works
tiered_compilation_example.java	public class TieredExample {	Tiered Compilation
performance_comparison.py	def compute(n):	AOT vs JIT vs Interpreter
deoptimization_example.java	public class DeoptExample {	Deoptimization

Key takeaways

The JIT's real power is not compilation; it's speculative optimization using runtime profiles. It inlines virtual calls that static compilers can never inline because it knows what type actually shows up 99% of the time.

Deoptimization is not a failure; it's a safety net that makes speculative optimization safe to deploy. The danger is silent deopt storms from late class loading or type profile changes during peak traffic.

Inlining is the master optimization: when a callee is inlined, constants propagate across the boundary, dead branches disappear, and heap allocations can become stack allocations. Your method's bytecode size (not line count) is what controls whether it inlines.

Never microbenchmark in a plain main() loop on the JVM. OSR compilation, dead-code elimination, and lack of proper warmup mean you're measuring the JIT's warm-up artifact, not your code's steady-state performance. JMH exists for a reason.

Type profiling is the lifeblood of JIT performance. Keep polymorphic call sites monomorphic or bimorphic. One megamorphic site in a hot path can kill inlining for the entire method chain.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

Walk me through exactly what happens inside the JVM the first time a met...

Q02SENIOR

What is deoptimization, when does it trigger in a production JVM, and ho...

Q03SENIOR

You're asked to benchmark two string concatenation approaches — using '+...

Q04SENIOR

What is escape analysis and how does it interact with JIT inlining? Can ...

Q01 of 04SENIOR

Walk me through exactly what happens inside the JVM the first time a method is called, the 2,000th time, and the 15,000th time — specifically what the JIT does at each threshold and why tiered compilation exists instead of going straight to C2.

ANSWER

First call: pure interpretation in Tier 0. The interpreter profiles invocation count, back-edge count, branch probabilities, and receiver types. Around 2,000 invocations (the Tier3CompileThreshold default, which also counts loop back-edges), C1 compiles the method with light optimizations: constant folding, simple inlining, and dead code elimination. C1-compiled code is 2-5x faster than interpreted but still profiles. At ~15,000 (Tier4CompileThreshold), C2 takes over. C2 spends more time compiling (maybe hundreds of milliseconds) because it uses the profiling data collected during C1 execution to make aggressive speculative optimizations: virtual call inlining, escape analysis, loop unrolling, and more. Tiered compilation exists because going straight to C2 would waste CPU compiling methods that might be executed only a few times, and the profile data collected during C1 makes C2's optimizations far more effective.

FAQ · 5 QUESTIONS

Frequently Asked Questions

Why does my Java application get faster after running for a while?

What's the difference between JIT compilation and AOT compilation?

Does the JIT compiler work differently for JavaScript than for Java?

What is the 'code cache' and how do I know if it's full?

Should I disable tiered compilation with -XX:-TieredCompilation?

Naren Founder & Principal Engineer

20+ years shipping production systems from the metal up. Lessons pulled from things that broke in production.

✓ Verified

production tested

Last updated: July 19, 2026
