
Just-In-Time Compilation Explained: How JITs Turn Bytecode Into Blazing Native Code

In Plain English 🔥
Imagine a chef who receives recipe cards written in a foreign language. A traditional interpreter reads each instruction one at a time, translating as they cook — slow but starts immediately. A JIT compiler is like a chef who notices they make the same dish fifty times a day, so they memorize it in their native language and execute it from muscle memory from then on. The more they cook it, the faster they get — because the work of translating happens once and the result gets reused forever.

Every time you run a Java or Python program and it magically gets faster the longer it runs, that's a Just-In-Time compiler quietly doing something remarkable: watching your code execute, figuring out which paths are traveled most, and recompiling those exact paths into hyper-optimized native machine code — at runtime. No restart required, no ahead-of-time guessing. The JIT is one of the most sophisticated pieces of software running silently in your production systems right now.

The problem it solves is fundamental: interpreted languages are portable because they run on a virtual machine, but virtual machines are slow because they translate instructions at runtime. Ahead-of-time compilers solve speed but sacrifice runtime information — they can't know which branch your users actually take or what types your polymorphic methods actually receive. JIT compilation threads this needle by compiling adaptively, using real execution data to make optimizations no static compiler could ever make.

By the end of this article you'll understand exactly how HotSpot's tiered compilation pipeline works, what profiling data the JIT actually collects, why deoptimization exists and when it fires, how to read JIT logs to debug performance regressions, and what production patterns silently kill JIT effectiveness. You'll go from 'the JVM warms up' to 'I can explain exactly what's happening during warmup and why.'

The JIT Pipeline: From Bytecode to Native Code in Three Tiers

HotSpot JVM doesn't flip a single switch from 'interpreted' to 'compiled'. It runs a tiered system with five distinct levels, though three are conceptually important: pure interpretation (Tier 0), the C1 client compiler (Tiers 1-3), and the C2 server compiler (Tier 4).

Tier 0 is pure interpretation — the interpreter executes bytecode directly and, critically, it's also gathering profiling data: method invocation counts, branch frequencies, and receiver type profiles for virtual calls. This data is cheap to collect and priceless later.

Once a method is invoked roughly 2,000 times (the default -XX:Tier3CompileThreshold), C1 compiles it quickly into native code with light optimizations. C1 is fast to compile and produces code about 2-5x faster than interpretation. But it keeps profiling.

Once that same method hits roughly 15,000 invocations or its loop back-edges accumulate enough, C2 takes over. C2 spends significantly more time compiling — using the profiling data C1 collected — and produces code that rivals hand-written C. The key insight is that C2 can inline virtual method calls because the profile told it 'this call site always receives a HashMap, never anything else.' It bets on that. If it's wrong, it deoptimizes.

TieredCompilationDemo.java · JAVA
import java.util.HashMap;
import java.util.Map;

/**
 * Demonstrates tiered JIT compilation by measuring throughput
 * across the warmup curve. Run with:
 *   java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions \
 *        -XX:+PrintInlining TieredCompilationDemo
 *
 * You will see C1 compilations appear early (tier column 1-3),
 * then C2 (tier 4) take over the hot methods later.
 */
public class TieredCompilationDemo {

    // This map lookup is our 'hot method' — simple, called millions of times
    private static final Map<String, Integer> wordFrequency = new HashMap<>();

    static {
        wordFrequency.put("java", 42);
        wordFrequency.put("jit", 99);
        wordFrequency.put("compiler", 7);
    }

    /**
     * A method that will become extremely hot and get fully C2-compiled.
     * Notice it's small — small methods get inlined at the call site,
     * which is even faster than a compiled method call.
     */
    private static int lookupFrequency(String word) {
        Integer frequency = wordFrequency.get(word);
        return (frequency != null) ? frequency : 0;
    }

    public static void main(String[] args) throws InterruptedException {
        String[] wordsToLookup = {"java", "jit", "compiler", "unknown"};
        int totalIterations = 500_000;

        // Measure throughput across 5 equal batches to see JIT warmup curve
        int batchSize = totalIterations / 5;

        for (int batch = 0; batch < 5; batch++) {
            long startNanos = System.nanoTime();

            for (int i = 0; i < batchSize; i++) {
                // Rotate through all words so the JIT sees a stable type profile
                String word = wordsToLookup[i % wordsToLookup.length];
                int freq = lookupFrequency(word);

                // Prevent dead-code elimination — the JIT is smart enough
                // to delete code whose result is never used
                if (freq < 0) {
                    System.out.println("Negative frequency — impossible but prevents DCE");
                }
            }

            long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
            double throughput = (double) batchSize / elapsedMs * 1000;

            System.out.printf(
                "Batch %d: %,d lookups in %d ms → %,.0f lookups/sec%n",
                batch + 1, batchSize, elapsedMs, throughput
            );

            // Brief pause lets you see PrintCompilation output between batches
            Thread.sleep(50);
        }
    }
}
▶ Output
Batch 1: 100,000 lookups in 48 ms → 2,083,333 lookups/sec
Batch 2: 100,000 lookups in 12 ms → 8,333,333 lookups/sec
Batch 3: 100,000 lookups in 4 ms → 25,000,000 lookups/sec
Batch 4: 100,000 lookups in 3 ms → 33,333,333 lookups/sec
Batch 5: 100,000 lookups in 3 ms → 33,333,333 lookups/sec

(With -XX:+PrintCompilation you'll also see lines like:)
109 1 3 java.lang.String::hashCode (55 bytes)
214 2 3 TieredCompilationDemo::lookupFrequency (16 bytes)
891 3 4 TieredCompilationDemo::lookupFrequency (16 bytes)
(The '3' → '4' transition is C1 → C2. That's the warmup cliff becoming a plateau.)
⚠️
Pro Tip: Dead Code Elimination Will Fool Your Benchmarks
The JIT is smart enough to detect when a computation's result is never observed and delete the entire computation. Every microbenchmark that doesn't consume its output is measuring nothing. Use JMH's Blackhole.consume() or, at minimum, accumulate results into a variable you print at the end. The code above uses the 'freq < 0' trick — crude but effective for demos.
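When JMH isn't on hand, one common stand-in for Blackhole.consume() is a volatile sink field — the volatile write is an observable effect the optimizer can't prove unused, so the computation feeding it survives. This is a minimal sketch; the class and field names are illustrative, not a library API.

```java
public class ConsumeSketch {
    // A volatile write is an observable side effect, so the JIT cannot
    // prove the stored value unused and delete the computation feeding it.
    static volatile long sink;

    static void consume(long value) {
        sink = value;
    }

    public static void main(String[] args) {
        long total = 0;
        for (int i = 0; i < 1_000_000; i++) {
            total += Integer.bitCount(i);   // the work we actually want timed
        }
        consume(total);                     // keeps the loop from being deleted
        System.out.println("checksum: " + total);
    }
}
```

This is cruder than Blackhole (which also defeats constant folding of the consumed value), but it is enough to stop whole-loop elimination in a quick demo.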

Speculative Optimization and Deoptimization: The JIT's Calculated Gamble

The most powerful and most misunderstood JIT technique is speculative optimization. The C2 compiler doesn't just optimize what it knows to be true — it optimizes what the profiling data suggests is almost always true, then installs a guard that triggers deoptimization if that assumption is violated.

Consider a polymorphic call site: animal.speak() where Animal is an interface. If the profile says 99.9% of calls see a Dog object, C2 inlines Dog.speak() directly at that call site, eliminating the virtual dispatch entirely. It inserts a type check guard: 'if this isn't a Dog, bail out.' When a Cat suddenly arrives, the JIT traps that guard, tosses out the compiled code for that method, and drops back to interpreter mode — this is deoptimization.
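Conceptually, the compiled result of that bet looks like this hand-written sketch (illustrative Java, not actual C2 output): a cheap type-check guard protects a directly inlined body, and the mismatch branch stands in for the uncommon trap that, in the real JVM, deoptimizes back to the interpreter.

```java
public class GuardedInlineSketch {
    interface Animal { String speak(); }
    static class Dog implements Animal { public String speak() { return "woof"; } }
    static class Cat implements Animal { public String speak() { return "meow"; } }

    // What the source code says: a virtual call through the interface.
    static String virtualCall(Animal a) {
        return a.speak();
    }

    // Roughly what C2 emits after the profile says "always Dog":
    // guard on the concrete type, inline Dog.speak() at the call site,
    // and treat any other type as the rare bail-out path.
    static String speculated(Animal a) {
        if (a.getClass() == Dog.class) {
            return "woof";   // Dog.speak() inlined — no virtual dispatch
        }
        // Uncommon-trap stand-in: the real JVM would deoptimize here;
        // this sketch just falls back to the ordinary virtual call.
        return a.speak();
    }

    public static void main(String[] args) {
        System.out.println(speculated(new Dog())); // guarded fast path: woof
        System.out.println(speculated(new Cat())); // fallback path: meow
    }
}
```

The guard costs one pointer comparison; the payoff is that the inlined body is now visible to every downstream optimization.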

Deoptimization is not catastrophic in isolation, but watch for these triggers in production: loading a new class that invalidates a 'this class has no subclasses' assumption (ClassLoading deopt), a null being seen at a previously non-null call site, or hitting a branch that was never taken during profiling. Each deopt event forces recompilation, and if they happen in a tight loop during peak traffic, you'll see latency spikes that look identical to GC pauses but won't show up in GC logs.

You can observe deopt events with the JVM's deoptimization logging (on modern JDKs, -Xlog:deoptimization=debug) — every senior Java engineer should spend a day reading these logs in a staging environment.

DeoptimizationTriggerDemo.java · JAVA
/**
 * Demonstrates speculative inlining and the deoptimization it causes.
 *
 * Run with:
 *   java -Xlog:deoptimization=debug -XX:+PrintCompilation \
 *        DeoptimizationTriggerDemo
 *
 * You'll see 'made not entrant' and 'uncommon trap' in the output
 * exactly when we introduce the second type at the call site.
 */
public class DeoptimizationTriggerDemo {

    interface Greeter {
        String greet(String name);
    }

    static class FriendlyGreeter implements Greeter {
        @Override
        public String greet(String name) {
            // JIT will speculatively inline THIS body if it's the only type seen
            return "Hey there, " + name + "!";
        }
    }

    static class FormalGreeter implements Greeter {
        @Override
        public String greet(String name) {
            return "Good day, " + name + ". How do you do?";
        }
    }

    /**
     * This is the polymorphic call site.
     * During Phase 1 the JIT sees only FriendlyGreeter here.
     * It speculatively inlines FriendlyGreeter.greet() and removes
     * the virtual dispatch overhead entirely.
     */
    private static String performGreeting(Greeter greeter, String name) {
        return greeter.greet(name); // <-- the hot call site
    }

    public static void main(String[] args) throws InterruptedException {
        Greeter friendlyGreeter = new FriendlyGreeter();
        Greeter formalGreeter   = new FormalGreeter();

        System.out.println("=== Phase 1: Warming up with monomorphic call site ===");
        System.out.println("JIT will speculatively inline FriendlyGreeter.greet()\n");

        // 100,000 calls with only one concrete type — C2 will inline aggressively
        long sumLength = 0;
        for (int i = 0; i < 100_000; i++) {
            String result = performGreeting(friendlyGreeter, "Alice");
            sumLength += result.length(); // consume the result to prevent DCE
        }
        System.out.printf("Phase 1 complete. Total chars processed: %,d%n%n", sumLength);

        // Pause so PrintCompilation output is clearly separated
        Thread.sleep(200);

        System.out.println("=== Phase 2: Introducing second type — watch for deoptimization ===");
        System.out.println("The guard check will now fail. Expect 'made not entrant' in logs.\n");

        // Now we alternate between two types — this blows the monomorphic assumption
        sumLength = 0;
        for (int i = 0; i < 50_000; i++) {
            // Alternating receivers make the call site bimorphic
            Greeter active = (i % 2 == 0) ? friendlyGreeter : formalGreeter;
            String result = performGreeting(active, "Bob");
            sumLength += result.length();
        }
        System.out.printf("Phase 2 complete. Total chars processed: %,d%n", sumLength);
        System.out.println("\nCheck your console above for deoptimization log output.");
        System.out.println("Look for 'class_check' style reason codes in the uncommon trap messages.");
    }
}
▶ Output
=== Phase 1: Warming up with monomorphic call site ===
JIT will speculatively inline FriendlyGreeter.greet()

Phase 1 complete. Total chars processed: 1,700,000

=== Phase 2: Introducing second type — watch for deoptimization ===
The guard check will now fail. Expect 'made not entrant' in logs.

Phase 2 complete. Total chars processed: 1,100,000

(In the deoptimization log you'll see lines like — exact format varies by JDK version:)
Uncommon trap: reason=class_check action=maybe_recompile
Deoptimizing frame: DeoptimizationTriggerDemo.performGreeting()
(After this, the JIT recompiles performGreeting with a bimorphic inline cache — both types inlined behind a type check. Still fast, just not monomorphic-fast.)
⚠️
Watch Out: Class Loading in Production Causes Silent Deoptimization
If your app lazily loads plugin classes or deserializes new types at runtime, every previously compiled method that assumed 'this abstract class has only one implementation' will deoptimize. In microservices this often hits during the first request after a dependency is initialized. Use eager class loading in startup probes and pre-warm by replaying a representative traffic sample before marking a pod healthy.

What the JIT Actually Inlines — And Why Inlining Is the Master Optimization

Experienced engineers know 'inlining' is good, but few can articulate why it's the master optimization that enables all others. Here's the mechanism: when the JIT inlines a called method into its caller, the combined code body is now visible to the optimizer as a single unit. Constants propagate across the former call boundary, dead branches get eliminated, allocations can be stack-allocated (scalar replaced) instead of heap-allocated, and loop invariants can be hoisted. Without inlining, each of these is blocked by the opacity of the call.
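You can see the mechanism by doing the inlining by hand (illustrative code, not compiler output): once applyDiscount() is inlined into its caller, the constant 0 flows into the branch, the discount arithmetic dies, and the whole call collapses to a plain return.

```java
public class InliningEnablesSketch {
    // Small callee — a prime inlining candidate.
    static int applyDiscount(int price, int discountPercent) {
        if (discountPercent == 0) {
            return price;   // this branch becomes dead once the caller's constant arrives
        }
        return price - (price * discountPercent / 100);
    }

    // Before inlining: the optimizer sees an opaque call and must keep
    // the branch, the multiply, and the divide alive.
    static int totalBefore(int price) {
        return applyDiscount(price, 0);
    }

    // After inlining + constant propagation, the call above collapses to
    // exactly this. The JIT derives it automatically; it's written out
    // by hand here only to show the end state.
    static int totalAfter(int price) {
        return price;
    }

    public static void main(String[] args) {
        // Same answer either way — the difference is what the optimizer
        // can prove once the call boundary disappears.
        System.out.println(totalBefore(250) == totalAfter(250)); // true
    }
}
```

Multiply this effect across getters, validators, and utility calls on a hot path and you get the compounding payoff the paragraph above describes.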

The JIT decides what to inline based on three factors: method size (bytecode size, controlled by -XX:MaxInlineSize, default 35 bytes and -XX:FreqInlineSize, default 325 bytes for hot methods), call frequency from the profile, and call chain depth. Getters, setters, and small utility methods almost always get inlined. Methods that exceed the size threshold won't, even if they're blazing hot — this is a common performance trap.

The practical consequence: your method boundaries matter for JIT performance in ways that have nothing to do with code organization. A method that's 36 bytecodes long might not inline where a 34-bytecode version would. You can verify inlining decisions with -XX:+PrintInlining and -XX:+UnlockDiagnosticVMOptions. Look for '@ X callee is too large' messages — those are your inlining failures.

InliningThresholdDemo.java · JAVA
/**
 * Shows how method size affects JIT inlining and the performance
 * difference between an inlined vs non-inlined hot path.
 *
 * Run both scenarios and compare throughput:
 *   java -XX:+PrintInlining -XX:+UnlockDiagnosticVMOptions \
 *        -XX:+PrintCompilation InliningThresholdDemo
 *
 * To artificially lower the inline threshold and see more failures:
 *   java -XX:MaxInlineSize=10 InliningThresholdDemo
 */
public class InliningThresholdDemo {

    /**
     * Small method — well within the 35-byte default MaxInlineSize.
     * The JIT will inline this at every call site, making the call
     * effectively zero-cost and enabling constant folding in callers.
     */
    private static double calculateCircleArea(double radius) {
        return Math.PI * radius * radius; // compiles to ~8 bytecodes
    }

    /**
     * Artificially inflated method — same logic, but padded with
     * unnecessary intermediate variables that push bytecode size past
     * the inlining threshold. In real code this happens with logging,
     * null checks, and defensive validation scattered through hot methods.
     */
    private static double calculateCircleAreaVerbose(double radius) {
        double pi          = Math.PI;
        double radiusSquared = radius * radius;
        double rawArea     = pi * radiusSquared;
        double roundedArea = Math.round(rawArea * 1_000_000.0) / 1_000_000.0;
        // Extra branching and method calls push this over MaxInlineSize
        if (Double.isNaN(roundedArea) || Double.isInfinite(roundedArea)) {
            throw new ArithmeticException("Invalid radius produced non-finite area: " + radius);
        }
        return roundedArea;
    }

    private static long benchmarkSmallMethod(int iterations) {
        double accumulator = 0.0; // accumulate to prevent dead-code elimination
        long startNanos = System.nanoTime();

        for (int i = 1; i <= iterations; i++) {
            double radius = i * 0.001; // different radius each time prevents constant folding
            accumulator += calculateCircleArea(radius);
        }

        long elapsedNanos = System.nanoTime() - startNanos;
        // Print accumulator so the JIT can't eliminate the loop
        System.out.printf("  Small method total area sum: %.2f%n", accumulator);
        return elapsedNanos;
    }

    private static long benchmarkVerboseMethod(int iterations) {
        double accumulator = 0.0;
        long startNanos = System.nanoTime();

        for (int i = 1; i <= iterations; i++) {
            double radius = i * 0.001;
            accumulator += calculateCircleAreaVerbose(radius);
        }

        long elapsedNanos = System.nanoTime() - startNanos;
        System.out.printf("  Verbose method total area sum: %.2f%n", accumulator);
        return elapsedNanos;
    }

    public static void main(String[] args) {
        int warmupIterations = 200_000;
        int benchIterations  = 2_000_000;

        // Warmup phase — let both methods reach C2 compilation tier
        System.out.println("Warming up JIT (both methods to Tier 4)...");
        benchmarkSmallMethod(warmupIterations);
        benchmarkVerboseMethod(warmupIterations);

        // Actual benchmark
        System.out.println("\n--- Benchmark (" + benchIterations + " iterations each) ---");

        long smallNanos   = benchmarkSmallMethod(benchIterations);
        long verboseNanos = benchmarkVerboseMethod(benchIterations);

        System.out.printf("%n  Small method (likely inlined):   %,d ms%n", smallNanos / 1_000_000);
        System.out.printf("  Verbose method (may not inline): %,d ms%n", verboseNanos / 1_000_000);
        System.out.printf("  Overhead factor: %.2fx%n",
            (double) verboseNanos / smallNanos);
        System.out.println("\nCheck PrintInlining output for '@ X callee is too large' to confirm.");
    }
}
▶ Output
Warming up JIT (both methods to Tier 4)...
  Small method total area sum: 8377643241.42
  Verbose method total area sum: 8377643241.42

--- Benchmark (2000000 iterations each) ---
  Small method total area sum: 8377586692758.09
  Verbose method total area sum: 8377586692758.09

  Small method (likely inlined):   8 ms
  Verbose method (may not inline): 31 ms
  Overhead factor: 3.87x

Check PrintInlining output for '@ X callee is too large' to confirm.

(With -XX:+PrintInlining you'll see:)
@ 4 InliningThresholdDemo::calculateCircleArea (8 bytes) inline (hot)
@ 4 InliningThresholdDemo::calculateCircleAreaVerbose (67 bytes) callee is too large
🔥
Interview Gold: Why Getters Should Be Tiny
This is the real engineering reason to keep getters and utility methods concise — it's not style, it's JIT physics. A getter that's 5 bytecodes inlines everywhere it's called. Add a null check, a log statement, and a metrics increment and it might cross MaxInlineSize. Now every call to it carries the overhead of a real method call plus it blocks all cross-boundary optimizations in the caller. Profile first, but understand why the boundary matters.

Production JIT Gotchas: Warmup Strategies, OSR, and the Flags That Actually Matter

On-Stack Replacement (OSR) is a JIT feature you've almost certainly benefited from without knowing its name. Normally, a method is compiled and the next invocation runs the compiled version. But what about a method with a loop that runs for ten million iterations in a single call? Without OSR, you'd interpret all ten million iterations because the method never returns to get recompiled. OSR solves this by replacing the executing method frame mid-execution — the JIT compiles the method while it runs and swaps the stack frame to the compiled version at a loop back-edge. OSR-compiled code is slightly less optimal than normal JIT-compiled code because the frame layout must match the interpreter's at the replacement point, limiting some optimizations.

For microservices and serverless, warmup is an existential problem. Your JIT hasn't seen enough traffic to compile the hot paths, so your first thousand requests are slow — potentially violating SLAs. Three production strategies work: (1) Replay-based warmup using recorded traffic replayed at startup before the instance joins the load balancer. (2) Ahead-of-time assistance: CDS/AppCDS (Class Data Sharing) cuts class-loading work at startup (it shares class metadata, not JIT profiles), while GraalVM's PGO (Profile-Guided Optimization) serializes profiles from a training run. (3) Flag tuning — lowering the tiered thresholds (e.g. -XX:Tier4InvocationThreshold and -XX:Tier4CompileThreshold below their defaults of 5,000 and 15,000; the legacy -XX:CompileThreshold only applies with -XX:-TieredCompilation) makes methods compile sooner but with less profile data, trading slightly less optimal code for faster warmup.
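Strategy (1) can be sketched as a readiness gate: the instance replays recorded, representative calls until the hot paths have been driven through the JIT tiers, then flips the flag a health-check endpoint reports. All class, method, and request names here are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

public class WarmupGateSketch {
    private final AtomicBoolean ready = new AtomicBoolean(false);

    // Stand-in for the real request handler you want C1/C2-compiled
    // before the load balancer starts sending live traffic.
    int handleRequest(String payload) {
        return payload.hashCode() ^ payload.length();
    }

    // Replay recorded, representative inputs enough times to push the
    // hot path through the compilation tiers, then flip the readiness
    // flag that the health-check endpoint reports.
    void warmUp(List<String> recordedRequests, int rounds) {
        long sink = 0;  // consume results so the warmup loop isn't dead code
        for (int r = 0; r < rounds; r++) {
            for (String request : recordedRequests) {
                sink += handleRequest(request);
            }
        }
        if (sink == Long.MIN_VALUE) System.out.println(sink); // never true; defeats DCE
        ready.set(true);
    }

    boolean isReady() { return ready.get(); }

    public static void main(String[] args) {
        WarmupGateSketch gate = new WarmupGateSketch();
        System.out.println("ready before warmup: " + gate.isReady());
        gate.warmUp(List.of("GET /orders/42", "GET /orders/7"), 20_000);
        System.out.println("ready after warmup:  " + gate.isReady());
    }
}
```

In a real Kubernetes deployment this is the logic behind the readiness probe: the pod only reports healthy after warmUp() returns, so cold instances never see live traffic.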

GraalVM Native Image takes the opposite trade: it compiles everything AOT using Substrate VM, eliminating warmup entirely at the cost of peak throughput (no runtime profiles) and dynamic class loading.

OsrAndWarmupDemo.java · JAVA
/**
 * Demonstrates On-Stack Replacement (OSR) by creating a single
 * method with a loop long enough to trigger JIT compilation
 * while the method is already executing.
 *
 * Run with:
 *   java -XX:+PrintCompilation OsrAndWarmupDemo
 *
 * Look for '%' sign in PrintCompilation output — that's OSR compilation.
 * Normal compilations show no '%'. OSR compilations do.
 *
 * Example line:
 *   1234  42 %  4   OsrAndWarmupDemo::longRunningSetup @ 15 (88 bytes)
 *   The '@15' means the OSR entry point is at bytecode index 15 (the loop back-edge).
 */
public class OsrAndWarmupDemo {

    /**
     * This method is called ONCE but runs long enough that the JIT
     * compiles it mid-execution. Without OSR, we'd interpret all
     * 5 million iterations. With OSR, after ~10,000 iterations the
     * JIT compiles the method and swaps us into native code seamlessly.
     */
    private static double longRunningSetup(int iterationCount) {
        double runningTotal = 0.0;

        // This loop is the OSR candidate — it runs long enough in a single
        // method call to trigger compilation via back-edge counter
        for (int i = 1; i <= iterationCount; i++) {
            // Enough work that the loop body isn't trivially eliminated
            runningTotal += Math.sqrt(i) * Math.log1p(i);

            // Progress marker — remove in real code, just for demo visibility
            if (i == 10_000) {
                System.out.println("  [iteration 10,000] — JIT likely compiling this method NOW via OSR");
            }
            if (i == 50_000) {
                System.out.println("  [iteration 50,000] — now running in OSR-compiled native code");
            }
        }

        return runningTotal;
    }

    /**
     * Demonstrates the production warmup pattern: calling lightweight
     * versions of your hot paths during startup before accepting traffic.
     */
    private static void simulateWarmupPhase() {
        System.out.println("\n=== Warmup Phase (simulating pre-traffic JIT priming) ===");

        // In production this would be replayed production requests.
        // Here we call our hot method multiple times with representative inputs
        // so C2 compiles it before real users arrive.
        for (int warmupRound = 0; warmupRound < 5; warmupRound++) {
            double result = longRunningSetup(20_000); // short but enough to profile
            System.out.printf("  Warmup round %d result: %.2f%n", warmupRound + 1, result);
        }

        System.out.println("Warmup complete — instance ready to serve traffic.\n");
    }

    public static void main(String[] args) {
        System.out.println("=== Phase 1: Single Long-Running Method Call (OSR Demo) ===");
        System.out.println("Calling longRunningSetup ONCE with 5 million iterations.");
        System.out.println("Watch for OSR compilation to fire mid-execution:\n");

        long startNanos = System.nanoTime();
        double osrResult = longRunningSetup(5_000_000); // one call, very long loop
        long osrElapsedMs = (System.nanoTime() - startNanos) / 1_000_000;

        System.out.printf("OSR demo result: %.4f | Elapsed: %d ms%n", osrResult, osrElapsedMs);

        // Now show the production warmup pattern
        simulateWarmupPhase();

        System.out.println("=== Phase 2: Post-warmup benchmark (fully C2 compiled) ===");
        startNanos = System.nanoTime();
        double warmResult = longRunningSetup(5_000_000);
        long warmElapsedMs = (System.nanoTime() - startNanos) / 1_000_000;

        System.out.printf("Warm result: %.4f | Elapsed: %d ms%n", warmResult, warmElapsedMs);
        System.out.printf("Speedup after full warmup: %.1fx%n",
            (double) osrElapsedMs / Math.max(warmElapsedMs, 1));
    }
}
▶ Output
=== Phase 1: Single Long-Running Method Call (OSR Demo) ===
Calling longRunningSetup ONCE with 5 million iterations.
Watch for OSR compilation to fire mid-execution:

[iteration 10,000] — JIT likely compiling this method NOW via OSR
[iteration 50,000] — now running in OSR-compiled native code
OSR demo result: 110001762148.3326 | Elapsed: 312 ms

=== Warmup Phase (simulating pre-traffic JIT priming) ===
Warmup round 1 result: 17418117.80
Warmup round 2 result: 17418117.80
Warmup round 3 result: 17418117.80
Warmup round 4 result: 17418117.80
Warmup round 5 result: 17418117.80
Warmup complete — instance ready to serve traffic.

=== Phase 2: Post-warmup benchmark (fully C2 compiled) ===
Warm result: 110001762148.3326 | Elapsed: 89 ms
Speedup after full warmup: 3.5x

(With -XX:+PrintCompilation, Phase 1 shows a line with '%' indicating OSR:
312 17 % 4 OsrAndWarmupDemo::longRunningSetup @ 15 (88 bytes))
⚠️
Watch Out: Microbenchmarks in main() Measure OSR, Not Peak Performance
When you write a benchmark loop directly in main(), the JIT compiles it via OSR — an inherently less-optimized compilation mode. Your benchmark results look worse than production reality because OSR-compiled code has constraints normal compilations don't. Always use JMH for Java microbenchmarks. JMH drives the method into normal (non-OSR) compiled state by invoking it via a framework harness that triggers standard compilation before the measurement window opens.
| Aspect | Interpreter | JIT (C1/C2 Tiered) | AOT (GraalVM Native) |
| --- | --- | --- | --- |
| Startup latency | Instant start, slow execution | Fast start, warming over ~10k invocations | Instant start, instant peak speed |
| Peak throughput | ~10-50x slower than native | Near-native (within 5-20% of C) | Good but below JIT peak — no runtime profiles |
| Memory overhead | Low (no compiled code cache) | JIT code cache: typically 64-256 MB | Lowest — binary includes only reachable code |
| Dynamic class loading | Full support | Full support | Not supported — closed-world assumption |
| Profile-guided opts | None | Full — type profiles, branch frequencies | Partial — requires offline PGO training run |
| Deoptimization | N/A — nothing to deopt | Yes — on assumption violations | N/A — static binary, no speculative opts |
| Reflection support | Full | Full | Partial — requires config hints at build time |
| Ideal workload | Short scripts, startup-critical CLIs | Long-running services, throughput servers | Serverless, CLIs, latency-sensitive cold starts |
| Debugging/profiling | Easy | Moderate — async-profiler recommended | Hard — limited runtime introspection |

🎯 Key Takeaways

  • The JIT's real power is not compilation — it's speculative optimization using runtime profiles. It inlines virtual calls that static compilers can never inline because it knows what type actually shows up 99% of the time.
  • Deoptimization is not a failure — it's a safety net that makes speculative optimization safe to deploy. The danger is silent deopt storms from late class loading or type profile changes during peak traffic.
  • Inlining is the master optimization: when a callee is inlined, constants propagate across the boundary, dead branches disappear, and heap allocations can become stack allocations. Your method's bytecode size (not line count) is what controls whether it inlines.
  • Never microbenchmark in a plain main() loop on the JVM. OSR compilation, dead-code elimination, and lack of proper warmup mean you're measuring the JIT's warm-up artifact, not your code's steady-state performance. JMH exists for a reason.

⚠ Common Mistakes to Avoid

  • Mistake 1: Writing JVM microbenchmarks in a plain main() loop — The JIT compiles the loop via On-Stack Replacement, which is less optimized than standard compilation, making results 2-5x slower than the real peak performance. Fix: Use JMH (Java Microbenchmark Harness). It drives methods into standard compiled state via repeated invocation before opening the measurement window, giving you true steady-state numbers.
  • Mistake 2: Assuming 'the JVM warms up in a few seconds' — Tier 4 (C2) compilation of all hot paths in a real application commonly takes 30,000–100,000 method invocations per method, and large applications have hundreds of hot methods. At 1,000 requests/second you might need 30+ seconds of real traffic to fully warm. Fix: Load-test with realistic traffic for at least 60 seconds before recording performance baselines, and implement an explicit warmup phase in your Kubernetes readiness probe that replays stored traffic before marking the pod ready.
  • Mistake 3: Touching compiler thresholds and -XX:MaxInlineSize without measuring — Developers lower the tiered thresholds (or -XX:CompileThreshold, which only applies with tiered compilation disabled) hoping for faster warmup, but the JIT then compiles with less profile data, meaning speculative inlining bets are wrong more often, causing more deoptimizations and ultimately worse peak throughput. Fix: Measure warmup time vs. peak throughput as a trade-off curve specific to your workload. Use -XX:+PrintCompilation and deoptimization logging to count deopt events before and after flag changes. Only tune after you have data.

Interview Questions on This Topic

  • Q: Walk me through exactly what happens inside the JVM the first time a method is called, the 2,000th time, and the 15,000th time — specifically what the JIT does at each threshold and why tiered compilation exists instead of going straight to C2.
  • Q: What is deoptimization, when does it trigger in a production JVM, and how would you diagnose a latency spike that turned out to be caused by deoptimization rather than garbage collection?
  • Q: You're asked to benchmark two string concatenation approaches — using '+' in a loop versus StringBuilder. You write a simple main() method with a for loop and find '+' is only 10% slower. A colleague says your benchmark is wrong. Who's right and why? (The trap: both are affected by OSR compilation and dead-code elimination — the JIT eliminates the intermediate strings if results aren't consumed, and OSR makes both paths slower than peak. The correct answer is 'use JMH and consume results with Blackhole'.)

Frequently Asked Questions

Why does my Java application get faster after running for a while?

This is JIT compilation kicking in. The JVM starts by interpreting your bytecode while collecting profiling data about which methods run most and what types they receive. Once a method crosses invocation thresholds (~2,000 for C1, ~15,000 for C2), the JIT compiles it to native machine code using those profiles for aggressive optimization. The process typically plateaus after 30-60 seconds of realistic traffic.

What's the difference between JIT compilation and AOT compilation?

JIT compiles code at runtime using actual execution profiles, enabling speculative optimizations like virtual call inlining that no static compiler can safely perform. AOT (like GraalVM Native Image) compiles everything to a native binary before execution, giving instant startup and no warmup cost but losing the ability to optimize based on actual runtime behavior. For long-running throughput servers, JIT typically wins on peak performance. For CLIs and serverless, AOT wins on startup latency.

Does the JIT compiler work differently for JavaScript than for Java?

The high-level strategy is similar — profile hot paths and compile them to native code — but the challenges differ dramatically. JavaScript is dynamically typed, so the JIT must profile the hidden-class 'shapes' of objects and deoptimize aggressively when shapes change. V8's Ignition interpreter feeds TurboFan with type feedback much as HotSpot's interpreter feeds C2. The key difference is that Java's static type system gives the JIT much stronger guarantees from the start, while JavaScript JITs must be far more defensive about deoptimization.

TheCodeForge Editorial Team · Verified Author

Written and reviewed by senior developers with real-world experience across enterprise, startup and open-source projects. Every article on TheCodeForge is written to be clear, accurate and genuinely useful — not just SEO filler.
