Advanced 18 min · March 06, 2026

Compiler Code Generation — When Register Spills Corrupt

Q: What is the most common cause of silent data corruption in code generation?

Incorrect register assignment or stack layout decisions that violate the target ABI, such as struct alignment mismatches between architectures. The compiler produces no warning — the corruption only appears at runtime.

Q: How do I debug a crash that only happens in release builds?

Start by inspecting the generated assembly, not your source code. Use compiler flags like -fopt-info (GCC) or -Rpass=.* (LLVM) to trace which codegen pass transformed your code, then bisect those passes to isolate the faulty transformation.

Q: Why does unreachable code cause crashes in modern CPUs?

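The code generator places a trap instruction (ud2 on x86, udf or brk on ARM) in basic blocks the optimizer has proved unreachable. Speculative execution alone cannot raise that fault, because mis-speculated instructions are discarded before they retire. If the trap fires, control really did reach the block, usually because the proof rested on an assumption (no undefined behavior, no corrupted state) that did not hold at runtime. The crash then appears at a location your source logic says is impossible.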

Q: What does 'relocation truncated to fit' at link time mean?

It means the code generator chose a relocation type whose offset range the final binary layout violated. Check the relocation table with readelf -r before assuming a linker bug — the root cause is usually in codegen's relocation selection.

A spilled register overlapped a stack frame by one byte, causing silent memory corruption.

Naren Founder & Principal Engineer

20+ years shipping production systems from the metal up. Drawn from code that ran under real load.


July 18, 2026

last updated


Before you start⏱ 30 min

✓Deep production experience
✓Understanding of internals and trade-offs
✓Experience debugging complex systems


⚡Quick Answer

Code generation turns IR into target machine instructions — every register assignment, stack layout, and instruction encoding is decided here.
Register allocation maps virtual to physical registers; spilling to memory in a hot loop has caused up to 10x slowdown in memory-bandwidth-limited workloads.
Instruction selection picks the cheapest pattern covering each IR operation — cost models are micro-architecture-specific and wrong assumptions here show up as production regressions.
Peephole optimizations polish generated code locally after initial emit — they rarely introduce bugs but interact badly with schedulers in ways that are hard to reproduce.
Production insight: most compiler CVEs and release-build-only crashes trace back to codegen, not the frontend.
Debugging: use -Rpass-missed=regalloc (LLVM reports spills as missed-optimization remarks) or -fopt-info-all (GCC) to trace spill decisions; use -fverbose-asm to annotate the generated assembly so it is easier to map back to source lines.

✦ Definition~90s read

What is Code Generation?

Before we get into the mechanics, here is a concrete demonstration of what code generation actually does. Take this three-line C function:

```c
int add(int b, int c) {
    return b + c * 2;
}
```

On x86-64 Linux with -O1, clang folds the multiply and the add into a single lea followed by ret. On ARM64 with the same flags, it emits an add with a shifted register operand followed by ret. Same source, same semantics, different machine instructions, because code generation is target-specific by definition. That is the problem it solves.

At its core, code generation takes the compiler's intermediate representation (IR) — a machine-independent, low-level program description — and translates it into the actual instruction set of the target CPU. This step decides every concrete detail: which CPU register holds a variable, which addressing mode to use for a memory access, how to arrange the stack frame, which of the dozens of x86 instruction encodings to pick for a simple addition, and whether a 64-bit constant fits in an immediate field or needs to be loaded from memory.

What makes this hard? The compiler must produce correct output for every possible IR input, while also squeezing out wasted cycles. A single incorrect register assignment corrupts the entire program state. The generated code must also respect the target's ABI — calling conventions, data alignment, exception handling unwind tables — otherwise a C++ function cannot talk to an assembly library correctly.


In modern compilers like LLVM and GCC, code generation is split into multiple sequential passes. Each pass can be inspected independently. Use -fopt-info in GCC or -Rpass=.* in LLVM to trace which pass transformed your code. When you hit a release-build-only crash, bisecting these passes — not your source code — is the fastest path to the root cause.


Another thing you will not find in textbooks: ELF relocations and PIC (position-independent code) decisions are made during code generation. If you see 'relocation truncated to fit' at link time, that is the code generator choosing a relocation type whose offset range the final binary layout violated.

Check the relocation table with readelf -r before assuming it is a linker bug — it usually is not.

Plain-English First

Imagine you write a recipe in English, then a professional chef translates it into precise kitchen instructions for a specific restaurant's equipment — listing exact burner numbers, which pan to use, in what exact order. Code generation is that translation step: your program has already been understood and optimized in a machine-independent form, and now the compiler writes precise CPU instructions for the exact hardware it's targeting. The CPU doesn't speak Python or Java — it speaks binary opcodes — and code generation is the compiler's job of bridging that gap. The translation looks mechanical but it isn't: the chef has to decide which burner to use when all six are occupied, what to do when the pan specified doesn't exist in this kitchen, and how to reorder steps so nothing burns while something else is resting. Get those decisions wrong and the dish comes out wrong — even if the recipe was perfect.

Your IDE's compile button triggers a pipeline that, in milliseconds, translates source code into native instructions. The final stage — code generation — makes the concrete decisions: which register, which instruction encoding, which stack layout. Get it wrong and you'll see silent data corruption, security vulnerabilities, or a 3× slowdown on a hot loop. That's the production reality. Most compiler CVEs trace back to codegen, not the frontend. So when you debug a crash that only appears in release builds, start with the generated assembly, not your source logic. Understanding code generation is a senior-level debugging superpower — not because you'll rewrite a compiler, but because you'll know exactly where to look when the compiler betrays you.


ABI constraints are where production bugs hide. I once spent two days debugging why a C library function returned garbage values on ARM. The code generator assumed struct alignment matched x86 conventions. It did not. This was not a bug in the algorithm; it was a wrong assumption about the target's data layout. The fix was one line: __attribute__((aligned(8))) on the struct. Three weeks of confusion, one attribute. And I mean 'hide': no error message, no warning, just wrong results.
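One way to make this class of bug fail at compile time instead of in production is to pin the layout you expect with static_assert. The sketch below is only an illustration: the Packet struct, its fields, and the payload_offset helper are hypothetical, and alignas(8) is used as the standard C++ spelling of the aligned attribute.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical record exchanged with a C library (names are illustrative).
// Without alignas, 'payload' sits at offset 4 on ABIs that align 64-bit
// integers to 4 bytes inside structs (32-bit x86 System V) and at offset 8
// on ABIs that align them to 8 (ARM EABI, x86-64, AArch64).
struct Packet {
    std::uint32_t id;
    alignas(8) std::uint64_t payload;  // pin the offset on every target
};

// Fail the build, not the production run, if the layout ever drifts.
static_assert(offsetof(Packet, payload) == 8, "payload must start at byte 8");
static_assert(alignof(Packet) == 8, "Packet must be 8-byte aligned");
static_assert(sizeof(Packet) == 16, "Packet must occupy 16 bytes");

// Runtime helper so callers can log the layout they were compiled with.
inline std::size_t payload_offset() { return offsetof(Packet, payload); }
```

Putting the same assertions in the header shared by both sides of an ABI boundary means every translation unit checks the layout with its own compiler and flags.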

One trap that catches teams repeatedly: how the code generator handles unreachable code. When the optimizer proves a basic block unreachable, the code generator may leave a trap instruction (ud2 on x86, udf or brk on ARM) in its place. Speculative execution alone cannot raise that fault, because mis-speculated instructions are discarded before they retire. If the trap fires, control really did reach the block, usually because the proof relied on an assumption (no undefined behavior, no corrupted state) that did not hold at runtime. The result is a crash at a location that looks impossible from your source. When you see a crash in a location your source logic says can never be reached, look for a trap instruction in the disassembly.
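A minimal sketch of how such a trap site comes to exist, assuming GCC or Clang (the __builtin_unreachable builtin is compiler-specific, and the Op enum and both function names are hypothetical). The first function promises the compiler that the end of the switch is never reached; the second pays for an explicit abort so that a corrupted value fails loudly at a known location.

```cpp
#include <cassert>
#include <cstdlib>

enum class Op { Add, Sub };

// Promise to the compiler: 'op' is always Add or Sub. If a corrupted value
// ever arrives, behavior is undefined; depending on compiler and flags the
// generated code may hit a trap instruction or fall into whatever follows.
inline int apply_unchecked(Op op, int a, int b) {
    switch (op) {
        case Op::Add: return a + b;
        case Op::Sub: return a - b;
    }
    __builtin_unreachable();  // GCC/Clang builtin, not standard C++
}

// Defensive variant: the "impossible" path is real code with defined behavior.
inline int apply_checked(Op op, int a, int b) {
    switch (op) {
        case Op::Add: return a + b;
        case Op::Sub: return a - b;
    }
    std::abort();  // reached only if 'op' holds an out-of-range value
}
```

For valid inputs the two functions behave identically; they differ only in what the disassembly contains after the switch and in how a bad value surfaces.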

io/thecodeforge/codegen/add_example.ll · LLVM

; Simplified LLVM IR for: clang -O1 -emit-llvm -S -o - add.c
; (attributes and metadata omitted; clang may print the multiply as 'shl nsw i32 %c, 1')
; Source: int add(int b, int c) { return b + c * 2; }
;
; Key observations:
;   1. Unlimited virtual registers (%b, %c, %mul, %add), no physical registers yet
;   2. SSA form: every name is assigned exactly once
;   3. No target-specific details: no rax, no stack frame, no calling convention
;   4. 'nsw' = no signed wrap: signed overflow is undefined behavior in C, so the
;      frontend tags the operation and later passes can exploit that fact
;
; Code generation's job is to turn THIS into the assembly below.

define i32 @add(i32 %b, i32 %c) {
entry:
  %mul = mul nsw i32 %c, 2        ; t1 = c * 2
  %add = add nsw i32 %b, %mul     ; result = b + t1
  ret i32 %add
}

; ── x86-64 output (clang -O1 -target x86_64-linux-gnu) ──────────────────────
;
; add(int, int):
;   lea  eax, [rdi + 2*rsi]  ; eax = b + c*2, folded into one address calculation
;   ret
;
; The instruction selector chose 'lea' over 'imul' plus 'add' because:
;   - lea computes base + index*scale in a single instruction
;   - lea writes to a separate destination register, so no extra mov is needed
;   - the cost model rates a simple lea at 1 cycle of latency, versus 3 for imul
;
; ── ARM64 output (clang -O1 -target aarch64-linux-gnu) ─────────────────────
;
; add(int, int):
;   add  w0, w0, w1, lsl #1  ; w0 = b + (c << 1), fused shift-and-add
;   ret
;
; ARM64 folds the shift into the second operand of 'add' (shifted-register form);
; x86-64 folds it into lea's scaled-index addressing mode.
; The same IR produces different instructions on each target.
; This is code generation being target-specific.

Output

; x86-64: 2 instructions (lea + ret)

; ARM64: 2 instructions (add with shift + ret)

;

; Same source. Same IR. Different machine code.

; That is what code generation does.

🔥Use -fverbose-asm to Map Assembly Back to Source

The fastest way to understand what the code generator did is to compile with -O2 -g -fverbose-asm -S and open the .s file. Recent GCC versions interleave the source lines with the generated instructions; Clang adds explanatory comments, and with -g its .loc directives carry the file and line for each instruction. When a crash points to an assembly address, this mapping tells you which C line produced it, with nothing more than the .s file. Make this part of your standard debugging workflow before reaching for gdb.

📊 Production Insight

A struct alignment mismatch between the code generator's assumptions and the ARM ABI caused two days of debugging on a C library integration.

The fix was one attribute. The investigation was three weeks.

Rule: when a function returns garbage on a different architecture, check struct layout and alignment before checking your logic — the ABI is usually the culprit.

Another real case: a compiler upgrade changed the greedy allocator's split heuristics, causing a loop counter to spill in a high-frequency trading system. Throughput dropped 40%. The fix was splitting the loop into two phases to reduce live variable count across the iteration boundary.

🎯 Key Takeaway

Code generation is the compiler's final concrete decision layer — every register, every stack byte, every instruction encoding.

One wrong ABI assumption corrupts state without any source-level error.

Use -fverbose-asm to map assembly to source lines; use -Rpass=.* to trace which pass made the transformation.

Bisect passes, not source code, when hunting release-build crashes.

ABI Decision Points in Code Generation

If: Target is x86-64, OS is Linux

→ Use: System V AMD64 ABI applies. First 6 integer args in rdi, rsi, rdx, rcx, r8, r9. Return value in rax. Callee saves rbx, rbp, r12-r15.

If: Target is x86-64, OS is Windows

→ Use: Microsoft x64 ABI applies. First 4 integer args in rcx, rdx, r8, r9. Shadow space of 32 bytes must be allocated by caller. Use -mabi=ms or __attribute__((ms_abi)) when mixing with System V code.

If: Linking code compiled with different ABIs

→ Use: Recompile all objects with matching ABI flags. If that is not possible, write shim functions that explicitly declare __attribute__((sysv_abi)) or __attribute__((ms_abi)) to make the boundary explicit and let the compiler generate the correct transition code.

If: Writing kernel code or interrupt service routines

→ Use: Disable the red zone with -mno-red-zone. The red zone is a 128-byte area below the stack pointer that leaf functions use without adjusting rsp. A hardware interrupt pushes its frame onto the current stack and will overwrite that area. User-space signal handlers are not affected, because the kernel skips the red zone when it builds the signal frame.

If: Catching codegen-induced undefined behavior

→ Use: Enable -fsanitize=undefined and -fsanitize=address in your debug build. These catch the class of bugs (unaligned accesses, out-of-bounds stack writes) that codegen decisions can introduce without any source-level error.

thecodeforge.io

Code Generation

Intermediate Representation — The Compiler's Lingua Franca

Before code generation runs, the compiler represents your program in an Intermediate Representation. IR sits between the source language and the target machine — it is like an assembly language that has never heard of x86 or ARM. Common IR forms include three-address code (TAC), static single assignment (SSA), and stack-based bytecode (JVM, Python's CPython bytecode).

Why does the compiler need IR at all? Because it lets the hard work of language analysis (parsing, type checking, semantic analysis) happen once. That analysis produces IR, and then every optimization (dead code elimination, constant propagation, loop invariant hoisting) runs on the IR. All of these optimizations benefit every language that targets the IR, and they produce output that every backend can consume. LLVM supports more than 20 target architectures from a single IR. The frontends for C, C++, Rust, Swift, and Julia all emit the same LLVM IR. That is the payoff.

LLVM's IR is SSA-based: every virtual register is assigned exactly once, and every use of a value traces back to exactly one definition. This single property makes dataflow analysis, the foundation of most optimizations, far simpler. If you can see a use of %x, you know exactly one instruction produced %x, with no ambiguity about which definition reaches it. Memory accessed through loads and stores is not in SSA form, so alias analysis is still needed there.
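The single-assignment property can be sketched with a renaming pass over straight-line three-address code. This is a minimal illustration under stated assumptions: the Instr struct and to_ssa function are hypothetical, the "name.version" naming scheme is my own choice, and the code handles only one basic block, so no phi nodes are needed.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One three-address instruction: dst = lhs op rhs (hypothetical representation).
struct Instr {
    std::string dst, lhs, op, rhs;
};

// Rename a single basic block into SSA form: every definition gets a fresh
// version, and every use refers to the latest version defined before it.
std::vector<Instr> to_ssa(const std::vector<Instr>& code) {
    std::map<std::string, int> version;  // latest version number per variable

    auto use = [&](const std::string& name) -> std::string {
        auto it = version.find(name);
        // Names never defined in this block (parameters) keep their name.
        return it == version.end() ? name : name + "." + std::to_string(it->second);
    };

    std::vector<Instr> out;
    for (const Instr& in : code) {
        // Resolve operands first, so 'x = x * c' reads the old x.
        std::string lhs = use(in.lhs);
        std::string rhs = use(in.rhs);
        int v = ++version[in.dst];  // first definition becomes version 1
        out.push_back({in.dst + "." + std::to_string(v), lhs, in.op, rhs});
    }
    return out;
}
```

Feeding it `x = a + b; x = x * c; y = x - a` yields `x.1 = a + b; x.2 = x.1 * c; y.1 = x.2 - a`: each name now has exactly one definition, which is the property the optimizer relies on.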

SSA introduces phi nodes at control flow merge points. A phi node says: 'this value is either A (if we came from block 1) or B (if we came from block 2).' With large switch statements, you can generate tens of thousands of phi nodes. This blows up IR memory and slows down every subsequent pass. The fix is to run simplifycfg early, which merges redundant blocks and eliminates unnecessary phi nodes. If compile times spike after adding a large switch, check your IR stats with opt -stats and look at the phi node count.

GCC uses multiple IR levels rather than one: GENERIC (close to the AST), GIMPLE (statement-level, SSA), and RTL (register transfer language, close to assembly). Each level enables different optimizations. GIMPLE is where most GCC optimizations run. RTL is where the register allocator, instruction scheduler, and peephole optimizer work. If you see a discrepancy between GIMPLE output and the final assembly, check the RTL dumps with -fdump-rtl-all — the transformation happened somewhere in that layer.

A critical detail that custom compiler authors consistently skip: debug metadata. Every IR instruction should carry a source file, line, and column annotation. In LLVM IR, this is the !dbg metadata attached to every instruction. Without it, gdb shows wrong line numbers, crash dumps point to the wrong function, and production post-mortems take three times as long. I once worked with a machine learning compiler that skipped debug metadata to simplify the IR generation code. Six months later, the team spent two weeks debugging a segfault that would have taken two hours with correct line information. Invest in debug metadata from day one.

IR must also be legalized before code generation: operations that the target hardware does not support natively must be expanded or split. A 64-bit multiply on a 32-bit target becomes a sequence of 32-bit operations. A floating-point comparison on a target with no hardware FPU becomes a software library call. Legalization runs inside the code generator before target instructions are chosen: in LLVM's SelectionDAG pipeline, types and operations are legalized first, and instruction selection then pattern-matches the legalized DAG. Bugs here look like wrong results on one target but not another: classic cross-platform correctness issues.
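The 64-bit multiply case can be written out by hand to show what the expansion amounts to. This is a sketch of the arithmetic, not the code any particular compiler emits; the function name mul64_via_32 is hypothetical, and it assumes a platform where unsigned int is 32 bits wide.

```cpp
#include <cassert>
#include <cstdint>

// Compute the low 64 bits of a * b using only 32-bit halves, the way a
// legalizer has to when the target has no native 64-bit multiply.
std::uint64_t mul64_via_32(std::uint64_t a, std::uint64_t b) {
    std::uint32_t a_lo = static_cast<std::uint32_t>(a);
    std::uint32_t a_hi = static_cast<std::uint32_t>(a >> 32);
    std::uint32_t b_lo = static_cast<std::uint32_t>(b);
    std::uint32_t b_hi = static_cast<std::uint32_t>(b >> 32);

    // One widening 32x32 -> 64 multiply (a single instruction on most
    // 32-bit targets, for example umull on ARM).
    std::uint64_t lo_lo = static_cast<std::uint64_t>(a_lo) * b_lo;

    // The cross terms only affect the upper 32 bits of the result, so plain
    // 32-bit multiplies that wrap modulo 2^32 are sufficient.
    std::uint32_t cross = a_lo * b_hi + a_hi * b_lo;

    // a_hi * b_hi would land entirely above bit 63 and is dropped.
    std::uint32_t res_lo = static_cast<std::uint32_t>(lo_lo);
    std::uint32_t res_hi = static_cast<std::uint32_t>(lo_lo >> 32) + cross;
    return (static_cast<std::uint64_t>(res_hi) << 32) | res_lo;
}
```

One source-level multiply has become three multiplies and two adds, which is why a legalization bug shows up only on the target that needs the expansion.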

Finally, monitor IR density. If you are running a custom optimization pipeline, use opt -stats to track instruction counts per function. A function with more than 100,000 IR instructions usually indicates missed optimization — an inliner that fired too aggressively, an unrolling pass that ran without a size limit, or a frontend that failed to clean up dead globals. I have seen a compiler generate IR 5x larger than necessary because the frontend did not remove constant arrays after lowering. The register allocator spilled heavily as a result. One DCE pass before code generation fixed it.

io/thecodeforge/codegen/ssa_ir_example.ll · LLVM

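; LLVM IR demonstrating SSA form and phi nodes
; Source:
;   int max(int a, int b) { return a > b ? a : b; }
;
; Illustrative form, before the optimizer folds the diamond into a select.
; At -O1 and above, clang -emit-llvm -S -o - max.c typically prints a
; select (or an llvm.smax call) instead of the explicit branches and phi.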

define i32 @max(i32 %a, i32 %b) {
entry:
  ; Compare a > b. Result is i1 (1-bit integer — a boolean).
  %cmp = icmp sgt i32 %a, %b

  ; Conditional branch: true goes to 'if.then', false to 'if.else'
  br i1 %cmp, label %if.then, label %if.else

if.then:
  br label %if.end          ; a > b: jump to merge point

if.else:
  br label %if.end          ; a <= b: jump to merge point

if.end:
  ; PHI node: 'result' is %a if we came from if.then, %b if from if.else
  ; This is SSA's way of expressing: result = (cmp ? a : b)
  ; Each virtual register (%a, %b, %cmp, %result) is assigned exactly once.
  %result = phi i32 [ %a, %if.then ], [ %b, %if.else ]
  ret i32 %result
}

; After code generation (x86-64, clang -O2):
;
; max(int, int):
;   cmp   edi, esi        ; compare a and b
;   mov   eax, esi        ; eax = b (tentative result)
;   cmovg eax, edi        ; if a > b, eax = a
;   ret
;
; The phi node became a conditional move (cmov) — no branch at all.
; The code generator recognized the if/phi pattern and emitted
; a branchless sequence. This is instruction selection working well:
; a branch would stall the pipeline on misprediction;
; cmov executes in 1 cycle with no prediction overhead.
;
; The SSA phi at the IR level gave the selector enough information
; to make this decision. Unstructured code (goto, setjmp) makes
; this pattern harder to detect.

Output

; IR instruction count: 6 (icmp + 3 br + phi + ret)

; x86-64 instruction count: 4 (cmp + mov + cmov + ret)

;

; The phi node collapsed into a conditional move.

; No branch, no misprediction penalty.

; This is why SSA IR enables better code generation.

Mental Model

IR as a Contract Between Frontend and Backend

Think of IR as the contract between the team that understands your language and the team that understands your CPU.

Frontend: source language → IR. Any language can target the same IR — C, Rust, Swift, Julia all emit LLVM IR.
Backend: IR → machine code. Any CPU with a backend can consume the IR — x86, ARM, RISC-V, WebAssembly.
Optimizations operate on IR, so every language gets them for free and every target benefits.
SSA form — each variable assigned exactly once — makes dataflow analysis correct by construction. This is why LLVM optimizations are so powerful.
Debug metadata in IR is not optional. Without !dbg annotations on every instruction, your production crash dumps are useless.

📊 Production Insight

A machine learning compiler skipped debug metadata to simplify IR generation.

Six months later, a segfault took two weeks to diagnose that would have taken two hours with correct source location information.

Rule: instrument your IR with debug metadata from day one — it is cheaper than the post-mortem you will pay without it.

Another common issue: large switch statements can generate tens of thousands of phi nodes, blowing up IR memory and slowing passes. Run simplifycfg early to eliminate unnecessary phi nodes — check with opt -stats if compile times spike.

🎯 Key Takeaway

IR decouples the frontend from the backend, enabling multi-language multi-target compilers from one shared optimization pipeline.

SSA form makes dataflow analysis correct by construction — do not design a production IR without it.

Debug metadata is not optional. Measure IR density; anything above 100k instructions per function is a signal of missed optimization.

Watch phi node counts — large switch statements blow up IR and slow passes; run simplifycfg early to mitigate.

Choosing an IR Strategy for a Custom Compiler

If: Building a DSL that needs to target multiple CPU architectures

→ Use: LLVM IR. You get a mature backend for every major CPU, a full optimization pipeline, and debug info support. The learning curve is steep but the infrastructure payoff is enormous.

If: Creating a JIT compiler for a dynamic language where startup time matters

→ Use: Consider a two-tier approach: a simple stack-based IR for the interpreter, with SSA-based IR (LLVM or your own) for the hot-path JIT tier. V8 and HotSpot both use this pattern.

If: Embedded systems with tight memory and compile-time constraints

→ Use: A minimal three-address code IR. Avoid SSA overhead: the phi elimination and SSA destruction passes add compile time and memory that embedded toolchains cannot afford.

If: Research project exploring new optimization algorithms

→ Use: LLVM IR through the pass infrastructure. You get analysis passes, dominance trees, loop information, and code generation for free. Focus your effort on the optimization, not the plumbing.

Register Allocation — Where Performance Is Won or Lost

Register allocation is the most performance-critical and most complex part of code generation. The IR has an unlimited supply of virtual registers. The target CPU has a fixed, small set of physical registers — 16 general-purpose on x86-64, 31 on ARM64. The allocator's job is to map the unlimited to the finite, and when there is not enough room, decide which values to spill to memory and when.

Graph coloring is the classic algorithm: treat each virtual register as a graph node, connect any two nodes that are simultaneously live (interference edges), and color the graph with K colors where K equals the number of physical registers. If two nodes share an edge, they cannot share a color — they cannot share a register. If the graph cannot be K-colored, some nodes must be spilled: their value is written to a stack slot and reloaded when needed. Graph coloring in its general form is NP-hard, so compilers use heuristics — Chaitin's algorithm, Briggs' improvement, and LLVM's greedy allocator are all approximations that work well in practice.
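The coloring idea can be sketched in a few lines. This is a deliberately simple greedy pass, not Chaitin's or Briggs' algorithm: it has no simplify/select phases, no spill-cost heuristics, and no coalescing. The names color_registers and kSpilled are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

constexpr int kSpilled = -1;  // marker for "no register available"

// interferes[v] lists the virtual registers that are live at the same time
// as v. Returns, for each virtual register, a physical register in [0, k)
// or kSpilled.
std::vector<int> color_registers(const std::vector<std::set<int>>& interferes, int k) {
    std::vector<int> reg(interferes.size(), kSpilled);
    for (std::size_t v = 0; v < interferes.size(); ++v) {
        // Collect the colors already taken by neighbors that hold a register.
        std::vector<bool> taken(k, false);
        for (int n : interferes[v]) {
            if (reg[n] != kSpilled) taken[reg[n]] = true;
        }
        // Pick the lowest free color; if none is free, v stays spilled.
        for (int c = 0; c < k; ++c) {
            if (!taken[c]) {
                reg[v] = c;
                break;
            }
        }
    }
    return reg;
}
```

With a triangle of mutually interfering values and only two registers, the third value has no color left and is spilled; raising k to three colors the whole graph. Production allocators differ mainly in how they choose which node to give up on.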

Linear scan is the alternative: instead of building an interference graph, sort live ranges by start point and greedily assign registers as they begin and expire. It is faster to compile but produces worse code than graph coloring. V8's Crankshaft used linear scan; most AOT compilers prefer graph coloring variants. HotSpot's C2 JIT uses a graph-coloring allocator because the compilation time cost is worth the runtime payoff for code that runs for hours.
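For comparison, here is a compact linear scan in the style of the classic formulation: walk the live intervals in start order, free registers whose intervals have ended, and when none is free, spill whichever competing interval ends last. It is a sketch under simplifying assumptions (one interval per value, no live range splitting, no fixed registers); the Interval struct and linear_scan name are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Interval {
    int start;  // first instruction index where the value is live
    int end;    // last instruction index where the value is live
    int reg;    // assigned physical register, or -1 if spilled
};

// Assigns registers 0..k-1 in place. Intervals are reordered by start point.
void linear_scan(std::vector<Interval>& intervals, int k) {
    assert(k > 0);
    std::sort(intervals.begin(), intervals.end(),
              [](const Interval& a, const Interval& b) { return a.start < b.start; });

    std::vector<int> free_regs;
    for (int r = k - 1; r >= 0; --r) free_regs.push_back(r);  // back() is register 0

    std::vector<Interval*> active;  // intervals holding a register, sorted by end
    for (Interval& cur : intervals) {
        cur.reg = -1;
        // Expire intervals that ended before this one starts.
        while (!active.empty() && active.front()->end < cur.start) {
            free_regs.push_back(active.front()->reg);
            active.erase(active.begin());
        }
        if (free_regs.empty()) {
            // No register free: spill the interval that ends furthest away.
            Interval* furthest = active.back();
            if (furthest->end <= cur.end) continue;  // cur itself is spilled
            cur.reg = furthest->reg;
            furthest->reg = -1;
            active.pop_back();
        } else {
            cur.reg = free_regs.back();
            free_regs.pop_back();
        }
        // Keep 'active' sorted by end point.
        auto pos = std::upper_bound(
            active.begin(), active.end(), &cur,
            [](const Interval* a, const Interval* b) { return a->end < b->end; });
        active.insert(pos, &cur);
    }
}
```

There is no interference graph anywhere: the only state is a sorted list of active intervals, which is why the pass is cheap enough for a JIT and also why it cannot see the global picture a coloring allocator has.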

PBQP (Partitioned Boolean Quadratic Programming) formulates allocation as a combinatorial optimization problem and can produce near-optimal results. LLVM has shipped a PBQP allocator as an opt-in alternative to the default greedy allocator (llc -regalloc=pbqp). It stays opt-in because the greedy allocator is good enough that PBQP's compile-time overhead rarely justifies the marginal code quality gain. Understanding PBQP is still valuable for reasoning about what optimal allocation looks like and why the greedy allocator sometimes falls short.

The impact of spilling is not linear. A spilled loop counter generates two memory operations per iteration — one load, one store — replacing what was a single register increment. On a memory-bandwidth-limited workload, that can cause measured slowdowns of 5x to 10x on the hot path. I debugged a performance regression in a high-frequency trading system where a compiler upgrade changed the greedy allocator's split heuristics and caused a loop counter to spill. The counter was updated every iteration. The throughput dropped 40%. The fix was not to change the algorithm — it was to split the loop into a setup phase with high register pressure and a tight computation phase with low pressure. That restructuring dropped the live variable count across the iteration boundary from 14 to 9, which was enough for the allocator to keep the counter in a register.

A rule of thumb that holds on x86-64: if your inner loop has more than 12 live variables simultaneously, you will likely see spills. Use -Rpass-missed=regalloc (LLVM) or -fopt-info-all (GCC) to confirm. The LLVM remarks report how many spills and reloads were generated for each loop and function, together with the source location. That is your starting point for restructuring.

Rematerialization is the allocator's alternative to spilling: instead of storing a value to the stack and reloading it, recompute it from constants or from values that are already in registers. Constants, loop invariants, and values derived only from constants are all candidates. LLVM's greedy allocator performs rematerialization automatically for many patterns. When the remarks report a spill and the spilled value looks cheap to recompute, note it in a bug report upstream: the allocator may be missing a rematerialization opportunity.

Do not underestimate the interaction between inlining and register pressure. Inlining a function that has 8 live variables into a caller that already has 10 can push the combined live count above the spill threshold. If you see a performance regression after enabling -finline-functions, check register pressure before and after with -Rpass-missed=regalloc. Sometimes the right answer is to not inline a function even if it is small, because the register pressure cost exceeds the call overhead savings.

One more: floating-point and SIMD register allocation is separate from integer allocation on x86-64. There are 16 XMM registers (32 with AVX-512) and the legacy x87 stack. If you mix x87 and SSE code paths, which happens when an old library uses x87 and your code uses SSE, the allocator manages two distinct register files and can produce unexpected spills at the boundary. Force SSE with -mfpmath=sse to unify the model; this matters mostly for 32-bit x86 builds, since x86-64 compilers already default to SSE for float and double. I have seen a 2x slowdown in a physics simulation that traced to x87/SSE boundary spills. One compiler flag fixed it.

io/thecodeforge/codegen/spill_demo.cpp · CPP

// Demonstrating how register pressure causes spills.
// Compile both versions with: clang++ -O2 -Rpass-missed=regalloc -S
// and compare the generated assembly.

// ── Version 1: Higher register pressure, more likely to spill ────────────────
// This loop keeps five pointers, the index, the bound, and several float
// temporaries live across each iteration on x86-64.
// With 16 general-purpose registers and some reserved for the ABI,
// the allocator can run out of room and spill to the stack. Whether it does
// depends on the compiler version and its unrolling and vectorization choices.
void high_pressure(float* __restrict__ out, const float* a, const float* b,
                   const float* c, const float* d, int n) {
    for (int i = 0; i < n; ++i) {
        // i, out, a, b, c, d, n are all live here — plus temporaries.
        float t0 = a[i] * b[i];
        float t1 = c[i] + d[i];
        float t2 = t0 - t1;
        float t3 = t2 * t2;
        float t4 = t3 + a[i];   // a[i] loaded again — or was it spilled?
        float t5 = t4 * c[i];   // c[i] loaded again — or was it spilled?
        out[i] = t5;
        // Live at end of iteration body: i, out, a, b, c, d, n
        // Plus whichever of t0..t5 the allocator kept live for next iteration
    }
}

// ── Version 2: Reduced register pressure — no spills ─────────────────────────
// Split the computation into two passes.
// Each pass has fewer live variables; the allocator keeps everything in registers.
void low_pressure(float* __restrict__ out, const float* a, const float* b,
                  const float* c, const float* d, int n) {
    // Pass 1: compute intermediate — only a, b, c, d, out, tmp live at once
    for (int i = 0; i < n; ++i) {
        float t = a[i] * b[i] - (c[i] + d[i]);
        out[i] = t * t;   // store partial result to out[]
    }
    // Pass 2: finish — only a, c, out, n, i live
    for (int i = 0; i < n; ++i) {
        out[i] = (out[i] + a[i]) * c[i];
    }
}

// ── How to measure the difference ────────────────────────────────────────────
// clang++ -O2 -Rpass-missed=regalloc -S -o high.s spill_demo.cpp
// grep -ciE 'spill|reload' high.s   ← count spill and reload annotations
//
// For runtime comparison:
// perf stat -e cycles,instructions,cache-misses ./benchmark
//
// Illustrative figures from one x86-64 run with n=10^7 (yours will differ):
//   high_pressure: ~45ms, ~8 spills in the loop body
//   low_pressure:  ~28ms, 0 spills, auto-vectorized by the compiler

Output

// clang++ -O2 -Rpass-missed=regalloc -S, high_pressure (illustrative shape; exact remark wording varies by LLVM version):

// remark: spill_demo.cpp:<line>:<col>: <N> spills <M> reloads generated in loop

// Generated assembly for the loop body (simplified):

// movss xmm0, [rsi + rax*4] ; load a[i]

// movss [rsp+8], xmm0 ; SPILL a[i] to stack because the allocator ran out

// movss xmm1, [rdx + rax*4] ; load b[i]

// mulss xmm0, xmm1

// movss xmm1, [rcx + rax*4] ; load c[i]

// movss xmm2, [r8 + rax*4] ; load d[i]

// addss xmm1, xmm2

// subss xmm0, xmm1

// mulss xmm0, xmm0

// movss xmm1, [rsp+8] ; RELOAD a[i] from stack

// addss xmm0, xmm1

// low_pressure loop body (simplified, no spills):

// movss xmm0, [rsi + rax*4] ; load a[i]

// movss xmm1, [rdx + rax*4] ; load b[i]

// mulss xmm0, xmm1

// movss xmm1, [rcx + rax*4] ; load c[i]

// movss xmm2, [r8 + rax*4] ; load d[i]

// addss xmm1, xmm2

// subss xmm0, xmm1

// mulss xmm0, xmm0

// movss [rdi + rax*4], xmm0 ; store result to out[i], no reload needed

⚠ A Spilled Loop Counter Can Cost More Than You Expect

A loop counter that spills to the stack generates a load and a store every iteration instead of a register increment. On a tight loop running 10 million iterations, that adds 20 million memory operations. If the hot data does not fit in L1 cache, each of those operations stalls the pipeline. In memory-bandwidth-limited workloads, a single spilled counter has caused measured slowdowns of 5x to 10x. This is not a theoretical concern — it shows up in perf annotate as the loop's highest-latency instruction being a stack load. When you see that, the fix is code restructuring to reduce live variable count, not algorithmic change.

📊 Production Insight

A compiler upgrade changed LLVM's greedy allocator split heuristics. A loop counter in a high-frequency trading hot path spilled to the stack. Throughput dropped 40%.

The fix was splitting the loop into two phases — setup (high pressure) and compute (low pressure) — reducing live variables at the iteration boundary from 14 to 9.

No algorithm change. No flag change. Just code structure that gave the allocator enough room to do its job.

Another example: a physics simulation slowed 2x due to x87/SSE boundary spills. Forcing SSE with -mfpmath=sse fixed it with one flag.

🎯 Key Takeaway

If your inner loop has more than 12 live variables on x86-64, expect spills; use -Rpass-missed=regalloc to confirm.

The fix is almost always code restructuring to reduce live range overlap, not compiler flags.

Force -mfpmath=sse to avoid x87/SSE boundary spills in mixed floating-point code.

Rematerialization is the allocator's preferred alternative to spilling — report missed opportunities upstream.
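To make the live-variable rule of thumb concrete, here is a minimal sketch that computes peak register pressure from a set of live intervals. The function name `max_pressure`, the interval format, and the sample values are assumptions made for illustration; they are not compiler output.

```python
def max_pressure(intervals):
    """Return the peak number of simultaneously live values.

    intervals maps a value name to an inclusive (start, end) pair of
    instruction indices. This format is an assumption for illustration.
    """
    events = []
    for start, end in intervals.values():
        events.append((start, 1))       # value becomes live
        events.append((end + 1, -1))    # value is dead after its last use
    live = peak = 0
    # Sorting places a -1 event before a +1 event at the same index,
    # so a value that dies at i does not overlap one born at i + 1.
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return peak

# Hypothetical loop body: at most three values are live at once.
loop_values = {"i": (0, 9), "acc": (1, 9), "tmp": (3, 5), "addr": (6, 8)}
print(max_pressure(loop_values))  # 3
```

When the peak exceeds the number of allocatable registers (x86-64 has 16 general-purpose registers, and the stack pointer is never available), at least one value has to spill or be rematerialized.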

Choosing a Register Allocator Strategy

If: JIT compiler where compilation latency affects user-visible warmup (e.g., V8, early HotSpot tiers)

→

Use: linear scan. It is suboptimal but compiles in linear time. Code quality improves when the JIT recompiles hot functions at a higher tier with graph coloring.

If: AOT compilation for scientific or numerical computing with large loops

→

Use: a global allocator with live range splitting, such as graph coloring or LLVM's greedy allocator. The compile-time investment pays off at runtime, especially for loops that run billions of iterations.

If: GPU compute kernels (CUDA, OpenCL, Metal)

→

Use: SIMT-aware register allocation. The GPU register file is divided among all resident threads, so higher per-thread register use lowers occupancy, and a spill is executed by every thread in the warp at once, which multiplies memory bandwidth consumption.

If: Real-time or safety-critical system requiring deterministic compilation

→

Use: a simple local allocator with predictable, bounded spill decisions. Avoid heuristic-heavy allocators whose decisions can change between compiler versions.

If: Debugging a performance regression suspected to be caused by increased spilling

→

Use: a side-by-side comparison of -O1 and -O2 assembly output for the hot function. -O1 generally runs fewer pressure-raising transformations (inlining, unrolling, vectorization), so spills that disappear at -O1 point to an optimization interaction rather than the source itself. Confirm with -Rpass-missed=regalloc at -O2 and restructure the loop to reduce live variable count at the iteration boundary.


Instruction Selection — Picking the Right Instruction for the Job

Instruction selection is the phase that maps IR operations to actual CPU instructions. A single IR add operation could become an x86 add, a lea, a fused multiply-add, or a shift-and-add depending on the operands, the surrounding context, and the target micro-architecture. The selector's job is to find the cheapest instruction sequence that correctly implements each IR operation.

Tree-based pattern matching is the standard approach: represent a basic block's IR as a tree of operations, then match subtrees against instruction patterns defined in the target description. Each pattern has an associated cost. Dynamic programming finds the minimum-cost cover of the entire tree — this is where the analogy to tiling a floor with shaped tiles comes from. LLVM formalizes this in TableGen: you write instruction definitions and patterns in .td files, and TableGen generates C++ code for the selector. Adding a new instruction to an LLVM backend is editing a configuration file, not writing a thousand lines of selector code.
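The tiling idea can be sketched in a few lines. The example below is a toy selector, not LLVM's: the tuple encoding of the tree, the three patterns, and the costs (1 for add, mov, and lea, 3 for imul) are all assumptions chosen to show how a larger tile can beat a cover built from single-node tiles.

```python
# Expression trees are nested tuples (an assumed encoding for this sketch):
#   ('reg', name) | ('const', value) | ('add', lhs, rhs) | ('mul', lhs, rhs)
# The costs are illustrative, not taken from any real cost model.

def select(node):
    """Return (cost, instructions) for the cheapest cover of node."""
    kind = node[0]
    if kind == 'reg':
        return 0, []                        # already in a register
    if kind == 'const':
        return 1, [f"mov {node[1]}"]        # materialize an immediate
    if kind not in ('add', 'mul'):
        raise ValueError(f"no pattern for {kind}")

    lhs_cost, lhs_code = select(node[1])
    rhs_cost, rhs_code = select(node[2])
    if kind == 'mul':
        return lhs_cost + rhs_cost + 3, lhs_code + rhs_code + ['imul']

    # kind == 'add': start with the one-node tile.
    best = (lhs_cost + rhs_cost + 1, lhs_code + rhs_code + ['add'])
    # Larger tile: add(x, mul(y, 2|4|8)) is covered by a single lea,
    # which swallows the multiply and the constant.
    rhs = node[2]
    if rhs[0] == 'mul' and rhs[2][0] == 'const' and rhs[2][1] in (2, 4, 8):
        idx_cost, idx_code = select(rhs[1])
        lea = (lhs_cost + idx_cost + 1,
               lhs_code + idx_code + [f"lea scale={rhs[2][1]}"])
        best = min(best, lea, key=lambda c: c[0])
    return best

tree = ('add', ('reg', 'a'), ('mul', ('reg', 'b'), ('const', 4)))
print(select(tree))  # (1, ['lea scale=4'])
```

Covering the same tree with single-node tiles costs 5 (mov, imul, add); the lea tile covers three nodes for a cost of 1. A real selector works the same way over hundreds of generated patterns.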

LLVM's SelectionDAG generalizes tree matching to a directed acyclic graph, which handles operations that produce multiple results (like a division that yields both quotient and remainder) and operations that have side effects (like stores). SelectionDAG also performs legalization: operations the target cannot execute natively are expanded or split during selection. A 64-bit integer divide on a 32-bit ARM target becomes a software library call; a vector type wider than the hardware supports is split into two narrower operations. If a correctness bug only appears on one target, legalization is the first place to look.
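As an illustration of what "expanded or split" means, the sketch below performs a 64-bit add using only 32-bit operations, the way a legalizer has to on a 32-bit target. It models the arithmetic only; the function name and the comparison to ARM's adds/adc pair are assumptions for illustration, not LLVM code.

```python
MASK32 = 0xFFFFFFFF

def add64_on_32bit(a, b):
    """Add two 64-bit values using only 32-bit operations."""
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 32                  # 1 if the low half overflowed
    lo = lo_sum & MASK32                  # low half, like ARM 'adds'
    hi = (a_hi + b_hi + carry) & MASK32   # high half plus carry, like 'adc'
    return (hi << 32) | lo

x, y = 0x00000001_FFFFFFFF, 0x00000000_00000001
print(hex(add64_on_32bit(x, y)))  # 0x200000000
```

One IR operation became two machine operations linked by a carry. A bug in that link (a dropped carry, swapped halves) only shows up on the target that needs the expansion, which is why legalization is the first suspect for single-target miscompiles.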

LLVM is gradually moving from SelectionDAG to GlobalISel, a newer instruction selection infrastructure that operates directly on Machine IR (MIR) instead of building a separate DAG. AArch64 is the most mature GlobalISel target and has used it by default for unoptimized (-O0) builds for many releases. For optimized builds, and for x86, SelectionDAG is still the usual default, and x86 GlobalISel coverage is less complete. Check the release notes for your LLVM version before relying on either. If you are building a new LLVM backend today, GlobalISel is worth evaluating first: the infrastructure is cleaner and the long-term maintenance burden is lower.

Cost models are where instruction selection gets subtle. On Intel Skylake, simple add executes on ports 0, 1, 5, and 6 (four execution ports). Simple lea (one or two address components, with or without a scale factor) executes on ports 1 and 5 (two ports). The selector prefers add for simple increments because it has higher port availability and thus better throughput under superscalar execution. Complex lea (three components: base, index, and displacement) executes only on port 1 and has 3-cycle latency on Skylake; avoid it in loops unless you genuinely need the address computation. These details live in Agner Fog's instruction tables and the vendor optimization manuals; the compiler's cost model approximates them, but the approximation is sometimes wrong for your specific CPU.

The inc instruction is the classic example of instruction selection getting burned by micro-architecture details. inc writes every arithmetic flag except the carry flag. On the Pentium 4 and some early Core processors, that partial update has to be merged with the previous flag state, which creates a false dependency on whichever earlier instruction last wrote the flags and stalls the pipeline. add $1, reg writes all of the flags and does not have this dependency. The cost model in the compiler for those micro-architectures correctly prefers add over inc. If you are targeting a CPU that predates reliable cost model support, test with -mtune set explicitly to your CPU family.

Constant materialization is another instruction selection challenge. On x86, a 32-bit immediate fits directly in the instruction encoding. A 64-bit immediate requires movabs — a longer encoding. If a large constant appears many times in a hot loop, the selector may choose to load it from a literal pool in memory rather than rematerialize it each time. Whether that is correct depends on cache pressure and the number of uses. If you see a hot path loading a constant from memory in a tight loop, move the constant to a local variable assigned before the loop — most compilers will then rematerialize it in a register inside the loop.
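The encoding choice described above can be written as a small classifier. This is a sketch of the x86-64 immediate rules as summarized here: the function name and the returned labels are hypothetical, and a real selector also weighs alternatives such as xor for zero.

```python
def classify_immediate(value):
    """Pick an x86-64 encoding class for loading a 64-bit constant."""
    if not -(1 << 63) <= value < (1 << 64):
        raise ValueError("does not fit in 64 bits")
    if value >= 1 << 63:
        value -= 1 << 64                 # view the bit pattern as signed
    if 0 <= value < (1 << 32):
        return "mov r32, imm32"          # upper 32 bits are zeroed implicitly
    if -(1 << 31) <= value < 0:
        return "mov r64, imm32"          # 32-bit immediate, sign-extended
    return "movabs r64, imm64"           # full 8-byte immediate

for constant in (42, -1, 0xFFFFFFFF, 1 << 32):
    print(hex(constant), "->", classify_immediate(constant))
```

Only the last class needs the long movabs encoding, so it is the one worth hoisting out of a hot loop.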

Intrinsics are the escape hatch when the selector makes the wrong choice. If profiling confirms that the compiler is not emitting an FMA instruction for a multiply-accumulate pattern, use _mm_fmadd_ps explicitly rather than hoping the optimizer will find it. Intrinsics sacrifice portability but give you precise control over the emitted instruction. The rule is: measure first, and use intrinsics only when profiling shows the generic selection is wrong.

io/thecodeforge/codegen/instruction_select.c (C)

// Instruction selection in action: same semantic operation, different cost
// Compile with: clang -O2 -S -fverbose-asm -o - select.c
// and observe which instruction the selector chose for each case.

#include <immintrin.h>   // for intrinsics
#include <stdint.h>

// ── Case 1: Simple multiply by 2 ─────────────────────────────────────────────
// The selector should emit 'lea eax, [rdi + rdi]' (or a mov followed by 'add eax, eax'),
// NOT 'imul eax, edi, 2': imul has higher latency (3 cycles vs 1 cycle)
int double_it(int x) {
    return x * 2;
}
// Expected output:
//   lea  eax, [rdi + rdi]    ; 1 cycle, runs on ports 1 and 5
// or:
//   mov  eax, edi
//   add  eax, eax            ; add runs on ports 0, 1, 5, 6, but needs the extra mov
// imul would be correct but slower: the selector knows that multiplying by a
// power of two is cheaper as a shift, add, or lea

// ── Case 2: Multiply by a non-power-of-2 constant ────────────────────────────
// The selector must decide: imul or a sequence of adds/shifts?
// For x * 7: 7 = 8 - 1, so x * 7 = (x << 3) - x
//   Option A: lea eax, [rdi*8]; sub eax, edi   (lea cannot subtract a register,
//             so the subtraction is a separate instruction)
//   Option B: lea eax, [rdi + rdi*2]; lea eax, [rdi + rax*2]   (3x, then x + 2*3x)
//   Option C: imul eax, edi, 7                 (3-cycle latency, single instruction)
// The cost model chooses based on latency, port pressure, and code size.
int multiply_by_7(int x) {
    return x * 7;
}
// Recent GCC and Clang typically emit Option A for x * 7 and fall back to imul
// for constants with no cheap decomposition. Check the output of your own
// compiler version and -mtune setting rather than assuming either.

// ── Case 3: Forced FMA via intrinsic ─────────────────────────────────────────
// The compiler might not fuse a*b + c into an FMA on its own; that depends on
// the floating-point contraction setting and the enabled target features.
// Use an intrinsic to guarantee the instruction when profiling proves it matters.
__m128 fused_multiply_add(__m128 a, __m128 b, __m128 c) {
    // Without intrinsic: compiler might emit mulps + addps (2 instructions, 2 latencies)
    // With intrinsic: one FMA instruction (the compiler picks the operand-order form)
    return _mm_fmadd_ps(a, b, c);   // a*b + c in one instruction
}
// Expected output:
//   vfmadd213ps xmm0, xmm1, xmm2
// Requires: -mfma, or a -march that includes it such as -march=haswell

// ── Case 4: The inc trap: false flag dependency ───────────────────────────────
// 'inc' writes every arithmetic flag except CF. On Pentium 4 / early Core that
// partial update must merge with the old flags, creating a false dependency on
// the previous flag-writing instruction.
// Tuning models that mark inc/dec as slow make the selector prefer 'add $1'.
void count_up(long* counter) {
    (*counter)++;   // emits 'addq $1, (%rdi)' or 'incq (%rdi)' depending on the
                    // compiler version and -mtune model; inc is 1 byte shorter,
                    // so models without the penalty may prefer it.
}
// If your toolchain emits 'inc' and you see pipeline stalls on an affected CPU,
// try -mtune=core2 or -mtune=generic and re-check the assembly.

// ── How to inspect selector decisions ────────────────────────────────────────
// clang -O2 -S -fverbose-asm -o select.s select.c
// cat select.s    (the assembly carries extra explanatory comments)
//
// For deeper inspection:
// clang -O2 -S -mllvm -print-isel-input select.c   (LLVM IR as it enters selection)
// clang -O2 -S -mllvm -print-after-all select.c    (IR and MIR after every pass; very verbose)

Output

// clang -O2 -S output for double_it (x86-64, Skylake):

// double_it:

// lea eax, [rdi + rdi] ; selector chose lea over imul — correct

// ret

// multiply_by_7:

// lea eax, [8*rdi] ; typical for recent GCC and Clang: x * 8 via lea

// sub eax, edi ; minus x gives x * 7

// ret ; imul eax, edi, 7 is the single-instruction alternative (3-cycle latency)

// fused_multiply_add:

// vfmadd213ps xmm0, xmm1, xmm2 ; guaranteed by intrinsic — exactly one instruction

// ret

// count_up:

// addq $1, (%rdi) ; emitted when the tuning model penalizes inc; otherwise expect incq (%rdi)

// ret ; 'add' avoids the false carry-flag dependency on Pentium 4 / early Core

Mental Model

Instruction Selection as Minimum-Cost Tiling

Think of instruction selection as tiling a floor where each tile covers a subtree of IR operations and has a cost measured in cycles.

Each CPU instruction is a tile: it covers a pattern of IR operations (add, load, multiply) and has an associated cost (latency, throughput, code size).
Optimal tiling finds the minimum-cost cover for the entire IR tree — dynamic programming solves this in O(n) for trees.
Modern selection uses a DAG (not just a tree) to handle shared subexpressions and multi-output operations like divmod.
Cost models are micro-architecture-specific. What is cheapest on Skylake may be suboptimal on Zen 4. Use -mtune=native to match the model to your hardware.
When the selector gets it wrong (and it will), intrinsics are your override — but only use them after profiling confirms the generic selection is provably suboptimal.

📊 Production Insight

A hot loop ran 40% slower on a Pentium 4-era CPU after a compiler upgrade. The new selector started emitting 'inc' for loop counters instead of 'add $1'. On that micro-architecture, 'inc' creates a false dependency on the carry flag, stalling the pipeline.

The fix: -mtune=core2 corrected the cost model and switched the selector back to 'add'.

Rule: when you see unexpected performance regression after a compiler upgrade, compare the generated assembly instruction by instruction before assuming anything about your algorithm.

Another case: constant materialization in a hot loop caused redundant memory loads. Moving the constant to a local variable before the loop let the compiler rematerialize it in a register.

🎯 Key Takeaway

Instruction selection is a minimum-cost tiling problem over the IR tree — the selector picks instructions, not you.

Cost models are CPU-specific approximations. Use -mtune=native to match the model to your hardware.

Use intrinsics only when profiling proves the generic selection is wrong.

Diff the assembly output before and after a compiler upgrade — a one-instruction change in a hot loop is significant.

Constant materialization in loops is a common performance pitfall — move constants to local variables to enable rematerialization.

When to Override Instruction Selection

If: Hot loop shows pipeline stalls and the loop counter uses 'inc' on an older Intel CPU

→

Use: -mtune=native or -mtune=generic. This corrects the cost model and causes the selector to prefer 'add $1' over 'inc', avoiding the false carry-flag dependency.

If: You need to guarantee a specific SIMD instruction (FMA, gather, scatter) that auto-vectorization is not emitting

→

Use: compiler intrinsics (_mm_fmadd_ps, _mm256_i32gather_ps). Do not rely on the auto-vectorizer for correctness-critical or performance-critical SIMD paths; verify with objdump.

If: Cross-compiling for multiple CPU generations with different capability sets

→

Use: -march=<minimum_supported_generation>, not -march=native, and profile on each target class. Use CPU dispatch (__attribute__((target))) only for functions where the performance difference is measured and significant.

If: Binary size is constrained (firmware, bootloader, WASM module)

→

Use: -Os to bias the selector toward smaller encodings. Be aware that -Os can increase latency; measure runtime on the target, not just binary size.

If: Debugging a performance regression after a compiler upgrade

→

Use: -fverbose-asm on both old and new compiler output, then diff the .s files for the hot function. Look for changes in instruction choice (lea vs add, imul vs shift sequence, inc vs add). A one-instruction change in a loop body can produce a multi-percent runtime difference.

Peephole Optimization — The Final Polish

Peephole optimization is the last cleanup pass in code generation. It examines a small window of consecutive instructions — typically two to five — and replaces sequences with equivalent but cheaper or shorter alternatives. It catches patterns that earlier passes left behind: a register initialized and never read, two adjacent stack adjustments that could be merged, a conditional jump to the very next instruction that could become a fall-through.

GCC has two peephole mechanisms: -fpeephole applies the target's define_peephole patterns while the final assembly is being emitted, and -fpeephole2 applies RTL-to-RTL define_peephole2 patterns in a pass that runs after register allocation. LLVM has a PeepholeOptimizer pass that runs early in the machine code pipeline and a separate DeadMachineInstructionElim pass to remove the dead instructions it identifies. Both compilers also run a later scheduling pass that can surface additional peephole opportunities.

The most impactful peephole patterns in practice are copy propagation and dead copy elimination. Copy propagation rewrites uses of a copy destination to use the copy source directly — this is what turns:

; Before: two instructions, indirect read
mov eax, ebx
mov ecx, eax   ; reads eax, but we could read ebx directly
; After copy propagation: the copy source is used directly
mov eax, ebx
mov ecx, ebx   ; ecx now reads from ebx — eax copy is now dead

Once the second mov reads from ebx directly, the first mov may become dead — nothing reads eax anymore. Dead copy elimination then removes it entirely:

; After dead copy elimination: one instruction
mov ecx, ebx

These two patterns together are the most common reason peephole produces visible code size reduction. They do not change semantics — they just remove indirection that earlier passes introduced.
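The two steps above can be sketched as a pair of passes over a list of instructions. This is a toy model under stated assumptions: instructions are (op, dst, src) tuples, the propagation window is two adjacent instructions, and the caller supplies the set of registers that are live on exit. Both function names are hypothetical.

```python
def propagate_copies(instrs):
    """Rewrite 'mov b, a; mov c, b' so the second mov reads a directly."""
    out = []
    for op, dst, src in instrs:
        if out and op == 'mov':
            prev_op, prev_dst, prev_src = out[-1]
            if prev_op == 'mov' and src == prev_dst:
                src = prev_src            # read the copy source instead
        out.append((op, dst, src))
    return out

def eliminate_dead_copies(instrs, live_out):
    """Drop every mov whose destination is never read afterwards."""
    live = set(live_out)
    kept = []
    for op, dst, src in reversed(instrs):     # backward liveness walk
        if op == 'mov':
            if dst not in live:
                continue                      # dead copy: nothing reads dst
            live.discard(dst)
            live.add(src)
        else:
            live.update((dst, src))           # be conservative for other ops
        kept.append((op, dst, src))
    kept.reverse()
    return kept

before = [('mov', 'eax', 'ebx'), ('mov', 'ecx', 'eax')]
after = eliminate_dead_copies(propagate_copies(before), live_out={'ecx'})
print(after)  # [('mov', 'ecx', 'ebx')]
```

If eax were also live on exit, the first mov would survive. That liveness check is what keeps the transformation safe.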

Strength reduction is another peephole domain. On targets where shift is cheaper than multiply, the peephole pass may replace a multiply by a power of two with a shift. On embedded targets without a hardware multiplier, where a multiply becomes a call into a software routine, the result is identical but the performance difference is large. The key point is that the cost model must be accurate for the target; a strength reduction that helps on ARM Cortex-M0 may be neutral on x86.
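A minimal sketch of the multiply-to-shift rewrite, assuming instructions are (op, register, constant) tuples. The function name and the instruction encoding are hypothetical, and a real pass would consult the target cost model before rewriting.

```python
def reduce_multiply(instr):
    """Replace ('mul', reg, const) with a shift when const is a power of two."""
    op, reg, const = instr
    is_power_of_two = isinstance(const, int) and const > 0 and const & (const - 1) == 0
    if op == 'mul' and is_power_of_two:
        return ('shl', reg, const.bit_length() - 1)   # shift amount = log2(const)
    return instr

print(reduce_multiply(('mul', 'eax', 8)))   # ('shl', 'eax', 3)
print(reduce_multiply(('mul', 'eax', 7)))   # unchanged: 7 is not a power of two
```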

Performance interacts with peephole in non-obvious ways. Removing an instruction saves code size and reduces decode pressure, but it can also change the dependency chain visible to the out-of-order engine. I once saw a case where removing a redundant mov eliminated a dependency break that the CPU's rename unit was using to parallelise two chains. The net result: one fewer instruction, but two cycles slower per iteration. The lesson is to always measure with perf stat after enabling or disabling peephole passes — saved instructions do not automatically mean saved cycles.

If you are debugging a regression that appears at -O2 but not at -O1, test with -fno-peephole2 (GCC) to isolate the second peephole pass. In LLVM/Clang, there is no single stable flag to disable the peephole pass in isolation. The practical approach is to compare -O1 and -O2 assembly output for the hot function and look for the specific transformation that changed. Use -fdump-rtl-all (GCC) or -mllvm -print-after-all (Clang) to see the instruction stream at each stage of the machine code pipeline.

When writing a custom backend, implement peephole optimizations incrementally and measure each one. Start with the ten patterns that appear most frequently in your IR — dead copy elimination, redundant compare elimination, and branch-to-next-instruction removal are almost always in the top five. Each pattern is cheap to implement; the discipline is measuring before adding the next one. An unmeasured peephole pass is a maintenance liability.

io/thecodeforge/codegen/peephole_demo.c (C)

// Peephole optimization: from two instructions to one.
// Compile with: gcc -O2 -fverbose-asm -S -o - peephole_demo.c
// and observe the eliminated copies.

// ── Example 1: Copy propagation followed by dead copy elimination ─────────────
// Source pattern the compiler generates internally during register assignment:
//   mov eax, ebx      ; eax = ebx (copy introduced by register allocator)
//   mov ecx, eax      ; ecx = eax (uses the copy)
//
// Peephole step 1 — copy propagation:
//   mov eax, ebx
//   mov ecx, ebx      ; rewrote 'eax' use to 'ebx' — copy source used directly
//
// Peephole step 2 — dead copy elimination:
//   mov ecx, ebx      ; eax was never read after step 1 — its definition is dead
//
// Net result: 2 instructions → 1 instruction. Same semantics.

int copy_prop_example(int b) {
    int a = b;      // copy introduced at source level
    int c = a;      // use of the copy
    return c;       // only c is returned
    // After optimization the compiler sees the chain and returns b directly.
    // Generated: mov eax, edi; ret  (nothing at all remains once the call is inlined)
}

// ── Example 2: Redundant comparison elimination ───────────────────────────────
// Pattern:
//   test eax, eax     ; sets ZF based on eax
//   cmp  eax, 0       ; ALSO sets ZF based on eax — redundant
//   je   label
// After peephole:
//   test eax, eax
//   je   label        ; cmp eliminated — test already set ZF
int redundant_cmp(int x) {
    if (x == 0) return 1;   // compiler generates test+cmp — peephole removes cmp
    return 0;
}

// ── Example 3: Branch to next instruction elimination ────────────────────────
// The code generator sometimes emits:
//   jmp  .L1          ; unconditional jump
// .L1:                ; to the very next instruction
//   mov eax, 1
// After peephole: the jmp is removed — execution falls through naturally.
int branch_to_next(int x) {
    int result;
    if (x > 0) {
        result = 1;
    } else {
        result = 1;   // same result either way — compiler may generate a jmp to merge point
    }
    return result;
    // After optimization: just 'mov eax, 1; ret'
}

// ── How to observe peephole effect ───────────────────────────────────────────
// Compare with and without -fno-peephole2:
//   gcc -O2 -S -o with_peephole.s peephole_demo.c
//   gcc -O2 -fno-peephole2 -S -o without_peephole.s peephole_demo.c
//   diff with_peephole.s without_peephole.s
//
// Instruction count difference shows the peephole's contribution.
// Runtime difference is typically 1–5% on code with many small functions.

Output

// gcc -O2 output for copy_prop_example:

// copy_prop_example:

// mov eax, edi ; b comes in as edi (System V AMD64 ABI)

// ret ; peephole eliminated all intermediate copies

// ; 3 source-level copies became 1 mov

// redundant_cmp:

// test edi, edi ; check if x == 0

// sete al ; al = 1 if zero, 0 otherwise

// movzx eax, al

// ret ; cmp eax,0 was eliminated — test already set the flag

// branch_to_next:

// mov eax, 1 ; peephole saw both branches produce 1

// ret ; jmp to merge point eliminated entirely

// With and without -fno-peephole2 (gcc -O2):

// For functions this small, expect identical output: the copies are removed by

// earlier SSA-level passes long before peephole2 runs.

// Differences from -fno-peephole2 show up in larger functions, after register allocation.

🔥Fewer Instructions Does Not Always Mean Fewer Cycles

Peephole optimizations reduce instruction count, but instruction count and cycle count are not the same thing. Removing a redundant mov can eliminate a dependency break that the CPU's register renaming unit was using to run two instruction chains in parallel. The net result: one fewer instruction decoded, but the pipeline stalls where it previously did not. Always measure with perf stat -e cycles,instructions after a peephole change — if the IPC (instructions per cycle) drops as instruction count drops, the peephole removed something the out-of-order engine was exploiting.
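The check described above is simple arithmetic on two perf stat runs. The counter values below are invented for illustration; substitute the instructions and cycles figures from your own before and after measurements.

```python
def ipc(instructions, cycles):
    """Instructions retired per cycle."""
    return instructions / cycles

# Hypothetical perf stat counters before and after a peephole change.
before = {"instructions": 1_000_000_000, "cycles": 400_000_000}
after = {"instructions": 960_000_000, "cycles": 408_000_000}

ipc_before = ipc(**before)
ipc_after = ipc(**after)
regressed = after["cycles"] > before["cycles"]
print(f"IPC {ipc_before:.2f} -> {ipc_after:.2f}, slower: {regressed}")
```

In this made-up run the instruction count fell by 4% while cycles rose by 2%, so IPC dropped and the change is a net loss despite the smaller code.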

📊 Production Insight

A peephole pass eliminated a redundant mov in a hot decode loop. Instruction count dropped by 4%. Runtime increased by 2%.

The removed mov was providing a register rename break between two instruction chains that the out-of-order engine was running in parallel. Without it, the chains serialized.

The fix: disable -fpeephole2 for that compilation unit. Net outcome: 4% more instructions, 2% faster execution.

Rule: always benchmark peephole changes with perf stat. Instruction count is a proxy, not a truth.

Another case: branch-to-next elimination was too aggressive in a JIT and removed a jump that had been acting as alignment padding for the code that followed. The net effect was a higher I-cache miss rate, verified by comparing perf stat before and after.

🎯 Key Takeaway

Peephole optimization removes the redundant copies and dead branches that earlier passes leave behind — it is the final pass, and it matters.

Copy propagation followed by dead copy elimination is the highest-impact pattern: a copy and its use collapse into a single instruction.

Always measure runtime, not just instruction count — a removed instruction can cost cycles if it was providing a dependency break the out-of-order engine was exploiting.

Benchmark peephole changes with perf stat; if IPC drops, the transformation is damaging throughput despite saving code size.

When to Investigate Peephole Settings

If: Performance regression appears at -O2 but not at -O1

→

Use: -O2 -fno-peephole2 (GCC) as a test. If the regression disappears, a peephole pattern is the cause. Use -fdump-rtl-peephole2 to see exactly which transformation fired. Report upstream with a minimal reproducer.

If: Binary size is larger than expected after optimization

→

Use: size <binary> to compare section sizes and objdump -h to look for .text growth. Peephole normally reduces code size. If size increased, a different pass (inlining, unrolling) dominated and peephole did not compensate.

If: Building a custom backend and want to add peephole patterns

→

Use: the five highest-frequency patterns from your assembly output as a starting point: dead copy elimination, redundant compare removal, branch-to-next elimination, redundant zero-extension, and move coalescing. Measure each pattern's contribution before adding the next.

If: Embedded target with tight I-cache constraints

→

Use: both -fpeephole and -fpeephole2 (they are active by default at -Os). Peephole is one of the highest-value optimizations for -Os targets. Verify with size <binary> that .text is shrinking as expected.

Killing Redundant Work: Why Your Code Generator Must Prune Dead Instructions

Most code generators produce bloated output because they translate blindly. They emit every intermediate instruction, including the useless ones. Dead-code elimination and redundant-instruction suppression aren't optimizations; they are fundamental responsibilities of a correct code generator. Consider a DAG (Directed Acyclic Graph) representing the basic block t0 = a + b; t1 = t0 + c; d = t0 + t1. A naive generator that expands each use of t0 separately evaluates a + b more than once. A DAG-based generator recognizes that t1 and d share the subexpression t0, keeps a single node for it, and emits exactly three add instructions. Unreachable code (branches that never fire, assignments to variables never read) should never survive the generation pass. Modern compilers like LLVM perform this at the IR level, but if you are building a simple code generator, you must implement your own dead-code sweep. The DAG structure is your best weapon: leaf identifiers, interior operators, explicit data-flow edges. Walk it backward from outputs; any node without a path to an output is garbage. Kill it before emitting a single byte. This is not nice-to-have. It is table stakes for production code generators.

dag_prune_dead_code.py (Python)

# io.thecodeforge.dag_prune_dead_code
def prune_dead_nodes(dag, live_outputs):
    """Remove every node that no live output depends on.

    dag is a dict: node_id -> { 'op': ..., 'children': [...], 'is_output': bool }
    """
    live_set = set()
    worklist = list(live_outputs)
    # Walk backward from the outputs: a node is live if an output reaches it.
    while worklist:
        node_id = worklist.pop()
        if node_id in live_set:
            continue
        live_set.add(node_id)
        worklist.extend(dag[node_id]['children'])
    # Delete every node the walk never reached.
    for node_id in list(dag.keys()):
        if node_id not in live_set:
            del dag[node_id]
    return dag

# Example: DAG for t0 = a + b; t1 = a * b (never read); d = t0
# Node IDs: 1 (a), 2 (b), 3 (+), 4 (d), 5 (*, dead)
dag = {
    1: {'op': 'leaf', 'children': [], 'is_output': False},
    2: {'op': 'leaf', 'children': [], 'is_output': False},
    3: {'op': 'add', 'children': [1, 2], 'is_output': False},
    4: {'op': 'assign', 'children': [3], 'is_output': True},
    5: {'op': 'mul', 'children': [1, 2], 'is_output': False},
}
dag = prune_dead_nodes(dag, [4])
print('Surviving nodes:', sorted(dag.keys()))  # Should show [1, 2, 3, 4]

Output

Surviving nodes: [1, 2, 3, 4]
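Pruning presupposes that the DAG shares nodes in the first place. Here is a minimal sketch of building one from three-address statements, assuming each statement is a (dest, op, lhs, rhs) tuple; the function name `build_dag` and the node encoding are hypothetical.

```python
def build_dag(statements):
    """Build a DAG from (dest, op, lhs, rhs) three-address statements.

    Returns (nodes, names): nodes is a list of ('leaf', name) or
    (op, lhs_id, rhs_id) tuples; names maps each variable to its node id.
    """
    nodes, index, names = [], {}, {}

    def intern(key):
        # Reuse an existing node with the same operator and operands.
        if key not in index:
            index[key] = len(nodes)
            nodes.append(key)
        return index[key]

    def operand(name):
        # A name defined earlier refers to that node; otherwise it is a leaf.
        return names[name] if name in names else intern(('leaf', name))

    for dest, op, lhs, rhs in statements:
        names[dest] = intern((op, operand(lhs), operand(rhs)))
    return nodes, names

block = [('t0', '+', 'a', 'b'), ('t1', '+', 't0', 'c'), ('d', '+', 't0', 't1')]
nodes, names = build_dag(block)
print(len(nodes), names)  # 6 {'t0': 2, 't1': 4, 'd': 5}
```

The six nodes are three leaves and three adds. Both t1 and d point at the single node for t0, so a + b is evaluated once.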

⚠ Production Trap:

Don't trust your front-end parser to eliminate dead code. Many parser-generated ASTs contain unreachable branches from conditional compilation. Run a DAG-based dead-code elimination pass as the very first step of code generation. LLVM's 'SimplifyCFG' pass is a real-world example—it aggressively prunes before any instruction selection.

🎯 Key Takeaway

Every instruction that doesn't contribute to an output is a bug waiting to happen. The DAG is your truth; prune everything else.

Register Descriptors Are Not Optional—They Are Your Contract with the Machine

If your code generator doesn't track which register holds what value, you are generating garbage—literally random register reuse. Register descriptors and address descriptors are the two data structures that prevent that. A register descriptor is a simple map: each register tracks which variable it currently holds, plus a dirty flag indicating if the value in the register differs from memory. An address descriptor records where each variable lives: register, stack offset, or both. When your getReg function allocates a new register, it must read both descriptors. If every register is occupied, it must spill: write a dirty register's value back to memory, then update both descriptors. This is not academic. Production allocators such as GCC's IRA (Integrated Register Allocator) track the same facts in a different form, using live ranges rather than per-instruction descriptors, but the invariant is identical: at every program point the compiler must know where each value lives. Without it, you will corrupt data when two variables clash in the same register. The fix is mechanical: before every instruction, check if the source operands are in registers. If not, load them. After the instruction, update the destination register descriptor. Always flush dirty registers at basic-block boundaries unless you have precise liveness information.

register_descriptor_manager.py (Python)

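# io.thecodeforge.register_descriptor_manager
class RegisterDescriptor:
    def __init__(self, num_regs):
        # Register descriptor: which variable each register holds, and whether
        # the register copy is newer than the copy in memory (dirty).
        self.regs = {f'R{i}': {'var': None, 'dirty': False} for i in range(num_regs)}
        self.addr_desc = {}  # var -> set of locations (register/stack)

    def mark_dirty(self, reg):
        # Call after an instruction writes reg (which must hold a variable):
        # the memory copy is now stale.
        self.regs[reg]['dirty'] = True
        self.addr_desc[self.regs[reg]['var']].discard('stack')

    def _assign(self, reg, var):
        old_var = self.regs[reg]['var']
        if old_var is not None:
            # A dirty value must be stored first (a real generator emits the
            # store here); a clean value is already current in memory.
            self.addr_desc[old_var].discard(reg)
            self.addr_desc[old_var].add('stack')  # assume stack spill
        self.regs[reg] = {'var': var, 'dirty': False}
        self.addr_desc.setdefault(var, set()).add(reg)
        return reg

    def get_register(self, var):
        # 1. if already in a register, return it
        for reg, state in self.regs.items():
            if state['var'] == var:
                return reg
        # 2. prefer an empty register
        for reg, state in self.regs.items():
            if state['var'] is None:
                return self._assign(reg, var)
        # 3. otherwise evict a clean register (no store needed)
        for reg, state in self.regs.items():
            if not state['dirty']:
                return self._assign(reg, var)
        # 4. all dirty; spill first register (e.g., R0)
        return self._assign('R0', var)

rd = RegisterDescriptor(4)
print(rd.get_register('x'))  # R0
print(rd.get_register('y'))  # R1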

Output

🔥Design Decision:

Most simple code generators use linear-scan register allocation, which avoids complex graph coloring. But linear scan still requires a register descriptor for correctness. Real-world example: the LuaJIT compiler uses a lightweight register descriptor in its IR, enabling fast allocation without sacrificing safety.

🎯 Key Takeaway

A register descriptor turns random register reuse into deterministic resource management. Without it, you're not generating code—you're corrupting data.

LLVM IR and SSA Form: Modern Code Generation

LLVM IR (Intermediate Representation) is a low-level, language-independent representation used by the LLVM compiler framework. It employs Static Single Assignment (SSA) form, where each variable is assigned exactly once, and every use refers to a single definition. This simplifies data-flow analysis and enables powerful optimizations like constant propagation and dead code elimination.

Consider a simple C function: int add(int a, int b) { return a + b; }. In LLVM IR, this becomes:

; ModuleID = 'add.c'
define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add i32 %a, %b
  ret i32 %sum
}

Here, %sum is defined once and used by the ret instruction. SSA form ensures that each use corresponds to a single reaching definition, which is crucial for optimizations like global value numbering.

Modern code generators leverage LLVM IR for target-independent optimization before lowering to machine code. For example, the LLVM backend uses a target description to select instructions and allocate registers. The SSA form is maintained until register allocation, where phi nodes are eliminated via copy insertion.
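Phi elimination by copy insertion can be sketched as follows. This is the naive textbook version under stated assumptions: blocks are lists of tuples, the last instruction in each block is its terminator, and the function name and instruction encoding are hypothetical. It ignores the lost-copy and swap problems that a production compiler must handle.

```python
def eliminate_phis(blocks):
    """Replace each phi with copies at the end of the predecessor blocks.

    blocks maps a label to a list of instructions. A phi is
    ('phi', dest, {pred_label: value}); the last instruction of every
    block is assumed to be its terminator (branch or return).
    """
    for label, instrs in blocks.items():
        phis = [i for i in instrs if i[0] == 'phi']
        for _, dest, incoming in phis:
            for pred, value in incoming.items():
                # Insert the copy just before the predecessor's terminator.
                blocks[pred].insert(len(blocks[pred]) - 1, ('copy', dest, value))
        blocks[label] = [i for i in instrs if i[0] != 'phi']
    return blocks

cfg = {
    'entry': [('br', 'for.body')],
    'for.body': [
        ('phi', 'i', {'entry': 0, 'for.body': 'next'}),
        ('add', 'next', 'i', 1),
        ('icmp_slt', 'cmp', 'next', 100),
        ('condbr', 'cmp', 'for.body', 'exit'),
    ],
    'exit': [('ret',)],
}
for label, instrs in eliminate_phis(cfg).items():
    print(label, instrs)
```

After the pass, entry ends with a copy of 0 into i, and the loop body copies next into i just before its back edge. Those copies are the ones the register allocator later tries to coalesce away.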

Practical example: Optimizing a loop with induction variables. In SSA, the loop counter is defined by a phi node, enabling strength reduction.

; Before optimization
for.body:
  %i = phi i32 [ 0, %entry ], [ %next, %for.body ]
  %next = add i32 %i, 1
  %cmp = icmp slt i32 %next, 100
  br i1 %cmp, label %for.body, label %exit

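Strength reduction pays off when the loop body uses %i to index memory: the per-iteration multiply in the address computation (base + i * element_size) is replaced by a pointer that advances by a fixed stride. The phi node is what lets the optimizer recognize %i as a simple induction variable.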
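Induction-variable strength reduction is easiest to see on an address computation. The sketch below compares the two forms in plain Python; the function names, the base address, and the 4-byte stride are assumptions for illustration.

```python
def addresses_with_multiply(base, count, stride=4):
    # Recomputes base + i * stride on every iteration.
    return [base + i * stride for i in range(count)]

def addresses_strength_reduced(base, count, stride=4):
    # Keeps a running pointer and bumps it by the stride instead.
    out, ptr = [], base
    for _ in range(count):
        out.append(ptr)
        ptr += stride
    return out

print(addresses_with_multiply(0x1000, 4) == addresses_strength_reduced(0x1000, 4))
```

Both loops visit the same addresses; the second replaces a multiply per iteration with an add.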

LLVM IR and SSA form are foundational for modern compilers, enabling aggressive optimizations while maintaining correctness.

add.ll (LLVM)

; ModuleID = 'add.c'
define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add i32 %a, %b
  ret i32 %sum
}

🔥SSA Simplifies Analysis

📊 Production Insight

In production compilers like Clang, LLVM IR is used to perform over 100 optimization passes before code generation, significantly improving runtime performance.

🎯 Key Takeaway

LLVM IR with SSA form enables target-independent optimizations and simplifies data-flow analysis, making it a cornerstone of modern compilers.

Register Allocation: Linear Scan vs Graph Coloring

Register allocation is the process of assigning a large number of virtual registers to a limited set of physical registers. Two prominent algorithms are graph coloring and linear scan.

Graph coloring models register allocation as a graph coloring problem: each virtual register is a node, and edges represent interference (live ranges that overlap). The goal is to color the graph with k colors (physical registers) such that no adjacent nodes share the same color. If coloring fails, values are spilled to memory. This algorithm produces high-quality allocations but is computationally expensive (NP-hard in general, but heuristics like Chaitin's algorithm work well).
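A minimal Python sketch of the simplify/select idea behind Chaitin-style coloring follows. It assumes live ranges are given as simple (start, end) pairs and uses "highest degree" as the spill heuristic; both are simplifications chosen for the example, and real allocators weigh spill cost (loop depth, use counts) instead.

```python
def build_interference(intervals):
    """intervals: {name: (start, end)}; two ranges interfere if they overlap."""
    graph = {name: set() for name in intervals}
    names = list(intervals)
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            (s1, e1), (s2, e2) = intervals[u], intervals[v]
            if s1 < e2 and s2 < e1:
                graph[u].add(v)
                graph[v].add(u)
    return graph

def color(graph, k):
    """Simplify/select with k registers. Returns {name: register index or 'spill'}."""
    work = {n: set(adj) for n, adj in graph.items()}
    stack, spilled = [], []
    while work:
        # Simplify: a node with fewer than k neighbors can always be colored later
        node = next((n for n in work if len(work[n]) < k), None)
        if node is None:
            # No such node: pick a spill candidate (here simply the highest degree)
            node = max(work, key=lambda n: len(work[n]))
            spilled.append(node)
        else:
            stack.append(node)
        for neighbor in work.pop(node):
            work[neighbor].discard(node)
    assignment = {n: 'spill' for n in spilled}
    # Select: pop nodes and give each the lowest register unused by its neighbors
    while stack:
        node = stack.pop()
        used = {assignment.get(n) for n in graph[node]}
        assignment[node] = next(c for c in range(k) if c not in used)
    return assignment

graph = build_interference({'a': (1, 5), 'b': (2, 6), 'c': (3, 7)})
result = color(graph, k=2)
print(result)
```

With three mutually overlapping ranges and two registers, exactly one range is spilled. Which one depends entirely on the spill heuristic.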

Linear scan, on the other hand, processes live intervals in order of their start points. It maintains a list of active intervals and assigns registers greedily. When no register is available, it spills the interval that ends furthest in the future. Linear scan is simpler and faster (O(n log n)), making it suitable for just-in-time (JIT) compilers. However, it may produce more spills than graph coloring.

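Example: Consider three variables with live ranges a (1-5), b (2-6), c (3-7) and 2 registers. All three ranges overlap, so the interference graph is a triangle that cannot be colored with two colors; graph coloring must spill one of them, chosen by spill cost (say a=R1, b=R2, c=spill). Linear scan assigns a=R1 and b=R2; when c starts, both registers are occupied, so it compares c against the active interval that ends last (b, ending at 6). Because c ends even later (7), c itself is spilled. Both allocators spill once here. The choices diverge when the new interval is short: if c were (3-4), linear scan would spill b and give R2 to c, because b would then be the interval ending furthest in the future.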

Modern compilers often pick the algorithm by tier: linear scan where compile time matters, graph coloring where code quality matters. LLVM's default allocator (Greedy) replaced its earlier linear scan allocator; it assigns live ranges in priority order and relies on live range splitting and eviction to reduce spills.

regalloc_example.pyPYTHON

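# Simplified linear scan example; intervals are (name, start, end)
intervals = [('a', 1, 5), ('b', 2, 6), ('c', 3, 7)]
NUM_REGISTERS = 2

free = list(range(NUM_REGISTERS))  # physical registers not currently in use
active = []                        # intervals currently holding a register
assignment = {}

for name, start, end in sorted(intervals, key=lambda iv: iv[1]):
    # Expire intervals that ended before this one starts and reclaim their registers
    for old in [iv for iv in active if iv[2] <= start]:
        active.remove(old)
        free.append(assignment[old[0]])
    if free:
        assignment[name] = free.pop(0)
        active.append((name, start, end))
    else:
        # Spill candidate: the active interval that ends furthest in the future
        victim = max(active, key=lambda iv: iv[2])
        if victim[2] > end:
            # The victim outlives the new interval: spill it and reuse its register
            assignment[name] = assignment[victim[0]]
            assignment[victim[0]] = 'spill'
            active.remove(victim)
            active.append((name, start, end))
        else:
            # The new interval ends last, so it is the one spilled
            assignment[name] = 'spill'

print(assignment)  # {'a': 0, 'b': 1, 'c': 'spill'}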

💡When to Use Each Algorithm

📊 Production Insight

LLVM's default register allocator (Greedy) replaced its original linear scan allocator. It outperforms traditional linear scan by using live range splitting, eviction, and spill heuristics.

🎯 Key Takeaway

Graph coloring yields better allocations but is slower; linear scan is faster but may spill more. The choice depends on compile-time vs runtime performance trade-offs.

Instruction Selection: Tree Pattern Matching

Instruction selection maps IR operations to target machine instructions. Tree pattern matching is a common technique where the IR is represented as a tree (or DAG) and matched against patterns describing target instructions. Each pattern has a cost, and the goal is to cover the tree with the lowest total cost.

For example, consider adding two values loaded from memory: *(a + 4) + *(b + 8). In a tree representation, the root is an ADD node with two LOAD children, each having an ADD child for the address computation. A target might have an instruction such as addl offset(base), reg that loads from memory and adds to a register in one step. The pattern matcher would recognize the root ADD together with one LOAD subtree and its address computation, and emit that single instruction.

Tools like BURG (Bottom-Up Rewrite Generator) generate matchers from grammar rules. Each rule has a cost and a template for the instruction. The matcher uses dynamic programming to find the optimal covering.
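The dynamic-programming idea can be sketched by hand in a few lines of Python. This is not what BURG generates (BURG emits table-driven matchers from a grammar); it is a hand-written recursive matcher over a hypothetical tuple-encoded tree, with invented costs and instruction names, meant only to show how the cheapest cover is chosen bottom-up.

```python
# Hand-written sketch of cost-driven tree covering. Node shapes, costs, and
# instruction strings are all invented for illustration.

def is_addr(node):
    # Matches the addressing-mode shape offset(base): ADD(REG, CONST)
    return node[0] == 'ADD' and node[1][0] == 'REG' and node[2][0] == 'CONST'

def select(node):
    """Return (cost, instructions) for the cheapest cover of node."""
    op = node[0]
    if op == 'REG':
        return 0, []                      # already in a register
    if op == 'CONST':
        return 1, ['mov imm, reg']
    candidates = []
    if op == 'LOAD':
        cost, code = select(node[1])
        candidates.append((cost + 3, code + ['load (reg), reg']))
        if is_addr(node[1]):
            # One instruction covers the LOAD and its address ADD
            candidates.append((3, ['load offset(base), reg']))
    elif op == 'ADD':
        lc, lcode = select(node[1])
        rc, rcode = select(node[2])
        candidates.append((lc + rc + 1, lcode + rcode + ['add reg, reg']))
        if node[2][0] == 'LOAD' and is_addr(node[2][1]):
            # add with a memory operand covers ADD, LOAD, and the address ADD
            candidates.append((lc + 3, lcode + ['add offset(base), reg']))
    else:
        raise ValueError(f'no pattern for {op}')
    return min(candidates, key=lambda c: c[0])

# *(a + 4) + *(b + 8)
tree = ('ADD',
        ('LOAD', ('ADD', ('REG', 'a'), ('CONST', 4))),
        ('LOAD', ('ADD', ('REG', 'b'), ('CONST', 8))))

cost, code = select(tree)
print(cost, code)  # 6 ['load offset(base), reg', 'add offset(base), reg']
```

The two-instruction cover (cost 6) beats the naive cover that loads both operands and then adds registers (cost 7), which is the choice the text above describes.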

Practical example: In LLVM, the SelectionDAG phase converts LLVM IR into a DAG of SDNodes. The target's instruction selection patterns are defined in TableGen files. For instance, the x86 target has a pattern for add that matches an ADD node with two register operands and emits ADD32rr.

```tablegen
// In X86InstrArithmetic.td
def : Pat<(add GR32:$src1, GR32:$src2),
          (ADD32rr GR32:$src1, GR32:$src2)>;
```

Tree pattern matching ensures that the generated code is both correct and efficient by selecting the best instruction sequence for each IR construct.

pattern_example.tdTABLEGEN

// LLVM TableGen pattern for x86 add
def : Pat<(add GR32:$src1, GR32:$src2),
          (ADD32rr GR32:$src1, GR32:$src2)>;

// Pattern for load-add: add (load addr), reg
def : Pat<(add (load addr:$src), GR32:$dst),
          (ADD32rm GR32:$dst, addr:$src)>;

🔥Cost-Driven Selection

📊 Production Insight

LLVM's SelectionDAG uses tree pattern matching with over 10,000 patterns for x86, automatically generated from TableGen descriptions, ensuring high-quality code generation.

🎯 Key Takeaway

Tree pattern matching with costs drives instruction selection toward the cheapest combination of target instructions that covers the IR. The cover is optimal for trees; DAGs with shared nodes need heuristics.

● Production incident · POST-MORTEM · severity: high

The Spilled Register That Corrupted a Medical Device

Symptom

The pump's control loop would occasionally write incorrect values to actuator registers, causing erratic dosage delivery. No reproducible test case existed in the lab — the failure only appeared on hardware under load.

Assumption

The team assumed a hardware bug or cosmic bit flip. The compiler was considered trustworthy. Three weeks were spent on hardware diagnostics before anyone looked at the generated assembly.

Root cause

The compiler's register allocator spilled a live virtual register to the stack during a function call, but the callee's stack frame overlapped with the spill slot due to an incorrect frame size calculation in the code generator. The frame-pointer was omitted as part of the default optimization level, which meant the corrupt spill had no fixed reference point to detect the overlap at runtime.

Fix

Added -fno-omit-frame-pointer to the firmware build flags, which forced the code generator to use a stable frame reference and recalculate spill slot offsets correctly. Verified with -fverbose-asm that the spill slots no longer overlapped the callee frame. Recompiled the firmware; the corruption never returned. Added a CI step that compares spill slot assignments between compiler versions.

Key lesson

A spilled register can silently corrupt memory if the stack frame layout is off by a single byte — and the symptoms look exactly like a hardware fault.
Use -fverbose-asm with -g to map assembly back to source lines the moment you suspect a codegen issue. It is the fastest path from symptom to root cause.
Never trust a compiler upgrade in a safety-critical system without running differential assembly comparison on every hot function. A spill location change is a correctness change.
Run Csmith differential fuzzing before deploying a new compiler version to embedded targets — it catches miscompilations that integration tests miss because they test program behaviour, not generated code correctness.
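The overlap described in this incident is, at its core, an intersection between two byte ranges in the stack frame. The sketch below shows that check in Python; the slot names, offsets, and sizes are invented for illustration and are not taken from the firmware in question.

```python
def find_overlaps(slots):
    """slots: list of (name, offset, size) byte ranges relative to one base.
    Returns pairs of names whose [offset, offset + size) ranges intersect."""
    ordered = sorted(slots, key=lambda s: s[1])
    overlaps = []
    for i, (name_a, off_a, size_a) in enumerate(ordered):
        for name_b, off_b, _ in ordered[i + 1:]:
            if off_b >= off_a + size_a:
                break  # sorted by offset, so nothing later can overlap name_a
            overlaps.append((name_a, name_b))
    return overlaps

# Hypothetical layout: the callee frame starts one byte inside the spill slot
layout = [
    ('spill_slot_0', 16, 8),   # occupies bytes 16..23
    ('callee_frame', 23, 32),  # starts at byte 23, one byte too early
    ('saved_lr', 8, 8),        # occupies bytes 8..15
]

overlaps = find_overlaps(layout)
print(overlaps)  # [('spill_slot_0', 'callee_frame')]
```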

Production debug guide · Symptom → Action guide for diagnosing codegen issues in production · 7 entries

Symptom · 01

Program crashes only in release mode (optimized build)

→

Fix

Enable -g with optimizations to get debug symbols, then inspect the assembly with objdump -d. Look for uninitialized register reads or instructions that reference unexpected memory locations. Rebuild with -Og (optimize for debugging) to narrow down which optimization level triggers the crash: -Og applies the -O1 optimizations that do not interfere with debugging, so a crash that reproduces at -Og points at a much smaller set of passes.

Symptom · 02

Performance regression after changing a loop's compilation unit

→

Fix

Use perf record to identify hot instructions, then use -Rpass-missed=regalloc (LLVM reports spills and reloads as missed-optimization remarks) or -fopt-info-all (GCC) to see whether the register allocator spilled a critical loop variable. Compare -O1 vs -O2 assembly output for the function: if the spill appears only at -O2, a transformation enabled at that level (for example inlining or unrolling) has raised register pressure. Restructure the loop to reduce the number of values live across iterations.

Symptom · 03

Incorrect results in floating-point calculations

→

Fix

Check whether the compiler is generating x87 instructions instead of SSE. x87 keeps intermediates at 80-bit extended precision, so results can round differently from IEEE 754 64-bit arithmetic. Force SSE2 with -mfpmath=sse -msse2. Also verify that -ffast-math is not enabled accidentally: it lets the compiler reassociate floating-point operations as if they were associative, which they are not, and that breaks numerically sensitive code.
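Why reassociation changes results can be seen without a compiler at all. The short Python sketch below evaluates the same three-term sum in two groupings using IEEE 754 doubles; it only demonstrates the arithmetic property and does not model what any particular compiler flag does.

```python
# Floating-point addition is not associative: grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # rounds after a + b, then again after adding c
right = a + (b + c)   # rounds after b + c, then again after adding a

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```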

Symptom · 04

Binary size explosion after adding a single function

→

Fix

Use -fverbose-asm and look for duplicated code blocks: the compiler likely inlined a large function at multiple call sites, and each inlined copy is emitted separately. Mark the suspect function __attribute__((noinline)), build with -fno-inline to confirm, or adjust inlining thresholds with -finline-limit. Compare the .text section size reported by size <binary> before and after.

Symptom · 05

SIGILL after compiler upgrade

→

Fix

The new compiler may be generating AVX or AVX-512 instructions for a CPU that does not support them. Check with grep avx /proc/cpuinfo, then use objdump -d binary | grep -E 'ymm|zmm' to find instructions that touch the wide AVX registers (objdump does not label VEX encodings, but AVX mnemonics carry a v prefix, such as vmovaps). Add -mno-avx or -march=<baseline> to constrain the instruction set. This is common when upgrading compilers on build machines with newer CPUs than the deployment target.

Symptom · 06

Race condition only in optimized build

→

Fix

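The optimizer may have reordered, hoisted, or cached a memory access that the source code assumed would happen in program order. Rebuild with -fno-schedule-insns2 to disable post-register-allocation scheduling and see whether the symptom changes. In most cases the source has a data race (a shared variable accessed without atomics or locks) that the unoptimized build happened to hide, so rebuild with -fsanitize=thread to confirm the race independently. File a compiler bug only if the code is race-free and the scheduler moved a load past a store to the same address: that is a correctness issue, not a tuning issue.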

Symptom · 07

Function returns incorrect value on ARM but not on x86

→

Fix

Check the calling convention difference. ARM AAPCS returns integers in r0; x86-64 System V uses rax. If you are calling across an FFI boundary without matching ABI declarations, the return value will be read from the wrong register. Verify with -mabi=aapcs on ARM and check structure layout — ARM requires natural alignment that x86 code sometimes violates silently.

★ Quick Debug: Compiler-Generated Code Issues

When your program behaves differently between debug and release builds, or you suspect a compiler bug, run these checks before diving into assembly.

Segfault only in optimized build

Immediate action

Rebuild with -O0 -g. If the crash disappears, reintroduce optimizations incrementally with -Og, then -O1, then -O2 to find the lowest level that reproduces it.

Commands

g++ -O2 -g -S -fno-move-loop-invariants main.cpp -o main.s && grep -A 30 'function_name' main.s

objdump -d -S a.out | grep -A 50 '<function_name>:'

Fix now

Add __attribute__((optimize("O0"))) (GCC) or __attribute__((optnone)) (Clang) on the suspect function to disable optimization for that function only, confirming the crash is optimization-induced before bisecting further.

Wrong floating-point result in release build

Stack smashing detected after function call

Performance regression after compiler upgrade: spills in hot loop

SIGILL when calling function compiled with AVX on older CPU

Random memory corruption in multithreaded JIT

Function returns wrong value after inlining

Undefined reference to standard library symbol after compiler upgrade

⚙ Quick Reference

10 commands from this guide

File	Command / Code	Purpose
iothecodeforgecodegenadd_example.ll	; Actual LLVM IR emitted by: clang -O1 -emit-llvm -S -o - add.c	What Is Code Generation?
iothecodeforgecodegenssa_ir_example.ll	; LLVM IR demonstrating SSA form and phi nodes	Intermediate Representation
iothecodeforgecodegenspill_demo.cpp	void high_pressure(float* __restrict__ out, const float* a, const float* b,	Register Allocation
iothecodeforgecodegeninstruction_select.c	int double_it(int x) {	Instruction Selection
iothecodeforgecodegenpeephole_demo.c	int copy_prop_example(int b) {	Peephole Optimization
dag_prune_dead_code.py	def prune_dead_nodes(dag, live_outputs):	Killing Redundant Work
register_descriptor_manager.py	class RegisterDescriptor:	Register Descriptors Are Not Optional
add.ll	; ModuleID = 'add.c'	LLVM IR and SSA Form
regalloc_example.py	intervals = [('a', 1, 5), ('b', 2, 6), ('c', 3, 7)]	Register Allocation
pattern_example.td	def : Pat<(add GR32:$src1, GR32:$src2),	Instruction Selection

Key takeaways

Code generation is the compiler stage that translates IR into target-specific machine instructions, deciding register allocation, stack layout, and instruction encoding.

A single incorrect register assignment or ABI violation in codegen can cause silent memory corruption with no compiler warning.

Debug release-build-only crashes by inspecting generated assembly and bisecting codegen passes using -fopt-info or -Rpass=.*, not your source code.

Trap instructions inserted in unreachable basic blocks can be speculatively executed by out-of-order CPUs, causing crashes at impossible source locations.

ELF relocation and PIC decisions are made during code generation; 'relocation truncated to fit' errors usually originate from codegen, not the linker.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR

What is the difference between graph coloring and linear scan register a...

Q02SENIOR

A C function compiled with -O2 crashes on ARM64 but works fine on x86-64...

Q03SENIOR

Explain the role of the 'phi' node in SSA form. Why does it exist, and w...

Q01 of 03SENIOR

What is the difference between graph coloring and linear scan register allocation? When would you choose one over the other?

ANSWER

Graph coloring builds an interference graph of all live ranges and attempts to assign physical registers as K colors using heuristics (e.g., Chaitin's algorithm). It produces better code quality but has higher compile-time overhead. Linear scan sorts live ranges by start point and greedily assigns registers in a single pass. It is faster to compile but can miss better allocations because it doesn't consider the full interference graph. Choose linear scan for JIT compilers where compilation latency directly affects user experience (e.g., V8's Crankshaft). Choose graph coloring for AOT or top-tier compilers targeting long-running server or scientific workloads where the compile-time investment pays off in runtime performance (e.g., HotSpot's C2). LLVM's greedy allocator is neither: it replaced LLVM's linear scan allocator and relies on priority ordering, live range splitting, and eviction.
