Advanced 17 min · March 06, 2026

Compiler Code Generation — When Register Spills Corrupt

A spilled register overlapped a stack frame by one byte, causing silent memory corruption.

Naren · Founder
Quick Answer
  • Code generation turns IR into target machine instructions — every register assignment, stack layout, and instruction encoding is decided here.
  • Register allocation maps virtual to physical registers; spilling to memory in a hot loop has caused up to 10x slowdown in memory-bandwidth-limited workloads.
  • Instruction selection picks the cheapest pattern covering each IR operation — cost models are micro-architecture-specific and wrong assumptions here show up as production regressions.
  • Peephole optimizations polish generated code locally after initial emit — they rarely introduce bugs but interact badly with schedulers in ways that are hard to reproduce.
  • Production insight: most compiler CVEs and release-build-only crashes trace back to codegen, not the frontend.
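  • Debugging: use -Rpass-missed=regalloc (LLVM reports spill and reload counts as missed-optimization remarks) or the -fdump-rtl-ira and -fdump-rtl-reload dumps (GCC) to trace spill decisions; use -fverbose-asm to annotate the generated assembly.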
Plain-English First

Imagine you write a recipe in English, then a professional chef translates it into precise kitchen instructions for a specific restaurant's equipment — listing exact burner numbers, which pan to use, in what exact order. Code generation is that translation step: your program has already been understood and optimized in a machine-independent form, and now the compiler writes precise CPU instructions for the exact hardware it's targeting. The CPU doesn't speak Python or Java — it speaks binary opcodes — and code generation is the compiler's job of bridging that gap. The translation looks mechanical but it isn't: the chef has to decide which burner to use when all six are occupied, what to do when the pan specified doesn't exist in this kitchen, and how to reorder steps so nothing burns while something else is resting. Get those decisions wrong and the dish comes out wrong — even if the recipe was perfect.

Your IDE's compile button triggers a pipeline that, in milliseconds, translates source code into native instructions. The final stage — code generation — makes the concrete decisions: which register, which instruction encoding, which stack layout. Get it wrong and you'll see silent data corruption, security vulnerabilities, or a 3× slowdown on a hot loop. That's the production reality. Most compiler CVEs trace back to codegen, not the frontend. So when you debug a crash that only appears in release builds, start with the generated assembly, not your source logic. Understanding code generation is a senior-level debugging superpower — not because you'll rewrite a compiler, but because you'll know exactly where to look when the compiler betrays you.

What Is Code Generation?

Before we get into the mechanics, here is a concrete demonstration of what code generation actually does. Take this three-line C function:

```c
int add(int b, int c) {
    return b + c * 2;
}
```

On x86-64 Linux with -O1, clang lowers it to a short lea-based sequence. On ARM64 with the same flags, it becomes an add with a shifted operand. Same source, same semantics, different machine output, because code generation is target-specific by definition. That is the problem it solves.

At its core, code generation takes the compiler's intermediate representation (IR) — a machine-independent, low-level program description — and translates it into the actual instruction set of the target CPU. This step decides every concrete detail: which CPU register holds a variable, which addressing mode to use for a memory access, how to arrange the stack frame, which of the dozens of x86 instruction encodings to pick for a simple addition, and whether a 64-bit constant fits in an immediate field or needs to be loaded from memory.

What makes this hard? The compiler must produce correct output for every possible IR input, while also squeezing out wasted cycles. A single incorrect register assignment corrupts the entire program state. The generated code must also respect the target's ABI — calling conventions, data alignment, exception handling unwind tables — otherwise a C++ function cannot talk to an assembly library correctly.

That ABI constraint is where production bugs hide. I once spent two days debugging why a C library function returned garbage values on ARM. The code generator assumed struct alignment matched x86 conventions. It did not. The bug was not in the algorithm; it was in the code generator's assumptions about the target. The fix was one line: __attribute__((aligned(8))) on the struct. Two days of confusion, one attribute.
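One way to catch this class of bug at compile time is to pin the layout you expect with static_assert. The sketch below is a minimal illustration, not the code from the incident: the SensorRecord struct, its fields, and the timestamp_offset helper are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical record shared between a C library and its callers.
struct SensorRecord {
    std::uint32_t id;
    alignas(8) std::uint64_t timestamp;  // request 8-byte alignment on every target
    std::uint16_t flags;
};

// If a target lays the struct out differently, the build fails here
// instead of a function returning garbage at run time.
static_assert(alignof(SensorRecord) == 8, "unexpected struct alignment");
static_assert(offsetof(SensorRecord, timestamp) == 8, "unexpected field offset");
static_assert(sizeof(SensorRecord) == 24, "unexpected struct size");

// Runtime view of the same fact, handy when logging from a target board.
inline std::size_t timestamp_offset() {
    return offsetof(SensorRecord, timestamp);
}
```

Putting the assertions next to the struct definition means every translation unit that includes the header re-checks the layout for its own target and flags.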

In modern compilers like LLVM and GCC, code generation is split into multiple sequential passes. Each pass can be inspected independently. Use -fopt-info in GCC or -Rpass=.* in LLVM to trace which pass transformed your code. When you hit a release-build-only crash, bisecting these passes — not your source code — is the fastest path to the root cause.

One trap that catches teams repeatedly: how the code generator handles unreachable code. When the optimizer proves a basic block unreachable, the backend may delete it or, depending on target and flags, leave a trap instruction (ud2 on x86, udf on ARM) in its place. That proof usually rests on the assumption that the program has no undefined behavior, for example that a non-void function never falls off its end. Speculative execution cannot trigger the trap, because a speculatively fetched instruction never retires. If the trap fires, control really did reach that block, which means the assumption was violated at run time. When you see a crash in a location your source logic says can never be reached, look for a trap instruction in the disassembly, then for the undefined behavior that led there.

Another thing you will not find in textbooks: ELF relocations and PIC (position-independent code) decisions are made during code generation. If you see 'relocation truncated to fit' at link time, that is the code generator choosing a relocation type whose offset range the final binary layout violated. Check the relocation table with readelf -r before assuming it is a linker bug — it usually is not.

io/thecodeforge/codegen/add_example.ll (LLVM)
; Simplified LLVM IR for: clang -O1 -emit-llvm -S -o - add.c
; (attributes and metadata trimmed; clang typically canonicalizes the
;  multiply by 2 to 'shl nsw i32 %c, 1', shown here as mul for readability)
; Source: int add(int b, int c) { return b + c * 2; }
;
; Key observations:
;   1. Unlimited virtual registers (%b, %c, %mul, %add), no physical registers yet
;   2. SSA form: every name is assigned exactly once
;   3. No target-specific details: no rax, no stack frame, no calling convention
;   4. 'nsw' = no signed wrap. The frontend attaches it because signed overflow
;      is undefined behavior in C, and later passes can exploit that fact
;
; Code generation's job is to turn THIS into the assembly below.

define i32 @add(i32 %b, i32 %c) {
entry:
  %mul = mul nsw i32 %c, 2        ; t1 = c * 2
  %add = add nsw i32 %b, %mul     ; result = b + t1
  ret i32 %add
}

; ── x86-64 output (clang -O1 -target x86_64-linux-gnu), unfused for illustration ──
;
; add(int, int):
;   lea  eax, [rsi + rsi]    ; eax = c + c  (multiply by 2 via address calc)
;   add  eax, edi            ; eax = b + eax
;   ret
;
; Recent clang and GCC releases typically fold both steps into a single
;   lea  eax, [rdi + 2*rsi]
;
; The instruction selector chose 'lea' over 'imul' for *2 because:
;   - imul issues on one port (port 1) with 3-cycle latency on recent Intel cores,
;     while simple lea issues on ports 1 and 5
;   - lea writes a destination separate from its sources, so no extra mov is needed
;   - cost model says lea latency == add latency == 1 cycle here
;
; ── ARM64 output (clang -O1 -target aarch64-linux-gnu) ─────────────────────
;
; add(int, int):
;   add  w0, w0, w1, lsl #1  ; w0 = b + (c << 1) — fused shift-and-add
;   ret
;
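; ARM64 data-processing instructions accept a shifted register operand,
; so the shift folds into the add. x86-64 reaches a similar effect through
; lea's scaled-index addressing.
; The same IR produces different instruction sequences on x86-64 and ARM64.
; This is code generation being target-specific.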
Output
; x86-64: lea + add + ret as shown (lea + ret when the add is folded into the lea)
; ARM64: 2 instructions (add with shift + ret)
;
; Same source. Same IR. Different machine code.
; That is what code generation does.
Use -fverbose-asm to Map Assembly Back to Source
The fastest way to understand what the code generator did is to compile with -O2 -g -fverbose-asm -S and open the .s file. GCC interleaves the originating source lines as comments next to the instructions they produced. Clang's comments are sparser, but the .loc directives emitted with -g record the file and line for each group of instructions. When a crash points to an assembly address, this mapping tells you which C line produced it without starting a debugger. Make this part of your standard debugging workflow before reaching for gdb.
Production Insight
A struct alignment mismatch between the code generator's assumptions and the ARM ABI caused two days of debugging on a C library integration.
The fix was one attribute. The investigation took two days.
Rule: when a function returns garbage on a different architecture, check struct layout and alignment before checking your logic — the ABI is usually the culprit.
Key Takeaway
Code generation is the compiler's final concrete decision layer — every register, every stack byte, every instruction encoding.
One wrong ABI assumption corrupts state without any source-level error.
Use -fverbose-asm to map assembly to source lines; use -Rpass=.* to trace which pass made the transformation.
ABI Decision Points in Code Generation
If: Target is x86-64, OS is Linux
Use: System V AMD64 ABI applies. First 6 integer args in rdi, rsi, rdx, rcx, r8, r9. Return value in rax. Callee saves rbx, rbp, r12-r15.
If: Target is x86-64, OS is Windows
Use: Microsoft x64 ABI applies. First 4 integer args in rcx, rdx, r8, r9. Shadow space of 32 bytes must be allocated by caller. Use -mabi=ms or __attribute__((ms_abi)) when mixing with System V code.
If: Linking code compiled with different ABIs
Use: Recompile all objects with matching ABI flags. If that is not possible, write shim functions that explicitly declare __attribute__((sysv_abi)) or __attribute__((ms_abi)) to make the boundary explicit and let the compiler generate the correct transition code.
If: Writing kernel code or interrupt service routines
Use: Disable the red zone with -mno-red-zone. The red zone is a 128-byte area below the stack pointer that leaf functions use without adjusting rsp. A hardware interrupt pushes its frame onto the current stack and overwrites that area. User-space signal handlers are not affected, because the kernel skips the red zone when it builds the signal frame.
If: Catching codegen-induced undefined behavior
Use: Enable -fsanitize=undefined and -fsanitize=address in your debug build. These catch the class of bugs (unaligned accesses, out-of-bounds stack writes) that codegen decisions can expose without any source-level error.

Intermediate Representation — The Compiler's Lingua Franca

Before code generation runs, the compiler represents your program in an Intermediate Representation. IR sits between the source language and the target machine: it is like an assembly language that has never heard of x86 or ARM. Common IR forms include three-address code (TAC), static single assignment (SSA), and stack-based bytecode (JVM bytecode, CPython bytecode).

Why does the compiler need IR at all? Because it lets the hard work of language analysis (parsing, type checking, semantic analysis) happen once per language. That analysis produces IR, and then every optimization (dead code elimination, constant propagation, loop invariant hoisting) runs on the IR. These optimizations benefit every language that targets the IR, and they produce output that every backend can consume. LLVM supports roughly two dozen CPU and accelerator targets from a single IR. The frontends for C, C++, Rust, Swift, and Julia all emit the same LLVM IR. That is the payoff.

LLVM's IR is SSA-based: every virtual register is assigned exactly once, and every use of a value traces back to exactly one definition. This single property makes dataflow analysis, the foundation of most optimizations, far simpler. If you can see a use of %x, you know exactly one instruction produced %x, with no ambiguity about which definition reaches it. The guarantee covers virtual registers only; memory reached through pointers can still alias and needs separate alias analysis.
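To make the single-assignment property concrete, here is a minimal sketch of SSA renaming for straight-line code. It assumes there are no branches, so no phi nodes are needed. The Stmt type and the to_ssa function are illustrative names, not LLVM APIs.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One three-address statement: dst = lhs op rhs. Illustrative type, not an LLVM API.
struct Stmt {
    std::string dst, lhs, op, rhs;
};

// Rename every assignment to a fresh versioned name (x -> x.1, x.2, ...)
// and rewrite each use to the most recent version of that variable.
std::vector<Stmt> to_ssa(const std::vector<Stmt>& code) {
    std::map<std::string, int> version;  // latest version per source variable
    auto use = [&](const std::string& name) -> std::string {
        auto it = version.find(name);
        // Names never assigned in this block (inputs, literals) are left as-is.
        if (it == version.end()) return name;
        return name + "." + std::to_string(it->second);
    };
    std::vector<Stmt> out;
    for (const Stmt& s : code) {
        Stmt r;
        r.lhs = use(s.lhs);  // uses are renamed before the new definition is created
        r.rhs = use(s.rhs);
        r.op = s.op;
        r.dst = s.dst + "." + std::to_string(++version[s.dst]);
        out.push_back(r);
    }
    return out;
}
```

Feeding it x = a + b; x = x * 2; y = x - a yields x.1 = a + b; x.2 = x.1 * 2; y.1 = x.2 - a. After renaming, each name has exactly one definition, which is the property the optimizer relies on.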

SSA introduces phi nodes at control flow merge points. A phi node says: 'this value is either A (if we came from block 1) or B (if we came from block 2).' With large switch statements, you can generate tens of thousands of phi nodes. This blows up IR memory and slows down every subsequent pass. The fix is to run simplifycfg early, which merges redundant blocks and eliminates unnecessary phi nodes. If compile times spike after adding a large switch, check your IR stats with opt -stats and look at the phi node count.

GCC uses multiple IR levels rather than one: GENERIC (close to the AST), GIMPLE (statement-level, SSA), and RTL (register transfer language, close to assembly). Each level enables different optimizations. GIMPLE is where most GCC optimizations run. RTL is where the register allocator, instruction scheduler, and peephole optimizer work. If you see a discrepancy between GIMPLE output and the final assembly, check the RTL dumps with -fdump-rtl-all — the transformation happened somewhere in that layer.

A critical detail that custom compiler authors consistently skip: debug metadata. Every IR instruction should carry a source file, line, and column annotation. In LLVM IR, this is the !dbg metadata attached to every instruction. Without it, gdb shows wrong line numbers, crash dumps point to the wrong function, and production post-mortems take three times as long. I once worked with a machine learning compiler that skipped debug metadata to simplify the IR generation code. Six months later, the team spent two weeks debugging a segfault that would have taken two hours with correct line information. Invest in debug metadata from day one.

IR must also be legalized before code generation: operations that the target hardware does not support natively must be expanded or split. A 64-bit multiply on a 32-bit target becomes a sequence of 32-bit operations. A floating-point comparison on a target with no hardware FPU becomes a software library call. Legalization is a phase that runs inside the code generator, before instruction selection matches patterns, so that the selector only ever sees types and operations the target supports. Bugs here look like wrong results on one target but not another, the classic cross-platform correctness issue.
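The 64-bit multiply case can be written out by hand. The sketch below shows the arithmetic a legalizer has to reproduce when only 32-bit registers are available. The function name is illustrative, and a real backend emits the equivalent machine instructions rather than C++.

```cpp
#include <cassert>
#include <cstdint>

// Expand a 64-bit multiply into operations on 32-bit halves.
std::uint64_t mul64_via_32bit_ops(std::uint64_t a, std::uint64_t b) {
    const std::uint32_t a_lo = static_cast<std::uint32_t>(a);
    const std::uint32_t a_hi = static_cast<std::uint32_t>(a >> 32);
    const std::uint32_t b_lo = static_cast<std::uint32_t>(b);
    const std::uint32_t b_hi = static_cast<std::uint32_t>(b >> 32);

    // 32x32 -> 64 widening multiply of the low halves (umull on 32-bit ARM).
    const std::uint64_t low = static_cast<std::uint64_t>(a_lo) * b_lo;
    // Cross terms only reach the upper 32 bits, so wrapping 32-bit multiplies suffice.
    const std::uint32_t cross = a_lo * b_hi + a_hi * b_lo;
    // a_hi * b_hi would be shifted left by 64 bits and drops out of the result.
    return low + (static_cast<std::uint64_t>(cross) << 32);
}
```

Comparing this against the native operator on a 64-bit host is a cheap way to convince yourself the expansion is right before debugging a backend that gets it wrong.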

Finally, monitor IR density. If you are running a custom optimization pipeline, use opt -stats to track instruction counts per function. A function with more than 100,000 IR instructions usually indicates missed optimization — an inliner that fired too aggressively, an unrolling pass that ran without a size limit, or a frontend that failed to clean up dead globals. I have seen a compiler generate IR 5x larger than necessary because the frontend did not remove constant arrays after lowering. The register allocator spilled heavily as a result. One DCE pass before code generation fixed it.

io/thecodeforge/codegen/ssa_ir_example.ll (LLVM)
; LLVM IR demonstrating SSA form and phi nodes
; Source:
;   int max(int a, int b) { return a > b ? a : b; }
;
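; Starting point: clang -emit-llvm -S -o - max.c, hand-simplified to keep the phi visible.
; With optimization enabled, clang typically folds this diamond into a select
; or an @llvm.smax call before code generation sees it.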

define i32 @max(i32 %a, i32 %b) {
entry:
  ; Compare a > b. Result is i1 (1-bit integer — a boolean).
  %cmp = icmp sgt i32 %a, %b

  ; Conditional branch: true goes to 'if.then', false to 'if.else'
  br i1 %cmp, label %if.then, label %if.else

if.then:
  br label %if.end          ; a > b: jump to merge point

if.else:
  br label %if.end          ; a <= b: jump to merge point

if.end:
  ; PHI node: 'result' is %a if we came from if.then, %b if from if.else
  ; This is SSA's way of expressing: result = (cmp ? a : b)
  ; Each virtual register (%a, %b, %cmp, %result) is assigned exactly once.
  %result = phi i32 [ %a, %if.then ], [ %b, %if.else ]
  ret i32 %result
}

; After code generation (x86-64, clang -O2):
;
; max(int, int):
;   cmp   edi, esi        ; compare a and b
;   mov   eax, esi        ; eax = b (tentative result)
;   cmovg eax, edi        ; if a > b, eax = a
;   ret
;
; The phi node became a conditional move (cmov) — no branch at all.
; The code generator recognized the if/phi pattern and emitted
; a branchless sequence. This is instruction selection working well:
; a branch would stall the pipeline on misprediction;
; cmov executes in 1 cycle with no prediction overhead.
;
; The SSA phi at the IR level gave the selector enough information
; to make this decision. Unstructured code (goto, setjmp) makes
; this pattern harder to detect.
Output
; IR instruction count: 6 (icmp, three br, phi, ret)
; x86-64 instruction count: 4 (cmp + mov + cmov + ret)
;
; The phi node collapsed into a conditional move.
; No branch, no misprediction penalty.
; This is why SSA IR enables better code generation.
IR as a Contract Between Frontend and Backend
  • Frontend: source language → IR. Any language can target the same IR — C, Rust, Swift, Julia all emit LLVM IR.
  • Backend: IR → machine code. Any CPU with a backend can consume the IR — x86, ARM, RISC-V, WebAssembly.
  • Optimizations operate on IR, so every language gets them for free and every target benefits.
  • SSA form — each variable assigned exactly once — makes dataflow analysis correct by construction. This is why LLVM optimizations are so powerful.
  • Debug metadata in IR is not optional. Without !dbg annotations on every instruction, your production crash dumps are useless.
Production Insight
A machine learning compiler skipped debug metadata to simplify IR generation.
Six months later, a segfault took two weeks to diagnose that would have taken two hours with correct source location information.
Rule: instrument your IR with debug metadata from day one — it is cheaper than the post-mortem you will pay without it.
Key Takeaway
IR decouples the frontend from the backend, enabling multi-language multi-target compilers from one shared optimization pipeline.
SSA form makes dataflow analysis correct by construction — do not design a production IR without it.
Debug metadata is not optional. Measure IR density; anything above 100k instructions per function is a signal of missed optimization.
Choosing an IR Strategy for a Custom Compiler
If: Building a DSL that needs to target multiple CPU architectures
Use: Use LLVM IR. You get a mature backend for every major CPU, a full optimization pipeline, and debug info support. The learning curve is steep but the infrastructure payoff is enormous.
If: Creating a JIT compiler for a dynamic language where startup time matters
Use: Consider a two-tier approach: a simple stack-based IR for the interpreter, with SSA-based IR (LLVM or your own) for the hot-path JIT tier. V8 and HotSpot both use this pattern.
If: Embedded systems with tight memory and compile-time constraints
Use: Use a minimal three-address code IR. Avoid SSA overhead: the phi elimination and SSA destruction passes add compile time and memory that embedded toolchains cannot afford.
If: Research project exploring new optimization algorithms
Use: Use LLVM IR through the pass infrastructure. You get analysis passes, dominance trees, loop information, and code generation for free. Focus your effort on the optimization, not the plumbing.

Register Allocation — Where Performance Is Won or Lost

Register allocation is the most performance-critical and most complex part of code generation. The IR has an unlimited supply of virtual registers. The target CPU has a fixed, small set of physical registers — 16 general-purpose on x86-64, 31 on ARM64. The allocator's job is to map the unlimited to the finite, and when there is not enough room, decide which values to spill to memory and when.

Graph coloring is the classic algorithm: treat each virtual register as a graph node, connect any two nodes that are simultaneously live (interference edges), and color the graph with K colors where K equals the number of physical registers. If two nodes share an edge, they cannot share a color — they cannot share a register. If the graph cannot be K-colored, some nodes must be spilled: their value is written to a stack slot and reloaded when needed. Graph coloring in its general form is NP-hard, so compilers use heuristics — Chaitin's algorithm, Briggs' improvement, and LLVM's greedy allocator are all approximations that work well in practice.
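The simplify-and-select structure described above can be sketched in a few dozen lines. This is a minimal Chaitin-style illustration under simplifying assumptions: the spill candidate is just the highest-degree node, whereas production allocators weigh spill costs by loop depth and use counts and add coalescing and live range splitting. The names color_registers and kSpilled are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Result per virtual register: a physical register index in [0, k), or kSpilled.
constexpr int kSpilled = -1;

// interference[v] lists the virtual registers live at the same time as v.
std::vector<int> color_registers(const std::vector<std::set<int>>& interference, int k) {
    const int n = static_cast<int>(interference.size());
    std::vector<std::set<int>> graph = interference;  // working copy that shrinks
    std::vector<bool> removed(n, false);
    std::vector<int> stack;               // removal order, replayed by the select phase
    std::vector<int> color(n, kSpilled);

    // Simplify: repeatedly remove a node with fewer than k neighbours.
    // If none exists, remove the highest-degree node and mark it as spilled.
    for (int step = 0; step < n; ++step) {
        int pick = -1, spill_candidate = -1;
        for (int v = 0; v < n; ++v) {
            if (removed[v]) continue;
            if (static_cast<int>(graph[v].size()) < k) { pick = v; break; }
            if (spill_candidate < 0 || graph[v].size() > graph[spill_candidate].size())
                spill_candidate = v;
        }
        const bool spill = (pick < 0);
        const int v = spill ? spill_candidate : pick;
        removed[v] = true;
        for (int u : graph[v]) graph[u].erase(v);
        graph[v].clear();
        if (!spill) stack.push_back(v);  // spilled nodes keep color kSpilled
    }

    // Select: pop in reverse removal order and take the lowest free color.
    while (!stack.empty()) {
        const int v = stack.back();
        stack.pop_back();
        std::vector<bool> used(k, false);
        for (int u : interference[v])
            if (color[u] != kSpilled) used[color[u]] = true;
        for (int c = 0; c < k; ++c)
            if (!used[c]) { color[v] = c; break; }
    }
    return color;
}
```

With four mutually interfering values and three registers, one value is spilled and the other three receive distinct registers. With four registers, nothing spills.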

Linear scan is the alternative: instead of building an interference graph, sort live ranges by start point and greedily assign registers as they begin and expire. It is faster to compile but produces worse code than graph coloring. V8's Crankshaft used linear scan; most AOT compilers prefer graph coloring variants. HotSpot's C2 JIT uses a graph-coloring allocator because the compilation time cost is worth the runtime payoff for code that runs for hours.
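For comparison, here is a minimal linear scan sketch in the style of the classic Poletto and Sarkar formulation. It assumes each virtual register has a single contiguous live interval and that a spilled interval lives on the stack for its whole lifetime; real JIT allocators refine both assumptions. The Interval type and linear_scan function are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Interval {
    int vreg;   // virtual register id
    int start;  // first instruction index where the value is live
    int end;    // last instruction index where the value is live
    int preg;   // assigned physical register, or -1 if spilled
};

void linear_scan(std::vector<Interval>& intervals, int num_regs) {
    std::sort(intervals.begin(), intervals.end(),
              [](const Interval& a, const Interval& b) { return a.start < b.start; });
    std::vector<int> free_regs;
    for (int r = num_regs - 1; r >= 0; --r) free_regs.push_back(r);  // pop_back yields r0 first
    std::vector<Interval*> active;  // intervals currently holding a register

    for (Interval& cur : intervals) {
        cur.preg = -1;
        // Expire intervals that ended before cur starts; their registers become free.
        for (auto it = active.begin(); it != active.end();) {
            if ((*it)->end < cur.start) {
                free_regs.push_back((*it)->preg);
                it = active.erase(it);
            } else {
                ++it;
            }
        }
        if (!free_regs.empty()) {
            cur.preg = free_regs.back();
            free_regs.pop_back();
            active.push_back(&cur);
            continue;
        }
        // No register free: spill whichever candidate ends furthest in the future.
        auto longest = std::max_element(
            active.begin(), active.end(),
            [](const Interval* a, const Interval* b) { return a->end < b->end; });
        if (longest != active.end() && (*longest)->end > cur.end) {
            cur.preg = (*longest)->preg;  // take over the register
            (*longest)->preg = -1;        // the longer interval goes to the stack
            *longest = &cur;
        }
        // Otherwise cur itself stays spilled (preg == -1).
    }
}
```

The whole allocation is one sort plus one pass over the intervals, which is why the approach suits compilers where compile time is part of user-visible latency.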

PBQP (Partitioned Boolean Quadratic Programming) formulates allocation as a combinatorial optimization problem and can produce near-optimal results, which makes it attractive for targets with irregular register constraints. LLVM has shipped a PBQP allocator as an opt-in alternative (selected with -regalloc=pbqp), while the greedy allocator has remained the default for optimized builds: it has improved enough that PBQP's compile-time overhead rarely justifies the marginal code quality gain. Understanding PBQP is still valuable for reasoning about what optimal allocation looks like and why the greedy allocator sometimes falls short.

The impact of spilling is not linear. A spilled loop counter generates two memory operations per iteration — one load, one store — replacing what was a single register increment. On a memory-bandwidth-limited workload, that can cause measured slowdowns of 5x to 10x on the hot path. I debugged a performance regression in a high-frequency trading system where a compiler upgrade changed the greedy allocator's split heuristics and caused a loop counter to spill. The counter was updated every iteration. The throughput dropped 40%. The fix was not to change the algorithm — it was to split the loop into a setup phase with high register pressure and a tight computation phase with low pressure. That restructuring dropped the live variable count across the iteration boundary from 14 to 9, which was enough for the allocator to keep the counter in a register.

A rule of thumb that holds on x86-64: if your inner loop has more than 12 integer or pointer values live simultaneously, you will likely see spills. Use -Rpass-missed=regalloc (LLVM) to confirm; the allocator reports how many spills and reloads it generated in each loop and function, tagged with source locations. On GCC, the -fdump-rtl-ira and -fdump-rtl-reload dumps carry the equivalent information. That is your starting point for restructuring.

Rematerialization is the allocator's alternative to spilling: instead of storing a value to the stack and reloading it, recompute it from constants or from values that are already in registers. Constants, loop invariants, and values derived only from constants are all candidates. LLVM's greedy allocator performs rematerialization automatically for many patterns. When the regalloc remarks show a spill and the spilled value looks cheap to recompute, the allocator may be missing a rematerialization opportunity, which is worth noting in a bug report against the compiler.

Do not underestimate the interaction between inlining and register pressure. Inlining a function that has 8 live variables into a caller that already has 10 can push the combined live count above the spill threshold. If you see a performance regression after enabling -finline-functions, check spill counts before and after with -Rpass-missed=regalloc. Sometimes the right answer is to not inline a function even if it is small, because the register pressure cost exceeds the call overhead savings.

One more: floating-point and SIMD register allocation are separate from integer allocation on x86-64. There are 16 XMM registers with SSE and AVX (32 with AVX-512) and the legacy x87 stack. If you mix x87 and SSE code paths, which happens when 32-bit code defaults to x87 or when long double is involved, the allocator manages two distinct register files and values must pass through memory at the boundary. On 32-bit x86, force SSE with -mfpmath=sse to unify the model; on x86-64, SSE is already the default for float and double. I have seen a 2x slowdown in a physics simulation that traced to x87/SSE boundary spills. One compiler flag fixed it.

io/thecodeforge/codegen/spill_demo.cpp (C++)
// Demonstrating how register pressure relates to spills.
// Compile both versions with: clang++ -O2 -Rpass-missed=regalloc -S
// and compare the generated assembly.

// ── Version 1: More values live at once ──────────────────────────────────────
// Integer side: out, a, b, c, d, i, n. Float side: t0..t5 in XMM registers.
// A kernel this small usually still fits in registers on x86-64; treat it as
// the shape of the problem. Real spills appear when many more values (extra
// accumulators, unrolled or inlined bodies) are live across the loop.
void high_pressure(float* __restrict__ out, const float* a, const float* b,
                   const float* c, const float* d, int n) {
    for (int i = 0; i < n; ++i) {
        // i, out, a, b, c, d, n are all live here — plus temporaries.
        float t0 = a[i] * b[i];
        float t1 = c[i] + d[i];
        float t2 = t0 - t1;
        float t3 = t2 * t2;
        float t4 = t3 + a[i];   // a[i] loaded again — or was it spilled?
        float t5 = t4 * c[i];   // c[i] loaded again — or was it spilled?
        out[i] = t5;
        // Live at end of iteration body: i, out, a, b, c, d, n
        // Plus whichever of t0..t5 the allocator kept live for next iteration
    }
}

// ── Version 2: Reduced register pressure — no spills ─────────────────────────
// Split the computation into two passes.
// Each pass has fewer live variables; the allocator keeps everything in registers.
void low_pressure(float* __restrict__ out, const float* a, const float* b,
                  const float* c, const float* d, int n) {
    // Pass 1: compute intermediate — only a, b, c, d, out, tmp live at once
    for (int i = 0; i < n; ++i) {
        float t = a[i] * b[i] - (c[i] + d[i]);
        out[i] = t * t;   // store partial result to out[]
    }
    // Pass 2: finish — only a, c, out, n, i live
    for (int i = 0; i < n; ++i) {
        out[i] = (out[i] + a[i]) * c[i];
    }
}

// ── How to measure the difference ────────────────────────────────────────────
// clang++ -O2 -Rpass-missed=regalloc -S -o high.s spill_demo.cpp
// grep -c -i 'spill\|reload' high.s   ← clang tags spill code with Spill/Reload comments
//
// For runtime comparison:
// perf stat -e cycles,instructions,cache-misses ./benchmark
//
// Illustrative figures for a spilling kernel of this shape on x86-64 with n=10^7
// (measure on your own hardware and compiler):
//   high_pressure: ~45ms, ~8 spills in the loop body
//   low_pressure:  ~28ms, 0 spills, auto-vectorized by the compiler
Output
// clang++ -O2 -Rpass-missed=regalloc -S, illustrative excerpt for a spilling build:
//
// The regalloc remarks report how many spills and reloads were generated in each
// loop and function, tagged with source locations (wording varies by LLVM version).
//
// Generated assembly for the loop body (simplified; out=rdi, a=rsi, b=rdx, c=rcx, d=r8):
// movss xmm0, [rsi + rax*4] ; load a[i]
// movss [rsp+8], xmm0 ; SPILL a[i] to stack when the allocator runs out
// movss xmm1, [rdx + rax*4] ; load b[i]
// mulss xmm0, xmm1
// movss xmm1, [rcx + rax*4] ; load c[i]
// movss xmm2, [r8 + rax*4] ; load d[i]
// addss xmm1, xmm2
// subss xmm0, xmm1
// mulss xmm0, xmm0
// movss xmm1, [rsp+8] ; RELOAD a[i] from stack
// addss xmm0, xmm1
//
// low_pressure loop body (simplified, no spills):
// movss xmm0, [rsi + rax*4] ; load a[i]
// movss xmm1, [rdx + rax*4] ; load b[i]
// mulss xmm0, xmm1
// movss xmm1, [rcx + rax*4] ; load c[i]
// movss xmm2, [r8 + rax*4] ; load d[i]
// addss xmm1, xmm2
// subss xmm0, xmm1
// mulss xmm0, xmm0
// movss [rdi + rax*4], xmm0 ; store result, no reload needed
A Spilled Loop Counter Can Cost More Than You Expect
A loop counter that spills to the stack generates a load and a store every iteration instead of a register increment. On a tight loop running 10 million iterations, that adds 20 million memory operations. Even when the stack slot stays in L1 cache, each reload waits on the previous iteration's store, so store-to-load forwarding latency sits on the loop-carried dependency chain. In memory-bandwidth-limited workloads, a single spilled counter has caused measured slowdowns of 5x to 10x. This is not a theoretical concern: it shows up in perf annotate as the loop's highest-latency instruction being a stack load. When you see that, the fix is code restructuring to reduce live variable count, not algorithmic change.
Production Insight
A compiler upgrade changed LLVM's greedy allocator split heuristics. A loop counter in a high-frequency trading hot path spilled to the stack. Throughput dropped 40%.
The fix was splitting the loop into two phases — setup (high pressure) and compute (low pressure) — reducing live variables at the iteration boundary from 14 to 9.
No algorithm change. No flag change. Just code structure that gave the allocator enough room to do its job.
Key Takeaway
Register allocation decides whether your hot loop runs at register speed or memory speed.
If your inner loop has more than 12 live variables on x86-64, expect spills; use -Rpass-missed=regalloc to confirm.
The fix is almost always code restructuring to reduce live range overlap, not compiler flags.
Force -mfpmath=sse to avoid x87/SSE boundary spills in mixed floating-point code (mainly relevant to 32-bit x86; x86-64 already defaults to SSE).
Choosing a Register Allocator Strategy
If: JIT compiler where compilation latency affects user-visible warmup (e.g., V8, early HotSpot tiers)
Use: Use linear scan. It is suboptimal but compiles in near-linear time. Code quality improves when the JIT recompiles hot functions at a higher tier with graph coloring.
If: AOT compilation for scientific or numerical computing with large loops
Use: Use graph coloring with live range splitting (LLVM greedy allocator). The compile-time investment pays off at run time, especially for loops that run billions of iterations.
If: GPU compute kernels (CUDA, OpenCL, Metal)
Use: Use occupancy-aware register allocation. Each thread's registers come out of a register file shared by all resident threads on the multiprocessor, so higher per-thread register use lowers occupancy, and spills go to slow local memory for every thread in the warp at once.
If: Real-time or safety-critical system requiring deterministic compilation
Use: Use a simple local allocator with predictable, bounded spill decisions. Avoid heuristic-heavy allocators whose decisions can change between compiler versions.
If: Debugging a performance regression suspected to be caused by increased spilling
Use: Compare -O1 vs -O2 assembly output for the hot function. The allocator is the same at both levels, but -O2 enables more inlining, unrolling, and vectorization, all of which raise register pressure. If spills disappear at -O1, confirm with -Rpass-missed=regalloc at -O2 and restructure the loop to reduce live variable count at the iteration boundary.

Instruction Selection — Picking the Right Instruction for the Job

Instruction selection is the phase that maps IR operations to actual CPU instructions. A single IR add operation could become an x86 add, a lea, a fused multiply-add, or a shift-and-add depending on the operands, the surrounding context, and the target micro-architecture. The selector's job is to find the cheapest instruction sequence that correctly implements each IR operation.

Tree-based pattern matching is the standard approach: represent a basic block's IR as a tree of operations, then match subtrees against instruction patterns defined in the target description. Each pattern has an associated cost. Dynamic programming finds the minimum-cost cover of the entire tree — this is where the analogy to tiling a floor with shaped tiles comes from. LLVM formalizes this in TableGen: you write instruction definitions and patterns in .td files, and TableGen generates C++ code for the selector. Adding a new instruction to an LLVM backend is editing a configuration file, not writing a thousand lines of selector code.
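The tiling idea can be shown on the b + c * 2 example from earlier. The sketch below is a toy bottom-up dynamic program over an expression tree with three tiles: add, mul, and an lea-style tile that covers an add whose right operand is a multiply by 2, 4, or 8. The costs are made-up round numbers for illustration, not values from any real cost model, and the Node type and helper names are hypothetical. LLVM generates its matchers from TableGen rather than hand-written code like this.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <utility>

struct Node;
using NodePtr = std::unique_ptr<Node>;

// Tiny expression tree: leaves are registers or constants, inner nodes are Add or Mul.
struct Node {
    enum Kind { Reg, Const, Add, Mul };
    Kind kind;
    long value;        // only meaningful for Const
    NodePtr lhs, rhs;  // only set for Add and Mul
};

NodePtr leaf(Node::Kind kind, long value = 0) {
    return NodePtr(new Node{kind, value, nullptr, nullptr});
}

NodePtr binary(Node::Kind kind, NodePtr lhs, NodePtr rhs) {
    return NodePtr(new Node{kind, 0, std::move(lhs), std::move(rhs)});
}

bool is_scale(const Node& n) {
    return n.kind == Node::Const && (n.value == 2 || n.value == 4 || n.value == 8);
}

// Minimum cost to compute the subtree into a register.
// Assumed costs: Reg 0, Const 1 (mov immediate), add 1, mul 3, lea 1.
int min_cost(const Node& n) {
    switch (n.kind) {
        case Node::Reg:
            return 0;
        case Node::Const:
            return 1;
        case Node::Mul:
            return 3 + min_cost(*n.lhs) + min_cost(*n.rhs);
        case Node::Add: {
            // Plain add tile: covers this node only.
            int best = 1 + min_cost(*n.lhs) + min_cost(*n.rhs);
            // lea tile: Add(x, Mul(y, scale)) covers three IR nodes with one instruction.
            const Node& r = *n.rhs;
            if (r.kind == Node::Mul && is_scale(*r.rhs))
                best = std::min(best, 1 + min_cost(*n.lhs) + min_cost(*r.lhs));
            return best;
        }
    }
    return 0;  // not reached: every Kind is handled above
}
```

Under these assumed costs, b + c * 2 tiles at cost 1 through the lea pattern, while b + c * 3 falls back to mul plus add at cost 5. That gap is exactly what steers the selector toward lea.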

LLVM's SelectionDAG generalizes tree matching to a directed acyclic graph, which handles operations that produce multiple results (like a division that yields both quotient and remainder) and operations that have side effects (like stores). SelectionDAG also performs legalization: operations the target cannot execute natively are expanded or split during selection. A 64-bit integer divide on a 32-bit ARM target becomes a software library call; a vector type wider than the hardware supports is split into two narrower operations. If a correctness bug only appears on one target, legalization is the first place to look.

LLVM is gradually moving from SelectionDAG to GlobalISel, a newer instruction selection infrastructure that operates directly on generic Machine IR (MIR) for a whole function instead of building a separate DAG per basic block. On AArch64, GlobalISel has been the default at -O0 for many releases; optimized AArch64 builds and x86 builds have continued to default to SelectionDAG, and GlobalISel coverage on x86 is less complete. The defaults shift between releases, so check the release notes for the LLVM version you ship. If you are building a new LLVM backend today, GlobalISel is a reasonable starting point: the infrastructure is cleaner and the long-term maintenance burden is expected to be lower.

Cost models are where instruction selection gets subtle. On Intel Skylake, simple add executes on ports 0, 1, 5, and 6: four execution ports. Simple lea (one or two address components) executes on ports 1 and 5: two ports. The selector prefers add for simple increments because it has higher port availability and thus better throughput under superscalar execution. Complex lea (three components: base, index, and displacement) executes only on port 1 with higher latency on Skylake, and some other micro-architectures also penalize a scaled index, so avoid it in loops unless you genuinely need the address computation. These details live in Agner Fog's instruction tables and the vendor optimization manuals; the compiler's cost model approximates them, but the approximation is sometimes wrong for your specific CPU.

The inc instruction is the classic example of instruction selection getting burned by micro-architecture details. inc updates every arithmetic flag except the carry flag, so the flags it produces must be merged with the carry flag left by an earlier instruction. On the Pentium 4 and some early Core processors, this partial flag update creates a false dependency on that earlier instruction and stalls the pipeline. add $1, reg writes all the flags and does not have this dependency. The cost model in the compiler for those micro-architectures correctly prefers add over inc. If you are targeting a CPU that predates reliable cost model support, test with -mtune set explicitly to your CPU family.

Constant materialization is another instruction selection challenge. On x86, a 32-bit immediate fits directly in the instruction encoding. A 64-bit immediate requires movabs — a longer encoding. If a large constant appears many times in a hot loop, the selector may choose to load it from a literal pool in memory rather than rematerialize it each time. Whether that is correct depends on cache pressure and the number of uses. If you see a hot path loading a constant from memory in a tight loop, move the constant to a local variable assigned before the loop — most compilers will then rematerialize it in a register inside the loop.

Intrinsics are the escape hatch when the selector makes the wrong choice. If profiling confirms that the compiler is not emitting a FMA instruction for a multiply-accumulate pattern, use _mm_fmadd_ps explicitly rather than hoping the auto-vectorizer will find it. Intrinsics sacrifice portability but give you precise control over the emitted instruction. The rule is: measure first, use intrinsics only when profiling confirms the generic selection is provably wrong.

io/thecodeforge/codegen/instruction_select.c (C)
// Instruction selection in action: same semantic operation, different cost
// Compile with: clang -O2 -S -fverbose-asm -o - select.c
// and observe which instruction the selector chose for each case.

#include <immintrin.h>   // for intrinsics
#include <stdint.h>

// ── Case 1: Simple multiply by 2 ─────────────────────────────────────────────
// The selector should emit 'lea eax, [rdi + rdi]' or 'add eax, eax'
// NOT 'imul eax, edi, 2' — imul has higher latency (3 cycles vs 1 cycle)
int double_it(int x) {
    return x * 2;
}
// Expected output:
//   lea  eax, [rdi + rdi]    ; 1 cycle, runs on ports 1 and 5
// or:
//   mov  eax, edi
//   add  eax, eax            ; 1 cycle, runs on ports 0, 1, 5, 6 (more throughput)
// imul would be correct but slower: multiplying by a power of 2 is cheaper as a shift or add

// ── Case 2: Multiply by a non-power-of-2 constant ────────────────────────────
// The selector must decide: imul or a sequence of adds/shifts?
// For x * 7: 7 = 8 - 1, so x * 7 = (x << 3) - x
//   Option A: lea eax, [rdi*8]; sub eax, edi   ; two 1-cycle instructions, 2-cycle chain
//   Option B: imul eax, edi, 7                 ; 3-cycle latency but single instruction
// x86 addressing modes cannot subtract a register, so the shift-and-subtract
// form always needs two instructions.
// The cost model chooses based on latency, code size, and port pressure.
int multiply_by_7(int x) {
    return x * 7;
}
// At -O2, recent GCC and Clang typically emit the lea + sub pair for x * 7.
// Constants that need a longer shift/add sequence, and -Os builds, usually get imul.

// ── Case 3: Forced FMA via intrinsic ─────────────────────────────────────────
// Auto-vectorization might not emit FMA even when mathematically equivalent.
// Use an intrinsic to guarantee the instruction when profiling proves it matters.
__m128 fused_multiply_add(__m128 a, __m128 b, __m128 c) {
    // Without intrinsic: compiler might emit mulps + addps (2 instructions, 2 latencies)
    // With intrinsic: guaranteed vfmadd213ps (1 instruction, lower total latency)
    return _mm_fmadd_ps(a, b, c);   // a*b + c in one instruction
}
// Expected output:
//   vfmadd213ps xmm0, xmm1, xmm2
// Requires: -mavx2 -mfma or -march=haswell

// ── Case 4: The inc trap — false flag dependency ──────────────────────────────
// On Pentium 4 / early Core: 'inc' writes every arithmetic flag except carry,
// so its flag result has a false dependency on the previous flag-writing instruction.
// Modern compilers with accurate cost models prefer 'add $1' to avoid this.
void count_up(long* counter) {
    (*counter)++;   // clang -O2 -mtune=generic emits 'addq $1, (%rdi)'
                    // NOT 'incq (%rdi)' — even though inc is 1 byte shorter
                    // The cost model correctly avoids the false dependency.
}
// If you are on an old toolchain that emits 'inc' and see pipeline stalls,
// add -mtune=core2 or -mtune=generic to override the cost model.

// ── How to inspect selector decisions ────────────────────────────────────────
// clang -O2 -S -fverbose-asm -o select.s select.c
// cat select.s    — each instruction annotated with its source location
//
// For deeper inspection (hidden LLVM options; availability varies by version):
// clang -O2 -c -mllvm -print-isel-input select.c   : LLVM IR handed to instruction selection
// clang -O2 -c -mllvm -print-after-all select.c    : IR and MIR after each pass, including selection
Output
// clang -O2 -S output for double_it (x86-64, Skylake):
// double_it:
// lea eax, [rdi + rdi] ; selector chose lea over imul — correct
// ret
//
// multiply_by_7:
// lea eax, [8*rdi] ; x * 8
// sub eax, edi ; minus x gives x * 7: two 1-cycle instructions
// ret ; some compilers, tunings, or -Os builds emit 'imul eax, edi, 7' instead
//
// fused_multiply_add:
// vfmadd213ps xmm0, xmm1, xmm2 ; guaranteed by intrinsic — exactly one instruction
// ret
//
// count_up:
// addq $1, (%rdi) ; 'add' chosen over 'inc' by cost model
// ret ; avoids false carry-flag dependency on Pentium 4 / early Core
Instruction Selection as Minimum-Cost Tiling
  • Each CPU instruction is a tile: it covers a pattern of IR operations (add, load, multiply) and has an associated cost (latency, throughput, code size).
  • Optimal tiling finds the minimum-cost cover for the entire IR tree — dynamic programming solves this in O(n) for trees.
  • Modern selection uses a DAG (not just a tree) to handle shared subexpressions and multi-output operations like divmod.
  • Cost models are micro-architecture-specific. What is cheapest on Skylake may be suboptimal on Zen 4. Use -mtune=native to match the model to your hardware.
  • When the selector gets it wrong (and it will), intrinsics are your override — but only use them after profiling confirms the generic selection is provably suboptimal.
Production Insight
A hot loop ran 40% slower on a Pentium 4-era CPU after a compiler upgrade. The new selector started emitting 'inc' for loop counters instead of 'add $1'. On that micro-architecture, 'inc' creates a false dependency on the carry flag, stalling the pipeline.
The fix: -mtune=core2 corrected the cost model and switched the selector back to 'add'.
Rule: when you see unexpected performance regression after a compiler upgrade, compare the generated assembly instruction by instruction before assuming anything about your algorithm.
Key Takeaway
Instruction selection is a minimum-cost tiling problem over the IR tree — the selector picks instructions, not you.
Cost models are CPU-specific approximations. Use -mtune=native to match the model to your hardware.
Use intrinsics only when profiling proves the generic selection is wrong.
Diff the assembly output before and after a compiler upgrade — a one-instruction change in a hot loop is significant.
When to Override Instruction Selection
If: Hot loop shows pipeline stalls and the loop counter uses 'inc' on an older Intel CPU
Use: -mtune=native or -mtune=generic. This corrects the cost model and causes the selector to prefer 'add $1' over 'inc', avoiding the false carry-flag dependency.
If: You need to guarantee a specific SIMD instruction (FMA, gather, scatter) that auto-vectorization is not emitting
Use: Compiler intrinsics (_mm_fmadd_ps, _mm256_i32gather_ps). Do not rely on the auto-vectorizer for correctness-critical or performance-critical SIMD paths; verify with objdump.
If: Cross-compiling for multiple CPU generations with different capability sets
Use: Do not use -march=native. Compile with -march=<minimum_supported_generation> and profile on each target class. Use CPU dispatch (__attribute__((target))) only for functions where the performance difference is measured and significant.
If: Binary size is constrained (firmware, bootloader, WASM module)
Use: -Os to bias the selector toward smaller encodings. Be aware that -Os can increase latency, so measure runtime on the target, not just binary size.
If: Debugging a performance regression after a compiler upgrade
Use: -fverbose-asm on both old and new compiler output, then diff the .s files for the hot function. Look for changes in instruction choice (lea vs add, imul vs shift sequence, inc vs add). A one-instruction change in a loop body can produce a multi-percent runtime difference.
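The CPU dispatch row can be sketched as follows. This is a minimal illustration, assuming GCC or Clang on x86, where __attribute__((target("avx2"))), __builtin_cpu_init, and __builtin_cpu_supports are available; the function names are hypothetical, and on other compilers or architectures the preprocessor guard simply falls back to the baseline version.

```c
#include <stddef.h>

/* Baseline version: built for the minimum supported instruction set. */
static long sum_baseline(const int *v, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
/* Same source, but the compiler may use AVX2 inside this one function. */
__attribute__((target("avx2")))
static long sum_avx2(const int *v, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum;
}
#endif

/* Picks an implementation at run time based on what the CPU reports. */
long sum_dispatch(const int *v, size_t n) {
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2(v, n);
#endif
    return sum_baseline(v, n);
}
```

The rest of the binary stays at the baseline -march, so a machine without AVX2 never executes an AVX2 instruction. Whether the AVX2 variant is actually faster still has to be measured per function.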

Peephole Optimization — The Final Polish

Peephole optimization is the last cleanup pass in code generation. It examines a small window of consecutive instructions — typically two to five — and replaces sequences with equivalent but cheaper or shorter alternatives. It catches patterns that earlier passes left behind: a register initialized and never read, two adjacent stack adjustments that could be merged, a conditional jump to the very next instruction that could become a fall-through.
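A two-instruction window is enough to handle two of the patterns just mentioned. The sketch below is a toy, not GCC's or LLVM's pass: the Insn encoding and opcode names are hypothetical, and it assumes a straight-line list where a label id is unique.

```c
#include <stddef.h>

typedef enum { OP_ADD_SP, OP_JMP, OP_LABEL, OP_OTHER } Op;

/* arg: byte delta for OP_ADD_SP, label id for OP_JMP and OP_LABEL. */
typedef struct { Op op; int arg; } Insn;

/* One pass over a two-instruction window; compacts in place, returns new length. */
size_t peephole(Insn *code, size_t n) {
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        /* jmp L; L:  ->  L:   (jump to the very next instruction) */
        if (code[i].op == OP_JMP && i + 1 < n &&
            code[i + 1].op == OP_LABEL && code[i + 1].arg == code[i].arg)
            continue;
        /* add sp, a; add sp, b  ->  add sp, a+b */
        if (code[i].op == OP_ADD_SP && out > 0 && code[out - 1].op == OP_ADD_SP) {
            code[out - 1].arg += code[i].arg;
            continue;
        }
        code[out++] = code[i];
    }
    return out;
}
```

Because out never exceeds i, the in-place compaction never overwrites an instruction that has not been examined yet.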

GCC has two peephole mechanisms: -fpeephole enables the older machine-specific peepholes that are applied as assembly is emitted, and -fpeephole2 enables RTL-to-RTL peepholes that run after register allocation but before the final scheduling pass. LLVM has a PeepholeOptimizer pass that runs early in the machine code pipeline and a separate DeadMachineInstructionElim pass to remove the dead instructions it identifies. Both compilers also run a later scheduling pass that can surface additional peephole opportunities.

The most impactful peephole patterns in practice are copy propagation and dead copy elimination. Copy propagation rewrites uses of a copy destination to use the copy source directly — this is what turns:

```asm
; Before: two instructions, indirect read
mov eax, ebx
mov ecx, eax    ; reads eax, but we could read ebx directly

; After copy propagation: the copy source is used directly
mov eax, ebx
mov ecx, ebx    ; ecx now reads from ebx, so the eax copy is now dead
```

Once the second mov reads from ebx directly, the first mov may become dead — nothing reads eax anymore. Dead copy elimination then removes it entirely:

```asm
; After dead copy elimination: one instruction
mov ecx, ebx
```

These two patterns together are the most common reason peephole produces visible code size reduction. They do not change semantics — they just remove indirection that earlier passes introduced.
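Both passes fit in a short C sketch over a straight-line block. The instruction encoding, the eight-register file, and the live_out bitmask are assumptions made for illustration; a production pass also has to handle calls, memory operands, and control flow.

```c
#include <stddef.h>
#include <string.h>

enum { NREGS = 8 };
typedef enum { OP_MOV, OP_ADD } Op;           /* MOV: dst = src; ADD: dst += src */
typedef struct { Op op; int dst, src; } Insn;

/* Forward pass: replace reads of a copy's destination with the copy's source. */
void copy_propagate(Insn *code, size_t n) {
    int copy_of[NREGS];
    for (int r = 0; r < NREGS; r++) copy_of[r] = r;
    for (size_t i = 0; i < n; i++) {
        code[i].src = copy_of[code[i].src];        /* rewrite the use */
        int d = code[i].dst;
        for (int r = 0; r < NREGS; r++)            /* d is overwritten, so */
            if (copy_of[r] == d) copy_of[r] = r;   /* copies of d are stale */
        copy_of[d] = (code[i].op == OP_MOV) ? code[i].src : d;
    }
}

/* Backward pass: drop instructions whose destination is never read.
   live_out is a bitmask of registers that are read after the block. */
size_t eliminate_dead(Insn *code, size_t n, unsigned live_out) {
    unsigned live = live_out;
    size_t w = n;                                  /* kept instructions pile up at the end */
    for (size_t i = n; i-- > 0; ) {
        Insn in = code[i];
        unsigned dbit = 1u << in.dst;
        if (!(live & dbit)) continue;              /* result unused: dead */
        live &= ~dbit;                             /* dst is defined here */
        live |= 1u << in.src;                      /* src is read */
        if (in.op == OP_ADD) live |= dbit;         /* ADD also reads dst */
        code[--w] = in;
    }
    memmove(code, code + w, (n - w) * sizeof *code);
    return n - w;
}
```

With eax = 0, ebx = 1, ecx = 2 and only ecx live on exit, running copy_propagate and then eliminate_dead on `mov eax, ebx; mov ecx, eax` leaves the single instruction `mov ecx, ebx`.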

Strength reduction is another peephole domain. On targets where shift is cheaper than multiply, the peephole pass may replace mul by a power of two with a shift. On embedded targets without a hardware multiplier, this is a correctness-adjacent optimization — the semantic result is identical but the performance difference is large. The key point is that the cost model must be accurate for the target; a strength reduction that helps on ARM Cortex-M0 may be neutral on x86.
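As a rough sketch of the multiply-to-shift rewrite, the toy pass below turns `mul reg, 2^k` into `shl reg, k` and leaves every other multiply alone. The Insn layout and opcode names are hypothetical, and the eval helper exists only to check that the rewrite preserves the result.

```c
#include <stddef.h>

typedef enum { OP_MUL_IMM, OP_SHL_IMM } Op;   /* reg *= imm, reg <<= imm */
typedef struct { Op op; int reg; unsigned imm; } Insn;

static int is_pow2(unsigned v) { return v != 0 && (v & (v - 1)) == 0; }

static unsigned log2_exact(unsigned v) {      /* v must be a power of two */
    unsigned shift = 0;
    while (v >>= 1) shift++;
    return shift;
}

/* Rewrites 'mul reg, 2^k' as 'shl reg, k'; returns how many were rewritten. */
size_t reduce_strength(Insn *code, size_t n) {
    size_t rewritten = 0;
    for (size_t i = 0; i < n; i++) {
        if (code[i].op == OP_MUL_IMM && is_pow2(code[i].imm)) {
            code[i].op = OP_SHL_IMM;
            code[i].imm = log2_exact(code[i].imm);
            rewritten++;
        }
    }
    return rewritten;
}

/* Reference semantics, used to check that a rewrite preserves the result. */
unsigned eval(Insn in, unsigned x) {
    return in.op == OP_MUL_IMM ? x * in.imm : x << in.imm;
}
```

A real pass would make the rewrite conditional on the target's cost model rather than applying it unconditionally.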

Performance interacts with peephole in non-obvious ways. Removing an instruction saves code size and reduces decode pressure, but it can also change the dependency chain visible to the out-of-order engine. I once saw a case where removing a redundant mov eliminated a dependency break that the CPU's rename unit was using to parallelise two chains. The net result: one fewer instruction, but two cycles slower per iteration. The lesson is to always measure with perf stat after enabling or disabling peephole passes — saved instructions do not automatically mean saved cycles.
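The effect of dependency chains is easiest to see at the source level. The sketch below is not the mov case described above; it is a simpler, hypothetical pair of functions that compute the same sum with one serial chain and with two independent chains. Whether the second is faster depends on the compiler (which may unroll or vectorize either one) and on the CPU, so treat it as something to measure with perf stat, not as a guaranteed win.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every add depends on the previous one (a single serial chain). */
uint64_t sum_one_chain(const uint32_t *v, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s;
}

/* Two accumulators: two independent chains the out-of-order core can overlap. */
uint64_t sum_two_chains(const uint32_t *v, size_t n) {
    uint64_t s0 = 0, s1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < n) s0 += v[i];    /* odd element count: one value left over */
    return s0 + s1;
}
```

Both functions return the same value for any input, which makes them a convenient pair for comparing cycles and instructions per iteration.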

If you are debugging a regression that appears at -O2 but not at -O1, test with -fno-peephole2 (GCC) to isolate the second peephole pass. In LLVM/Clang, there is no single stable flag to disable the peephole pass in isolation. The practical approach is to compare -O1 and -O2 assembly output for the hot function and look for the specific transformation that changed. Use -fdump-rtl-all (GCC) or -mllvm -print-after-all (Clang; older releases also accepted -mllvm -print-machineinstrs) to see the instruction stream at each stage of the machine code pipeline.

When writing a custom backend, implement peephole optimizations incrementally and measure each one. Start with the handful of patterns that appear most frequently in your generated code: dead copy elimination, redundant compare elimination, and branch-to-next-instruction removal are almost always in the top five. Each pattern is cheap to implement; the discipline is measuring before adding the next one. An unmeasured peephole pass is a maintenance liability.

io/thecodeforge/codegen/peephole_demo.c (C)
// Peephole optimization: from two instructions to one.
// Compile with: gcc -O2 -fverbose-asm -S -o - peephole_demo.c
// and observe the eliminated copies.

// ── Example 1: Copy propagation followed by dead copy elimination ─────────────
// Source pattern the compiler generates internally during register assignment:
//   mov eax, ebx      ; eax = ebx (copy introduced by register allocator)
//   mov ecx, eax      ; ecx = eax (uses the copy)
//
// Peephole step 1 — copy propagation:
//   mov eax, ebx
//   mov ecx, ebx      ; rewrote 'eax' use to 'ebx' — copy source used directly
//
// Peephole step 2 — dead copy elimination:
//   mov ecx, ebx      ; eax was never read after step 1 — its definition is dead
//
// Net result: 2 instructions → 1 instruction. Same semantics.

int copy_prop_example(int b) {
    int a = b;      // copy introduced at source level
    int c = a;      // use of the copy
    return c;       // only c is returned
    // The optimizer sees the chain and returns b directly
    // Generated: mov eax, edi; ret  (the mov also disappears if the call is inlined)
}

// ── Example 2: Redundant comparison elimination ───────────────────────────────
// Pattern:
//   test eax, eax     ; sets ZF based on eax
//   cmp  eax, 0       ; ALSO sets ZF based on eax — redundant
//   je   label
// After peephole:
//   test eax, eax
//   je   label        ; cmp eliminated — test already set ZF
int redundant_cmp(int x) {
    if (x == 0) return 1;   // compiler generates test+cmp — peephole removes cmp
    return 0;
}

// ── Example 3: Branch to next instruction elimination ────────────────────────
// The code generator sometimes emits:
//   jmp  .L1          ; unconditional jump
// .L1:                ; to the very next instruction
//   mov eax, 1
// After peephole: the jmp is removed — execution falls through naturally.
int branch_to_next(int x) {
    int result;
    if (x > 0) {
        result = 1;
    } else {
        result = 1;   // same result either way — compiler may generate a jmp to merge point
    }
    return result;
    // After optimization: just 'mov eax, 1; ret'
}

// ── How to observe peephole effect ───────────────────────────────────────────
// Compare with and without -fno-peephole2:
//   gcc -O2 -S -o with_peephole.s peephole_demo.c
//   gcc -O2 -fno-peephole2 -S -o without_peephole.s peephole_demo.c
//   diff with_peephole.s without_peephole.s
//
// Instruction count difference shows the peephole's contribution.
// Runtime difference depends on the workload: measure it with perf stat rather than assuming a figure.
Output
// gcc -O2 output for copy_prop_example:
// copy_prop_example:
// mov eax, edi ; b comes in as edi (System V AMD64 ABI)
// ret ; peephole eliminated all intermediate copies
// ; the source-level copies became 1 mov
// redundant_cmp:
// test edi, edi ; check if x == 0
// sete al ; al = 1 if zero, 0 otherwise
// movzx eax, al
// ret ; cmp eax,0 was eliminated — test already set the flag
// branch_to_next:
// mov eax, 1 ; peephole saw both branches produce 1
// ret ; jmp to merge point eliminated entirely
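// Comparing gcc -O2 with and without -fno-peephole2 on this file:
// the source-level copies in copy_prop_example are normally folded by earlier
// passes, so its instruction count may be identical in both builds.
// Diff the two .s files to see which instructions peephole2 actually changed;
// larger functions with register-allocator copies show more difference.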
Fewer Instructions Does Not Always Mean Fewer Cycles
Peephole optimizations reduce instruction count, but instruction count and cycle count are not the same thing. Removing a redundant mov can eliminate a dependency break that the CPU's register renaming unit was using to run two instruction chains in parallel. The net result: one fewer instruction decoded, but the pipeline stalls where it previously did not. Always measure with perf stat -e cycles,instructions after a peephole change — if the IPC (instructions per cycle) drops as instruction count drops, the peephole removed something the out-of-order engine was exploiting.
Production Insight
A peephole pass eliminated a redundant mov in a hot decode loop. Instruction count dropped by 4%. Runtime increased by 2%.
The removed mov was providing a register rename break between two instruction chains that the out-of-order engine was running in parallel. Without it, the chains serialized.
The fix: disable -fpeephole2 for that compilation unit. Net outcome: 4% more instructions, 2% faster execution.
Rule: always benchmark peephole changes with perf stat. Instruction count is a proxy, not a truth.
Key Takeaway
Peephole optimization removes the redundant copies and dead branches that earlier passes leave behind — it is the final pass, and it matters.
Copy propagation followed by dead copy elimination is the highest-impact pattern: two chained copies become one.
Always measure runtime, not just instruction count — a removed instruction can cost cycles if it was providing a dependency break the out-of-order engine was exploiting.
When to Investigate Peephole Settings
If: Performance regression appears at -O2 but not at -O1
Use: Test with -O2 -fno-peephole2 (GCC). If the regression disappears, a peephole pattern is the cause. Use -fdump-rtl-peephole2 to see exactly which transformation fired. Report upstream with a minimal reproducer.
If: Binary size is larger than expected after optimization
Use: Peephole normally reduces code size. If size increased, a different pass (inlining, unrolling) dominated and peephole did not compensate. Compare section sizes with size binary and look for .text growth using objdump -h.
If: Building a custom backend and want to add peephole patterns
Use: Start with the five highest-frequency patterns from your assembly output: dead copy elimination, redundant compare removal, branch-to-next elimination, redundant zero-extension, and move coalescing. Measure each pattern's contribution before adding the next.
If: Embedded target with tight I-cache constraints
Use: Peephole is one of the highest-value optimizations for -Os targets. Ensure both -fpeephole and -fpeephole2 are active (they are by default at -Os). Verify with size binary that .text is shrinking as expected.
● Production incident · POST-MORTEM · severity: high

The Spilled Register That Corrupted a Medical Device

Symptom
The pump's control loop would occasionally write incorrect values to actuator registers, causing erratic dosage delivery. No reproducible test case existed in the lab — the failure only appeared on hardware under load.
Assumption
The team assumed a hardware bug or cosmic bit flip. The compiler was considered trustworthy. Three weeks were spent on hardware diagnostics before anyone looked at the generated assembly.
Root cause
The compiler's register allocator spilled a live virtual register to the stack during a function call, but the callee's stack frame overlapped with the spill slot due to an incorrect frame size calculation in the code generator. The frame-pointer was omitted as part of the default optimization level, which meant the corrupt spill had no fixed reference point to detect the overlap at runtime.
Fix
Added -fno-omit-frame-pointer to the firmware build flags, which forced the code generator to use a stable frame reference and recalculate spill slot offsets correctly. Verified with -fverbose-asm that the spill slots no longer overlapped the callee frame. Recompiled the firmware; the corruption never returned. Added a CI step that compares spill slot assignments between compiler versions.
Key lesson
  • A spilled register can silently corrupt memory if the stack frame layout is off by a single byte — and the symptoms look exactly like a hardware fault.
  • Use -fverbose-asm with -g to map assembly back to source lines the moment you suspect a codegen issue. It is the fastest path from symptom to root cause.
  • Never trust a compiler upgrade in a safety-critical system without running differential assembly comparison on every hot function. A spill location change is a correctness change.
  • Run Csmith differential fuzzing before deploying a new compiler version to embedded targets — it catches miscompilations that integration tests miss because they test program behaviour, not generated code correctness.
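A CI check for overlapping stack objects can be very small once the frame layout has been extracted. The sketch below assumes a hypothetical Slot record (name, offset from the frame base, size in bytes) that a script has already filled in from the compiler's assembly or debug output; how that extraction is done is outside this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record: one stack object as a byte range relative to the frame base. */
typedef struct { const char *name; long offset; long size; } Slot;

/* Two ranges [offset, offset + size) overlap iff each starts before the other ends. */
int slots_overlap(Slot a, Slot b) {
    return a.offset < b.offset + b.size && b.offset < a.offset + a.size;
}

/* Prints every overlapping pair and returns how many were found. */
int count_overlaps(const Slot *slots, size_t n) {
    int found = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (slots_overlap(slots[i], slots[j])) {
                printf("overlap: %s and %s\n", slots[i].name, slots[j].name);
                found++;
            }
    return found;
}
```

A spill slot at offset -16 with size 8 and a neighbouring area at offset -9 with size 8 share exactly one byte, and the check reports it; shifting the second area to -8 makes the two ranges adjacent and the check passes.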
Production debug guide
Symptom → Action guide for diagnosing codegen issues in production · 7 entries
Symptom · 01
Program crashes only in release mode (optimized build)
Fix
Enable -g with optimizations to get debug symbols, then inspect the assembly with objdump -d. Look for uninitialized register reads or instructions that reference unexpected memory locations. Rebuild with -Og (optimize for debugging) to narrow down which optimizations trigger the crash: -Og enables the -O1 optimizations that do not interfere with debugging, so a crash that disappears at -Og points at one of the more aggressive passes.
Symptom · 02
Performance regression after changing a loop's compilation unit
Fix
Use perf record to identify hot instructions, then use -Rpass-missed=regalloc (Clang) or -fopt-info-all (GCC) to see whether the register allocator spilled a critical loop variable. Compare -O1 vs -O2 assembly output for the function. If the spill appears only at -O2, the extra inlining, unrolling, or scheduling at that level has raised register pressure. Restructure the loop to reduce live variable count across iterations.
Symptom · 03
Incorrect results in floating-point calculations
Fix
Check whether the compiler is generating x87 instructions instead of SSE. x87 uses 80-bit internal precision, which causes unexpected rounding compared to IEEE 754 64-bit. Force SSE2 with -mfpmath=sse -msse2. Also verify that -ffast-math is not enabled accidentally: it lets the compiler reassociate floating-point operations as if addition and multiplication were associative, which they are not in IEEE 754, and that breaks numerically sensitive code.
Symptom · 04
Binary size explosion after adding a single function
Fix
Use -fverbose-asm and look for duplicated code blocks. The compiler likely inlined a large function at multiple call sites. Mark the suspect function __attribute__((noinline)), or adjust inlining thresholds with -finline-limit (GCC); note that -fno-inline disables inlining for the whole translation unit, not for one function. Compare .text section size with size binary before and after.
Symptom · 05
SIGILL after compiler upgrade
Fix
The new compiler may be generating AVX or AVX-512 instructions for a CPU that does not support them. Check with grep avx /proc/cpuinfo, then use objdump -d binary | grep -E 'ymm|zmm' to find the offending instructions (objdump does not print the VEX prefix by name, so search for the wide registers). Add -mno-avx or -march=<baseline> to constrain the instruction set. This is common when upgrading compilers on build machines with newer CPUs than the deployment target.
Symptom · 06
Race condition only in optimized build
Fix
The instruction scheduler may have reordered memory accesses that another thread observes. Rebuild with -fno-schedule-insns2 (GCC) to disable post-register-allocation scheduling. If the race disappears, the code most likely lacks an atomic or a memory barrier, because the compiler is allowed to reorder non-atomic accesses; treat it as a compiler bug only if the reordering also changes single-threaded behaviour. Also build with -fsanitize=thread to confirm the race independently of the scheduler change.
Symptom · 07
Function returns incorrect value on ARM but not on x86
Fix
Check the calling convention difference. ARM AAPCS returns integers in r0; x86-64 System V uses rax. If you are calling across an FFI boundary without matching ABI declarations, the return value will be read from the wrong register. Verify with -mabi=aapcs on ARM and check structure layout — ARM requires natural alignment that x86 code sometimes violates silently.
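For the floating-point entry (Symptom 03), a small sketch shows why reassociation changes results. The function names are hypothetical. The volatile temporaries are there on purpose: they force each intermediate to be rounded to a 64-bit double, so the example behaves the same whether or not the target would otherwise keep it in an 80-bit x87 register.

```c
/* (a + b) + c, with the intermediate rounded to a 64-bit double. */
double sum_left(double a, double b, double c) {
    volatile double t = a + b;
    return t + c;
}

/* a + (b + c), with the intermediate rounded to a 64-bit double. */
double sum_right(double a, double b, double c) {
    volatile double t = b + c;
    return a + t;
}
```

With a = 1e16, b = -1e16, c = 1.0, sum_left returns 1.0 while sum_right returns 0.0: doubles near 1e16 are spaced 2 apart, so -1e16 + 1.0 rounds back to -1e16. A flag that lets the compiler pick either grouping can therefore change the answer.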
★ Quick Debug: Compiler-Generated Code Issues
When your program behaves differently between debug and release builds, or you suspect a compiler bug, run these checks before diving into assembly.
Segfault only in optimized build
Immediate action
Rebuild with -O0 -g. If the crash disappears, reintroduce optimizations incrementally with -Og, then -O1, then -O2 to isolate the pass.
Commands
g++ -O2 -g -S -fno-move-loop-invariants main.cpp -o main.s && cat main.s | grep -A 30 '<function_name>:'
objdump -d -S a.out | grep -A 50 '<function_name>:'
Fix now
Add __attribute__((optimize("O0"))) on the suspect function to disable optimization for that function only, confirming the crash is optimization-induced before bisecting further.
Wrong floating-point result in release build
Immediate action
Force SSE2 and disable excess precision. If the result changes, x87 80-bit precision was the cause.
Commands
g++ -mfpmath=sse -msse2 -fexcess-precision=standard -O2 -g -c main.cpp && g++ main.o -o test && ./test
objdump -d test | grep -E 'fld|fst|fstp|fmul|fadd' | head -20
Fix now
Add -mfpmath=sse -msse2 to CXXFLAGS permanently. If x87 instructions still appear in objdump output after this flag, a third-party library is pulling them in; find it by running the same objdump -d | grep check on each shared library that ldd test lists.
Stack smashing detected after function call
Immediate action
Check for ABI mismatch between caller and callee — especially if the library was compiled with a different compiler version or flags.
Commands
g++ -fstack-protector-strong -fasynchronous-unwind-tables -g -O2 main.cpp -o main && ./main 2>&1
readelf -a libfoo.so | grep -E 'ABI|Tag' && nm -D libfoo.so | grep ' T ' | head -20
Fix now
Recompile the library with the same compiler and flags as the caller. If that is not possible, use __attribute__((sysv_abi)) or __attribute__((ms_abi)) on the function declaration to make the convention explicit.
Performance regression after compiler upgrade — spills in hot loop
Immediate action
Confirm the spill is new by comparing -Rpass-missed=regalloc output between old and new compiler. Do not assume; verify first.
Commands
clang++ -O2 -Rpass-missed=regalloc -c hotloop.cpp 2>&1 | grep -i spill
perf stat -e cycles,instructions,cache-misses,stalled-cycles-backend ./binary
Fix now
Reduce live variable count in the hot loop: hoist constants out of the loop body, split the loop if more than 10 variables are live simultaneously, or use __attribute__((optimize("O1"))) (GCC) on that function to drop the -O2 transformations that raise register pressure while keeping -O2 elsewhere.
SIGILL when calling function compiled with AVX on older CPU
Immediate action
Confirm the CPU does not support AVX, then find the offending instruction in the binary.
Commands
grep -m1 avx /proc/cpuinfo || echo 'AVX not supported'
objdump -d binary | grep -i 'vex\|ymm\|zmm' | head -20
Fix now
Add -mno-avx -mno-avx2 to CXXFLAGS for portable builds targeting mixed CPU generations. For builds that need AVX on capable machines, use CPU dispatch with __attribute__((target("avx2"))) on the specific function rather than enabling it globally.
Random memory corruption in multithreaded JIT
Immediate action
Check whether code buffer writes are synchronized. JIT compilers that compile on multiple threads and write to a shared executable buffer without flushing the instruction cache will see stale instructions executed.
Commands
valgrind --tool=helgrind ./binary 2>&1 | head -50
gdb -batch -ex 'break __clear_cache' -ex run -ex bt ./binary
Fix now
Serialize writes to the code buffer or use atomic stores. After writing new code bytes, call __builtin___clear_cache(start, end) on the written range before marking the region executable — skipping this step causes the CPU to execute stale I-cache lines.
Function returns wrong value after inlining
Immediate action
Compare assembly with and without inlining to find the register clobbering the return value.
Commands
g++ -O2 -fno-inline -g -S main.cpp -o no-inline.s
diff no-inline.s <(g++ -O2 -g -S main.cpp -o - 2>/dev/null) | grep '^[<>]' | head -30
Fix now
Add __attribute__((noinline)) to the suspect function to confirm inlining is the cause. If confirmed, look for inline assembly with an incomplete clobber list or for undefined behaviour in the function: either can leave the compiler unaware that a register holding the return value was overwritten. Check the function's register usage in the non-inlined assembly.
Undefined reference to standard library symbol after compiler upgrade
Immediate action
The new compiler may have changed the default C++ standard library version or ABI. Check the search paths and symbol availability explicitly.
Commands
g++ -E -x c++ - -v < /dev/null 2>&1 | grep -A 20 '#include <...>'
nm -D -C /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep '<symbol_name>'
Fix now
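With Clang, add -stdlib=libstdc++ explicitly so the standard library choice does not depend on the toolchain default. With GCC, check that _GLIBCXX_USE_CXX11_ABI has the same value for every object and library you link. If cross-compiling, verify the sysroot includes the matching runtime library version.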