Compiler Code Generation — When Register Spills Corrupt
A spilled register overlapped a stack frame by one byte, causing silent memory corruption.
- Code generation turns IR into target machine instructions — every register assignment, stack layout, and instruction encoding is decided here.
- Register allocation maps virtual to physical registers; spilling to memory in a hot loop has caused up to 10x slowdown in memory-bandwidth-limited workloads.
- Instruction selection picks the cheapest pattern covering each IR operation — cost models are micro-architecture-specific and wrong assumptions here show up as production regressions.
- Peephole optimizations polish generated code locally after initial emit — they rarely introduce bugs but interact badly with schedulers in ways that are hard to reproduce.
- Production insight: most compiler CVEs and release-build-only crashes trace back to codegen, not the frontend.
- Debugging: use -Rpass=regalloc (LLVM) or -fopt-info-all (GCC) to trace spill decisions; use -fverbose-asm to map assembly back to source lines.
Imagine you write a recipe in English, then a professional chef translates it into precise kitchen instructions for a specific restaurant's equipment — listing exact burner numbers, which pan to use, in what exact order. Code generation is that translation step: your program has already been understood and optimized in a machine-independent form, and now the compiler writes precise CPU instructions for the exact hardware it's targeting. The CPU doesn't speak Python or Java — it speaks binary opcodes — and code generation is the compiler's job of bridging that gap. The translation looks mechanical but it isn't: the chef has to decide which burner to use when all six are occupied, what to do when the pan specified doesn't exist in this kitchen, and how to reorder steps so nothing burns while something else is resting. Get those decisions wrong and the dish comes out wrong — even if the recipe was perfect.
Your IDE's compile button triggers a pipeline that, in milliseconds, translates source code into native instructions. The final stage — code generation — makes the concrete decisions: which register, which instruction encoding, which stack layout. Get it wrong and you'll see silent data corruption, security vulnerabilities, or a 3× slowdown on a hot loop. That's the production reality. Most compiler CVEs trace back to codegen, not the frontend. So when you debug a crash that only appears in release builds, start with the generated assembly, not your source logic. Understanding code generation is a senior-level debugging superpower — not because you'll rewrite a compiler, but because you'll know exactly where to look when the compiler betrays you.
What Is Code Generation?
Before we get into the mechanics, here is a concrete demonstration of what code generation actually does. Take this three-line C function:
``c int add(int b, int c) { return b + c * 2; } ``
On x86-64 Linux with -O1, clang turns it into three instructions. On ARM64 with the same flags, it becomes two different instructions. Same source, same semantics, completely different machine output — because code generation is target-specific by definition. That is the problem it solves.
At its core, code generation takes the compiler's intermediate representation (IR) — a machine-independent, low-level program description — and translates it into the actual instruction set of the target CPU. This step decides every concrete detail: which CPU register holds a variable, which addressing mode to use for a memory access, how to arrange the stack frame, which of the dozens of x86 instruction encodings to pick for a simple addition, and whether a 64-bit constant fits in an immediate field or needs to be loaded from memory.
What makes this hard? The compiler must produce correct output for every possible IR input, while also squeezing out wasted cycles. A single incorrect register assignment corrupts the entire program state. The generated code must also respect the target's ABI — calling conventions, data alignment, exception handling unwind tables — otherwise a C++ function cannot talk to an assembly library correctly.
That ABI constraint is where production bugs hide. I spent two days once debugging why a C library function returned garbage values on ARM. The code generator assumed struct alignment matched x86 conventions. It did not. Not a bug in the algorithm — a bug in the code generator's assumptions about the target. The fix was one line: __attribute__((aligned(8))) on the struct. Three weeks of confusion, one attribute.
In modern compilers like LLVM and GCC, code generation is split into multiple sequential passes. Each pass can be inspected independently. Use -fopt-info in GCC or -Rpass=.* in LLVM to trace which pass transformed your code. When you hit a release-build-only crash, bisecting these passes — not your source code — is the fastest path to the root cause.
One trap that catches teams repeatedly: the code generator must handle unreachable code correctly. A basic block that the optimizer proves unreachable is still processed by the code generator, which inserts a trap instruction (ud2 on x86, udf on ARM) in its place. If the CPU speculatively executes that path — which happens more than you'd think on modern out-of-order processors — you get a crash at a location that looks impossible from your source. When you see a crash in a location your source logic says can never be reached, look for a trap instruction in the disassembly.
Another thing you will not find in textbooks: ELF relocations and PIC (position-independent code) decisions are made during code generation. If you see 'relocation truncated to fit' at link time, that is the code generator choosing a relocation type whose offset range the final binary layout violated. Check the relocation table with readelf -r before assuming it is a linker bug — it usually is not.
Intermediate Representation — The Compiler's Lingua Franca
Before code generation runs, the compiler represents your program in an Intermediate Representation. IR sits between the source language and the target machine — it is like an assembly language that has never heard of x86 or ARM. Common IR forms include three-address code (TAC), static single assignment (SSA), and stack-based bytecode (JVM, Python's CPython bytecode).
Why does the compiler need IR at all? Because it lets the hard work of language analysis — parsing, type checking, semantic analysis — happen once. That analysis produces IR, and then every optimization (dead code elimination, constant propagation, loop invariant hoisting) runs on the IR. All of these optimizations benefit every language that targets the IR, and they produce output that every backend can consume. LLVM supports over 30 CPU architectures from a single IR. The frontend for C, C++, Rust, Swift, and Julia all emit the same LLVM IR. That is the payoff.
LLVM's IR is SSA-based: every virtual register is assigned exactly once, and every use of a value traces back to exactly one definition. This single property makes dataflow analysis — the foundation of most optimizations — trivially correct. If you can see a use of %x, you know exactly one instruction produced %x. No aliasing, no ambiguity.
SSA introduces phi nodes at control flow merge points. A phi node says: 'this value is either A (if we came from block 1) or B (if we came from block 2).' With large switch statements, you can generate tens of thousands of phi nodes. This blows up IR memory and slows down every subsequent pass. The fix is to run simplifycfg early, which merges redundant blocks and eliminates unnecessary phi nodes. If compile times spike after adding a large switch, check your IR stats with opt -stats and look at the phi node count.
GCC uses multiple IR levels rather than one: GENERIC (close to the AST), GIMPLE (statement-level, SSA), and RTL (register transfer language, close to assembly). Each level enables different optimizations. GIMPLE is where most GCC optimizations run. RTL is where the register allocator, instruction scheduler, and peephole optimizer work. If you see a discrepancy between GIMPLE output and the final assembly, check the RTL dumps with -fdump-rtl-all — the transformation happened somewhere in that layer.
A critical detail that custom compiler authors consistently skip: debug metadata. Every IR instruction should carry a source file, line, and column annotation. In LLVM IR, this is the !dbg metadata attached to every instruction. Without it, gdb shows wrong line numbers, crash dumps point to the wrong function, and production post-mortems take three times as long. I once worked with a machine learning compiler that skipped debug metadata to simplify the IR generation code. Six months later, the team spent two weeks debugging a segfault that would have taken two hours with correct line information. Invest in debug metadata from day one.
IR must also be legalized before code generation: operations that the target hardware does not support natively must be expanded or split. A 64-bit multiply on a 32-bit target becomes a sequence of 32-bit operations. A floating-point comparison on a target with no hardware FPU becomes a software library call. Legalization is a phase that runs inside the code generator, after instruction selection has chosen patterns but before the final encoding. Bugs here look like wrong results on one target but not another — classic cross-platform correctness issues.
Finally, monitor IR density. If you are running a custom optimization pipeline, use opt -stats to track instruction counts per function. A function with more than 100,000 IR instructions usually indicates missed optimization — an inliner that fired too aggressively, an unrolling pass that ran without a size limit, or a frontend that failed to clean up dead globals. I have seen a compiler generate IR 5x larger than necessary because the frontend did not remove constant arrays after lowering. The register allocator spilled heavily as a result. One DCE pass before code generation fixed it.
- Frontend: source language → IR. Any language can target the same IR — C, Rust, Swift, Julia all emit LLVM IR.
- Backend: IR → machine code. Any CPU with a backend can consume the IR — x86, ARM, RISC-V, WebAssembly.
- Optimizations operate on IR, so every language gets them for free and every target benefits.
- SSA form — each variable assigned exactly once — makes dataflow analysis correct by construction. This is why LLVM optimizations are so powerful.
- Debug metadata in IR is not optional. Without !dbg annotations on every instruction, your production crash dumps are useless.
Register Allocation — Where Performance Is Won or Lost
Register allocation is the most performance-critical and most complex part of code generation. The IR has an unlimited supply of virtual registers. The target CPU has a fixed, small set of physical registers — 16 general-purpose on x86-64, 31 on ARM64. The allocator's job is to map the unlimited to the finite, and when there is not enough room, decide which values to spill to memory and when.
Graph coloring is the classic algorithm: treat each virtual register as a graph node, connect any two nodes that are simultaneously live (interference edges), and color the graph with K colors where K equals the number of physical registers. If two nodes share an edge, they cannot share a color — they cannot share a register. If the graph cannot be K-colored, some nodes must be spilled: their value is written to a stack slot and reloaded when needed. Graph coloring in its general form is NP-hard, so compilers use heuristics — Chaitin's algorithm, Briggs' improvement, and LLVM's greedy allocator are all approximations that work well in practice.
Linear scan is the alternative: instead of building an interference graph, sort live ranges by start point and greedily assign registers as they begin and expire. It is faster to compile but produces worse code than graph coloring. V8's Crankshaft used linear scan; most AOT compilers prefer graph coloring variants. HotSpot's C2 JIT uses a graph-coloring allocator because the compilation time cost is worth the runtime payoff for code that runs for hours.
PBQP (Partitioned Boolean Quadratic Programming) formulates allocation as a combinatorial optimization problem and can produce near-optimal results. LLVM deployed PBQP for ARM targets in production and it was the default allocator for that target for several years. It was removed in LLVM 16 — not because it was wrong, but because the greedy allocator had improved enough that PBQP's compile-time overhead no longer justified the marginal code quality gain. Understanding PBQP is still valuable for reasoning about what optimal allocation looks like and why the greedy allocator sometimes falls short.
The impact of spilling is not linear. A spilled loop counter generates two memory operations per iteration — one load, one store — replacing what was a single register increment. On a memory-bandwidth-limited workload, that can cause measured slowdowns of 5x to 10x on the hot path. I debugged a performance regression in a high-frequency trading system where a compiler upgrade changed the greedy allocator's split heuristics and caused a loop counter to spill. The counter was updated every iteration. The throughput dropped 40%. The fix was not to change the algorithm — it was to split the loop into a setup phase with high register pressure and a tight computation phase with low pressure. That restructuring dropped the live variable count across the iteration boundary from 14 to 9, which was enough for the allocator to keep the counter in a register.
A rule of thumb that holds on x86-64: if your inner loop has more than 12 live variables simultaneously, you will likely see spills. Use -Rpass=regalloc (LLVM) or -fopt-info-all (GCC) to confirm. The output will tell you exactly which virtual register spilled and at which line. That is your starting point for restructuring.
Rematerialization is the allocator's alternative to spilling: instead of storing a value to the stack and reloading it, recompute it from constants or from values that are already in registers. Constants, loop invariants, and values derived only from constants are all candidates. LLVM's greedy allocator performs rematerialization automatically for many patterns. When you see a spill in -Rpass output and the spilled value looks cheap to recompute, add a comment in the bug report — the allocator may be missing a rematerialization opportunity.
Do not underestimate the interaction between inlining and register pressure. Inlining a function that has 8 live variables into a caller that already has 10 can push the combined live count above the spill threshold. If you see a performance regression after enabling -finline-functions, check register pressure before and after with -Rpass=regalloc. Sometimes the right answer is to not inline a function even if it is small, because the register pressure cost exceeds the call overhead savings.
One more: floating-point and SIMD register allocation are separate from integer allocation on x86-64. There are 16 XMM registers (SSE/AVX) and the legacy x87 stack. If you mix x87 and SSE code paths — which happens when an old library uses x87 and your code uses SSE — the allocator manages two distinct register files and can produce unexpected spills at the boundary. Force SSE with -mfpmath=sse to unify the model. I have seen a 2x slowdown in a physics simulation that traced to x87/SSE boundary spills. One compiler flag fixed it.
Instruction Selection — Picking the Right Instruction for the Job
Instruction selection is the phase that maps IR operations to actual CPU instructions. A single IR add operation could become an x86 add, a lea, a fused multiply-add, or a shift-and-add depending on the operands, the surrounding context, and the target micro-architecture. The selector's job is to find the cheapest instruction sequence that correctly implements each IR operation.
Tree-based pattern matching is the standard approach: represent a basic block's IR as a tree of operations, then match subtrees against instruction patterns defined in the target description. Each pattern has an associated cost. Dynamic programming finds the minimum-cost cover of the entire tree — this is where the analogy to tiling a floor with shaped tiles comes from. LLVM formalizes this in TableGen: you write instruction definitions and patterns in .td files, and TableGen generates C++ code for the selector. Adding a new instruction to an LLVM backend is editing a configuration file, not writing a thousand lines of selector code.
LLVM's SelectionDAG generalizes tree matching to a directed acyclic graph, which handles operations that produce multiple results (like a division that yields both quotient and remainder) and operations that have side effects (like stores). SelectionDAG also performs legalization: operations the target cannot execute natively are expanded or split during selection. A 64-bit integer divide on a 32-bit ARM target becomes a software library call; a vector type wider than the hardware supports is split into two narrower operations. If a correctness bug only appears on one target, legalization is the first place to look.
LLVM is actively replacing SelectionDAG with GlobalISel, a newer instruction selection infrastructure that operates on a lighter Machine IR (MIR) representation. GlobalISel is the default for AArch64 as of LLVM 12 and has been production-stable on that target since. For x86, SelectionDAG remains the default in 2026 but GlobalISel support is mature. If you are building a new LLVM backend today, GlobalISel is the right starting point — the infrastructure is cleaner and the long-term maintenance burden is lower.
Cost models are where instruction selection gets subtle. On Intel Skylake, simple add executes on ports 0, 1, 5, and 6 — four execution ports. Simple lea (one or two components, no scale factor) executes on ports 1 and 5 — two ports. The selector prefers add for simple increments because it has higher port availability and thus better throughput under superscalar execution. Complex lea (three components or a scale factor) executes only on port 1 and has higher latency on some micro-architectures — avoid it in loops unless you genuinely need the address computation. These details live in Agner Fog's instruction tables and the vendor optimization manuals; the compiler's cost model approximates them, but the approximation is sometimes wrong for your specific CPU.
The inc instruction is the classic example of instruction selection getting burned by micro-architecture details. On the Pentium 4 and some early Core processors, inc reads and writes the flags register but does not update the carry flag — this creates a false dependency on the carry flag from a previous instruction, stalling the pipeline. add $1, reg does not have this dependency. The cost model in the compiler for those micro-architectures correctly prefers add over inc. If you are targeting a CPU that predates reliable cost model support, test with -mtune set explicitly to your CPU family.
Constant materialization is another instruction selection challenge. On x86, a 32-bit immediate fits directly in the instruction encoding. A 64-bit immediate requires movabs — a longer encoding. If a large constant appears many times in a hot loop, the selector may choose to load it from a literal pool in memory rather than rematerialize it each time. Whether that is correct depends on cache pressure and the number of uses. If you see a hot path loading a constant from memory in a tight loop, move the constant to a local variable assigned before the loop — most compilers will then rematerialize it in a register inside the loop.
Intrinsics are the escape hatch when the selector makes the wrong choice. If profiling confirms that the compiler is not emitting a FMA instruction for a multiply-accumulate pattern, use _mm_fmadd_ps explicitly rather than hoping the auto-vectorizer will find it. Intrinsics sacrifice portability but give you precise control over the emitted instruction. The rule is: measure first, use intrinsics only when profiling confirms the generic selection is provably wrong.
- Each CPU instruction is a tile: it covers a pattern of IR operations (add, load, multiply) and has an associated cost (latency, throughput, code size).
- Optimal tiling finds the minimum-cost cover for the entire IR tree — dynamic programming solves this in O(n) for trees.
- Modern selection uses a DAG (not just a tree) to handle shared subexpressions and multi-output operations like divmod.
- Cost models are micro-architecture-specific. What is cheapest on Skylake may be suboptimal on Zen 4. Use -mtune=native to match the model to your hardware.
- When the selector gets it wrong (and it will), intrinsics are your override — but only use them after profiling confirms the generic selection is provably suboptimal.
Peephole Optimization — The Final Polish
Peephole optimization is the last cleanup pass in code generation. It examines a small window of consecutive instructions — typically two to five — and replaces sequences with equivalent but cheaper or shorter alternatives. It catches patterns that earlier passes left behind: a register initialized and never read, two adjacent stack adjustments that could be merged, a conditional jump to the very next instruction that could become a fall-through.
GCC runs two peephole passes: -fpeephole handles simple one-to-one replacements during RTL generation, and -fpeephole2 runs after register allocation on a slightly larger window. LLVM has a PeepholeOptimizer pass that runs early in the machine code pipeline and a separate DeadMachineInstructionElim pass to remove the dead instructions it identifies. Both compilers also run a later scheduling pass that can surface additional peephole opportunities.
The most impactful peephole patterns in practice are copy propagation and dead copy elimination. Copy propagation rewrites uses of a copy destination to use the copy source directly — this is what turns:
```asm ; Before: two instructions, indirect read mov eax, ebx mov ecx, eax ; reads eax, but we could read ebx directly
; After copy propagation: the copy source is used directly mov eax, ebx mov ecx, ebx ; ecx now reads from ebx — eax copy is now dead ```
Once the second mov reads from ebx directly, the first mov may become dead — nothing reads eax anymore. Dead copy elimination then removes it entirely:
``asm ; After dead copy elimination: one instruction mov ecx, ebx ``
These two patterns together are the most common reason peephole produces visible code size reduction. They do not change semantics — they just remove indirection that earlier passes introduced.
Strength reduction is another peephole domain. On targets where shift is cheaper than multiply, the peephole pass may replace mul by a power of two with a shift. On embedded targets without a hardware multiplier, this is a correctness-adjacent optimization — the semantic result is identical but the performance difference is large. The key point is that the cost model must be accurate for the target; a strength reduction that helps on ARM Cortex-M0 may be neutral on x86.
Performance interacts with peephole in non-obvious ways. Removing an instruction saves code size and reduces decode pressure, but it can also change the dependency chain visible to the out-of-order engine. I once saw a case where removing a redundant mov eliminated a dependency break that the CPU's rename unit was using to parallelise two chains. The net result: one fewer instruction, but two cycles slower per iteration. The lesson is to always measure with perf stat after enabling or disabling peephole passes — saved instructions do not automatically mean saved cycles.
If you are debugging a regression that appears at -O2 but not at -O1, test with -fno-peephole2 (GCC) to isolate the second peephole pass. In LLVM/Clang, there is no single stable flag to disable the peephole pass in isolation. The practical approach is to compare -O1 and -O2 assembly output for the hot function and look for the specific transformation that changed. Use -fdump-rtl-all (GCC) or -mllvm -print-machineinstrs (Clang) to see the instruction stream at each stage of the machine code pipeline.
When writing a custom backend, implement peephole optimizations incrementally and measure each one. Start with the ten patterns that appear most frequently in your IR — dead copy elimination, redundant compare elimination, and branch-to-next-instruction removal are almost always in the top five. Each pattern is cheap to implement; the discipline is measuring before adding the next one. An unmeasured peephole pass is a maintenance liability.
The Spilled Register That Corrupted a Medical Device
- A spilled register can silently corrupt memory if the stack frame layout is off by a single byte — and the symptoms look exactly like a hardware fault.
- Use -fverbose-asm with -g to map assembly back to source lines the moment you suspect a codegen issue. It is the fastest path from symptom to root cause.
- Never trust a compiler upgrade in a safety-critical system without running differential assembly comparison on every hot function. A spill location change is a correctness change.
- Run Csmith differential fuzzing before deploying a new compiler version to embedded targets — it catches miscompilations that integration tests miss because they test program behaviour, not generated code correctness.
That's Compiler Design. Mark it forged?
17 min read · try the examples if you haven't