Daily Low-Level Programming: The Instruction Cache: Why Code Layout Matters as Much as Data Layout

2026-05-12

Most modern CPUs have separate L1 caches for data (L1d) and instructions (L1i), typically 32KB each on x86. You optimize data layout religiously, but your code competes for an equally scarce resource of its own, and an L1i miss stalls the front-end before any work can happen.

The L1i is filled by fetching 64-byte lines from L2 (or further). When you call a function, the CPU fetches the line containing its entry point. If that function calls another cold function on a distant page, you take an i-cache miss, possibly an iTLB miss, and the pipeline starves. Modern CPUs can issue 4–8 instructions per cycle, but only if the front-end keeps up.
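To make the line and page arithmetic concrete, here is a minimal sketch in C. The 64-byte line size comes from the paragraph above; the 4 KiB page size is an assumption (a common x86 default), and the helper names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry: 64-byte lines (from the text) and 4 KiB pages
   (a common x86 default, not stated in the text). */
enum { LINE_BYTES = 64, PAGE_BYTES = 4096 };

/* Number of cache lines touched by code occupying [addr, addr + size). */
static size_t lines_touched(uintptr_t addr, size_t size)
{
    if (size == 0)
        return 0;
    uintptr_t first = addr / LINE_BYTES;
    uintptr_t last  = (addr + size - 1) / LINE_BYTES;
    return (size_t)(last - first + 1);
}

/* True when two code addresses live on different pages, so a call
   between them may need a second iTLB entry. */
static int on_different_pages(uintptr_t a, uintptr_t b)
{
    return (a / PAGE_BYTES) != (b / PAGE_BYTES);
}
```

An 8-byte stretch of code that starts at offset 60 straddles two lines, which is why alignment of hot entry points can matter as much as their size.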

How the compiler helps (and hurts):

Function ordering: By default the linker places functions roughly in the order it encounters them in its input files. Hot functions scattered across a 2MB .text section blow your i-cache, because every fetched line and page carries mostly cold bytes. Tools like BOLT (fed by perf profiles) or Propeller reorder functions by call frequency.
Inlining: Helps by removing call overhead and exposing further optimizations, but huge inlined functions hurt by bloating .text and evicting other hot code. __attribute__((noinline)) on cold paths keeps rarely executed code out of hot callers.
Basic-block ordering: GCC/Clang use __builtin_expect and PGO to keep the hot path as straight-line fall-through code and push the cold path (error handling, logging) to the end of the function or into a separate cold section, often onto a page that's never fetched in steady state.
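The inlining and basic-block points can be sketched by hand in C. This is an illustrative example, not taken from any real codebase: parse_decimal and handle_bad_digit are hypothetical names, and the attributes are the GCC/Clang extensions mentioned above.

```c
#include <stddef.h>

/* Hypothetical slow path: kept out of line so its bytes do not sit in
   the middle of the hot loop. */
__attribute__((noinline))
static long handle_bad_digit(const char *s, size_t i)
{
    (void)s;
    (void)i;
    return -1;   /* a real program might log or build an error here */
}

/* Parse a non-negative decimal number; returns -1 on a bad character.
   (No overflow checking, to keep the sketch short.) */
static long parse_decimal(const char *s, size_t len)
{
    long value = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned digit = (unsigned)(s[i] - '0');
        /* Hint that the error branch is rarely taken, so the compiler
           can lay out the loop body as the fall-through path. */
        if (__builtin_expect(digit > 9, 0))
            return handle_bad_digit(s, i);
        value = value * 10 + (long)digit;
    }
    return value;
}
```

Whether the hint changes the generated layout depends on the compiler and optimization level; comparing the disassembly with and without it is the only way to be sure.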

Real-world example: Facebook's HHVM hit a wall where 30%+ of cycles were front-end stalls, driven by i-cache pressure from a multi-MB hot working set. They built BOLT, a post-link optimizer that reorders functions and basic blocks using perf profiles. The BOLT paper reports up to 7% speedup on data-center applications (HHVM among them) on top of PGO and LTO, and up to about 20% on the Clang and GCC binaries themselves, with zero source changes. Google reports similar wins with Propeller on Search.

Rule of thumb: If your hot working set of code exceeds ~32KB, you're thrashing L1i. Measure with perf stat -e L1-icache-load-misses,iTLB-load-misses (exact event names vary by CPU; check perf list). Misses on more than ~1% of instruction fetches on the hot path are a strong signal. For comparison: an L1i hit is ~4 cycles; an L2 hit is ~12; an LLC miss to DRAM is 200+. That's a 50× swing between the best and worst case for a single fetched line.
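Plugging the latency figures above into a back-of-the-envelope model shows why even a small miss rate matters. This is a deliberately simplified sketch: it treats the quoted cycle counts as total costs, ignores prefetching and overlap, and assumes every L2 miss goes all the way to DRAM. The function name is made up for illustration.

```c
/* Latencies in cycles, taken from the rule of thumb above. */
static const double L1I_HIT = 4.0, L2_HIT = 12.0, DRAM = 200.0;

/* Average cycles per fetched line.
   l1_miss: fraction of fetches that miss L1i.
   l2_miss: fraction of those misses that also miss L2 and
            (simplifying) are served from DRAM. */
static double avg_fetch_cycles(double l1_miss, double l2_miss)
{
    double miss_cost = (1.0 - l2_miss) * L2_HIT + l2_miss * DRAM;
    return (1.0 - l1_miss) * L1I_HIT + l1_miss * miss_cost;
}
```

Under this model a 1% L1i miss rate served entirely from L2 moves the average from 4.00 to 4.08 cycles, while the same 1% served from DRAM moves it to 5.96 cycles, roughly a 50% increase in fetch cost.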

Practical levers: Mark cold paths with __attribute__((cold)); the compiler optimizes them for size and places them in a separate section (GCC uses .text.unlikely). Use likely()/unlikely() macros, which wrap __builtin_expect. For real wins, enable PGO or AutoFDO; BOLT on top of that is the current state of the art. Avoid template explosions in hot loops, since every instantiation is more bytes fighting for the same cache lines.
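A minimal sketch of the first two levers, assuming GCC or Clang. The likely()/unlikely() definitions follow the Linux kernel convention and are not part of standard C; push_byte and report_overflow are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Common definitions of the macros (kernel-style spelling; they are
   not provided by the C standard library). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical error reporter. The cold attribute asks the compiler to
   optimize it for size and keep it away from hot code. */
__attribute__((cold, noinline))
static int report_overflow(size_t want, size_t cap)
{
    fprintf(stderr, "buffer full: want %zu, cap %zu\n", want, cap);
    return -1;
}

/* Append one byte; the overflow branch is expected to be rare. */
static int push_byte(unsigned char *buf, size_t *len, size_t cap,
                     unsigned char b)
{
    if (unlikely(*len >= cap))
        return report_overflow(*len + 1, cap);
    buf[(*len)++] = b;
    return 0;
}
```

Hand annotations like these are guesses about runtime behavior; PGO or AutoFDO replaces the guesses with measured branch frequencies, which is why the profile-driven route tends to win.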

See it in action: Check out Caching - Simply Explained by Simply Explained to see this theory applied.

Key Takeaway: Your CPU caches code with the same brutal economics as data — lay out hot functions contiguously and exile cold paths, or the front-end will starve the back-end no matter how clean your inner loop looks.
