2026-05-12
Most modern CPUs have separate L1 caches for data (L1d) and instructions (L1i), commonly 32KB each on x86. You optimize data layout religiously, but your code competes for an equally scarce resource, and an L1i miss stalls the front-end before any work can happen.
The L1i is filled by fetching 64-byte lines from L2 (or further). When you call a function, the CPU fetches the line containing its entry point. If that function calls another cold function on a distant page, you take an i-cache miss, possibly an iTLB miss, and the pipeline starves. Modern CPUs can issue 4–8 instructions per cycle, but only if the front-end keeps up.
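To make the 64-byte granularity concrete, here is a minimal sketch that counts how many cache lines a block of code overlaps. The function name and the fixed line size are assumptions for illustration; real line sizes should be checked per CPU.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed line size: 64 bytes, as in the text. Verify for your CPU. */
#define ICACHE_LINE_BYTES 64u

/* Hypothetical helper: number of cache lines overlapped by the
 * byte range [start, start + size). */
size_t lines_touched(uintptr_t start, size_t size)
{
    if (size == 0)
        return 0;
    uintptr_t first = start / ICACHE_LINE_BYTES;
    uintptr_t last = (start + size - 1) / ICACHE_LINE_BYTES;
    return (size_t)(last - first + 1);
}
```

An 8-byte snippet that starts 4 bytes before a line boundary costs two line fetches, not one, which is why alignment and placement of hot code matter. At 64 bytes per line, a 32KB L1i holds 512 lines in total.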
How the compiler helps (and hurts):
Hot functions scattered across the .text section blow your i-cache. Tools like perf + BOLT or llvm-propeller reorder functions by call frequency.
Aggressive inlining duplicates code, growing .text and evicting other hot code. __attribute__((noinline)) on cold paths matters.
Use __builtin_expect and PGO to keep the hot path inline and push the cold path (error handling, logging) to the end of .text, often onto a separate page that's never fetched in steady state.
Real-world example: Facebook's HHVM hit a wall where 30%+ of cycles were front-end stalls, driven by i-cache pressure from a multi-MB hot working set. They built BOLT, a post-link optimizer that reorders functions and basic blocks using perf profiles. The reported result: 7–8% speedup on HHVM, ~5% on Clang itself, with zero source changes. Google reports similar wins with Propeller on Search.
Rule of thumb: If your hot working set of code exceeds ~32KB, you're thrashing L1i. Measure with perf stat -e L1-icache-load-misses,iTLB-load-misses. A miss rate over 1% on the hot path is a strong signal. For comparison: an L1i hit is ~4 cycles; an L2 hit is ~12; an LLC miss to DRAM is 200+. That's a 50× swing per fetched line.
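The latency figures above can be turned into a rough average cost per line fetch. This is a deliberately simplified sketch: it uses the approximate cycle counts from the text, treats every L2 miss as a trip to DRAM, and ignores prefetching and overlap with other work. The function and parameter names are assumptions for illustration.

```c
/* Approximate latencies in cycles, taken from the rough figures in the text. */
#define L1I_HIT_CYCLES 4.0
#define L2_HIT_CYCLES  12.0
#define DRAM_CYCLES    200.0

/* Hypothetical model of the average cycles per instruction-line fetch.
 *   l1i_miss_rate: fraction of fetches that miss L1i (0..1)
 *   l2_miss_rate:  fraction of those L1i misses that also miss L2 (0..1)
 * Simplification: an L2 miss is charged the full DRAM latency. */
double avg_fetch_cycles(double l1i_miss_rate, double l2_miss_rate)
{
    double miss_cost = (1.0 - l2_miss_rate) * L2_HIT_CYCLES
                     + l2_miss_rate * DRAM_CYCLES;
    return (1.0 - l1i_miss_rate) * L1I_HIT_CYCLES
         + l1i_miss_rate * miss_cost;
}
```

Under this model, a 1% L1i miss rate that always hits in L2 raises the average from 4 to about 4.08 cycles, while the same 1% going all the way to DRAM raises it to about 5.96. Where the misses land matters as much as how many there are.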
Practical levers: mark cold paths with __attribute__((cold)); the compiler optimizes them for size and places them in a separate section (.text.unlikely with GCC). Use likely()/unlikely() macros, which wrap __builtin_expect. For bigger wins, enable PGO or AutoFDO; BOLT on top of that is the current state of the art. Avoid template explosions in hot loops: every instantiation is more bytes fighting for the same cache lines.
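A minimal sketch of the cold-path levers above, assuming GCC or Clang. The function names (report_bad_sample, sum_valid_samples) and the "negative sample is an error" rule are hypothetical and exist only to give the hot loop a rare branch.

```c
#include <stddef.h>
#include <stdio.h>

/* Conventional branch-hint macros built on __builtin_expect (GCC/Clang). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical error handler. cold + noinline asks the compiler to keep
 * its body out of the hot loop and group it with other cold code. */
__attribute__((cold, noinline))
void report_bad_sample(size_t index, int value)
{
    fprintf(stderr, "bad sample %d at index %zu\n", value, index);
}

/* Sums non-negative samples; negative ones are treated as rare
 * errors, reported, and skipped. */
long sum_valid_samples(const int *samples, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (unlikely(samples[i] < 0)) {
            report_bad_sample(i, samples[i]);
            continue;
        }
        total += samples[i];
    }
    return total;
}
```

These are hints, not guarantees: where the cold function ends up depends on the compiler, its version, and the optimization flags, so it is worth checking the resulting layout in the binary rather than assuming it.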
