Daily Hardware Architecture: The Uop Cache Hit Rate Cliff: Why Small Code Changes Cause Huge Performance Swings

2026-06-09

The decoded uop cache (Intel calls it the DSB, AMD calls it the op cache) is supposed to be a clean win — decode once, replay many times. But it has a brutal property: hit rate is bimodal. Code either lives almost entirely in the uop cache or almost entirely outside it, with very little middle ground. A 5% change in code size can cause a 40% performance swing.

The cliff comes from how the DSB indexes entries. On Skylake-derived cores, the DSB is organized as 32 sets × 8 ways × 6 uops per way, indexed by the linear address of a 32-byte instruction chunk. Three rules cause most cliffs:

The 32-byte boundary rule: A single 32-byte aligned chunk of code can allocate at most 3 DSB ways (18 uops). Exceed that limit and the entire chunk falls back to the legacy decoders, not just the overflow.
The "no straddling" rule: A single DSB way cannot hold uops from instructions in different 32-byte chunks. Misalign a hot loop so that it spills over a boundary and you spend extra ways on partly empty entries.
The "jump ends the way" rule: A way ends at an unconditional jump, even if more uops would fit. Dense branching wastes capacity.
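
To make the three rules concrete, here is a toy Python model that counts how many ways each 32-byte chunk would need. It is a sketch under stated assumptions, not a description of the real allocator: the `Insn` record, the `layout` helper, and the function names are hypothetical, an instruction is assumed to belong to the chunk holding its first byte, and an instruction's uops are allowed to spill from one way into the next.

```python
from dataclasses import dataclass

CHUNK_BYTES = 32         # aligned chunk size (from the text)
UOPS_PER_WAY = 6         # uops per DSB way (from the text)
MAX_WAYS_PER_CHUNK = 3   # per-chunk way limit (from the text)


@dataclass
class Insn:
    addr: int             # address of the first byte
    uops: int             # uops it decodes to (assumed >= 1)
    is_jmp: bool = False  # True for an unconditional jump


def ways_per_chunk(insns):
    """Map each 32-byte chunk base address to the ways it would need."""
    ways = {}
    current_chunk = None
    room = 0  # free uop slots left in the currently open way
    for insn in sorted(insns, key=lambda i: i.addr):
        # Assumption: an instruction belongs to the chunk of its first byte.
        chunk = insn.addr - insn.addr % CHUNK_BYTES
        if chunk != current_chunk:
            # Rule 2: a way never holds uops from two different chunks.
            current_chunk, room = chunk, 0
            ways[chunk] = 0
        remaining = insn.uops
        while remaining > 0:
            if room == 0:
                ways[chunk] += 1       # open a fresh way
                room = UOPS_PER_WAY
            used = min(remaining, room)
            room -= used
            remaining -= used
        if insn.is_jmp:
            room = 0  # Rule 3: an unconditional jump closes the way.
    return ways


def overflowing_chunks(insns):
    """Rule 1: chunks needing more than 3 ways fall back to legacy decode."""
    return [c for c, w in ways_per_chunk(insns).items()
            if w > MAX_WAYS_PER_CHUNK]


def layout(start, count, length, uops=1, is_jmp=False):
    """Hypothetical helper: `count` back-to-back instructions of equal size."""
    return [Insn(start + i * length, uops, is_jmp) for i in range(count)]
```

In this model, nine 2-uop instructions in one chunk (`layout(0x1000, 9, 3, uops=2)`) need exactly 3 ways, and a tenth pushes the chunk to 4 ways and into `overflowing_chunks`. Four back-to-back unconditional jumps also need 4 ways despite totalling only 4 uops, and moving six 4-byte instructions from `0x1000` to `0x1010` doubles their cost from 1 way to 2.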

Real example: A SPECint benchmark loop of 30 uops ran at 4.2 IPC, sourced entirely from the DSB. Adding a single nop for alignment debugging shifted the loop body relative to a 32-byte boundary, so one chunk now needed 4 DSB ways instead of 3. That whole chunk fell back to the MITE (legacy decode) path. IPC dropped to 2.6: a 38% regression from a one-byte change.

The rule of thumb: (hot loop bytes / 32) rounded up, times 3 ways, gives the maximum number of DSB ways the loop can consume, assuming the loop starts on a 32-byte boundary. A 96-byte loop spans 3 chunks × 3 ways = up to 9 ways, more than the 8 ways in a single set, so it is already at risk. Aim for hot loops to occupy ≤ 2 aligned 32-byte chunks containing ≤ 18 uops each. Compilers expose alignment controls for exactly this reason: GCC's -falign-loops=32 and, with Intel's compiler, -falign-loops=32 -falign-functions=32 exist to keep code on the DSB side of the cliff.
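
The rule of thumb above is simple enough to script. The sketch below is a minimal illustration, with the function name and the `start_addr` parameter being assumptions added here: it counts the 32-byte chunks a loop touches and multiplies by the 3-way limit, which gives an upper bound rather than the ways actually used.

```python
CHUNK_BYTES = 32        # aligned chunk size (from the text)
MAX_WAYS_PER_CHUNK = 3  # per-chunk way limit (from the text)


def worst_case_ways(loop_bytes, start_addr=0):
    """Return (chunks touched, upper bound on DSB ways) for a hot loop.

    With start_addr on a 32-byte boundary this is ceil(loop_bytes / 32) * 3,
    the rule of thumb from the text. A misaligned start can touch one
    extra chunk.
    """
    if loop_bytes <= 0:
        raise ValueError("loop_bytes must be positive")
    first = start_addr // CHUNK_BYTES
    last = (start_addr + loop_bytes - 1) // CHUNK_BYTES
    chunks = last - first + 1
    return chunks, chunks * MAX_WAYS_PER_CHUNK


print(worst_case_ways(96))                 # aligned 96-byte loop
print(worst_case_ways(96, start_addr=16))  # same loop, shifted by 16 bytes
print(worst_case_ways(64))                 # the "<= 2 chunks" target
```

An aligned 96-byte loop gives 3 chunks and up to 9 ways. Starting the same loop 16 bytes into a chunk makes it touch 4 chunks and up to 12 ways, which is the alignment sensitivity the compiler flags are meant to remove.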

You can directly measure where you sit using perf stat -e idq.dsb_uops,idq.mite_uops. A healthy compute loop wants DSB uops at 90%+ of the total. Drop below 70% and you're paying for the legacy decoder's 4-wide bottleneck instead of the DSB's 6-wide delivery — a 33% front-end bandwidth haircut even before considering decode latency.
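
As a rough aid for reading those two counters, the following sketch turns raw counts into a DSB share and applies the 90% and 70% thresholds from the paragraph above. The function names and the "borderline" label for the range in between are assumptions made for illustration, and the total deliberately includes only the two events named in the text.

```python
DSB_WIDTH = 6   # uops per cycle from the DSB (from the text)
MITE_WIDTH = 4  # uops per cycle from legacy decode (from the text)


def dsb_share(dsb_uops, mite_uops):
    """Fraction of delivered uops that came from the DSB."""
    total = dsb_uops + mite_uops
    if total <= 0:
        raise ValueError("no uops counted")
    return dsb_uops / total


def classify(share):
    """Apply the thresholds from the text; 'borderline' is a label added here."""
    if share >= 0.90:
        return "healthy"
    if share < 0.70:
        return "legacy-decode bound"
    return "borderline"


def bandwidth_haircut():
    """Relative front-end bandwidth lost going from DSB to legacy decode."""
    return (DSB_WIDTH - MITE_WIDTH) / DSB_WIDTH


# Hypothetical counter values, not measurements.
share = dsb_share(dsb_uops=650_000, mite_uops=350_000)
print(f"DSB share {share:.0%}: {classify(share)}")
print(f"Bandwidth haircut: {bandwidth_haircut():.0%}")
```

Feed it the idq.dsb_uops and idq.mite_uops totals from a perf stat run over the hot region; the counter values in the example are made up to show a loop on the wrong side of the 70% line.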

Key Takeaway: The uop cache has hard structural limits per 32-byte chunk, so trivial code-size changes can flip a loop from 6-wide DSB delivery to 4-wide legacy decode and tank IPC overnight.
