The Micro-Op Cache: Why Your CPU Decodes Each Instruction Only Once

2026-05-20

x86 instructions are a nightmare to decode: variable length (1–15 bytes), prefix soup, and a legacy from the 1970s. Modern Intel and AMD CPUs hide this with the micro-op cache (Intel calls it the DSB — Decoded Stream Buffer; AMD calls it the op cache). It stores instructions after they've been decoded into µops, so the legacy decoder pipeline only fires on cold code.

The legacy pipeline can decode at most 4–5 instructions per cycle, bottlenecked by a single complex decoder and several simple ones. The µop cache delivers up to 6 µops per cycle on Skylake, with much lower power and zero decoder pressure. For a hot loop, this can be the difference between front-end-bound and back-end-bound execution.

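How it's organized (Skylake-era): 32 sets × 8 ways, where each entry holds up to 6 µops covering an aligned 32-byte window of code, for a ceiling of 32 × 8 × 6 = 1,536 µops. The critical constraints, which the rest of this post turns on, are total capacity, 32-byte alignment, and microcoded instructions.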
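
To make the geometry concrete, here is a minimal Python sketch that works out the capacity ceiling from the figures above and counts how many aligned 32-byte windows a stretch of code touches. The function names and example addresses are illustrative, not taken from any tool, and the real number of µops per window depends on the instructions involved.

```python
# Skylake-era geometry as quoted in this post.
SETS = 32
WAYS = 8
UOPS_PER_ENTRY = 6
WINDOW_BYTES = 32


def uop_cache_capacity(sets=SETS, ways=WAYS, uops_per_entry=UOPS_PER_ENTRY):
    """Upper bound on cached µops: every entry in every set completely full."""
    return sets * ways * uops_per_entry


def windows_touched(start_addr, length_bytes, window=WINDOW_BYTES):
    """Number of aligned `window`-byte chunks overlapped by a code range."""
    if length_bytes <= 0:
        return 0
    first = start_addr // window
    last = (start_addr + length_bytes - 1) // window
    return last - first + 1


if __name__ == "__main__":
    print(uop_cache_capacity())            # capacity ceiling in µops
    print(windows_touched(0x1000, 32))     # aligned 32-byte loop
    print(windows_touched(0x1010, 32))     # same loop shifted by 16 bytes
```

The same 32-byte loop occupies one window when it starts at 0x1000 and two when it starts at 0x1010, which is the kind of placement effect that alignment flags are meant to control.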

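Real example: the JCC erratum (2019). Intel discovered that jump instructions crossing or ending on a 32-byte boundary could, under certain conditions, cause unpredictable behavior involving the µop cache. The microcode fix simply refuses to cache any jump (including a macro-fused compare-and-jump pair) that crosses or ends on a 32-byte boundary. Suddenly, hot loops whose branches happened to straddle these boundaries fell back to the legacy decoder and could lose 5–20% performance in the worst cases with no source change (Intel's own guidance put the typical cost at 0–4%). The GNU assembler and Clang added -mbranches-within-32B-boundaries to pad code so that jumps stay inside a single 32-byte chunk; with GCC the option is passed to the assembler as -Wa,-mbranches-within-32B-boundaries. Real workloads (Redis, MySQL benchmarks) saw measurable regressions on patched Skylake systems until rebuilt.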
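
The boundary condition is easy to state in code. The sketch below is a hypothetical helper, not part of any assembler or Intel tool, that flags a jump whose bytes cross a 32-byte boundary or whose last byte is the last byte of a 32-byte chunk. It assumes you already know the jump's address and encoded length (for example from a disassembly listing), and for a macro-fused pair you would pass the address of the first instruction and the combined length.

```python
def jcc_erratum_affected(addr, length, boundary=32):
    """Return True if an instruction at `addr` spanning `length` bytes
    crosses a `boundary`-byte line or ends exactly on one."""
    if length <= 0:
        raise ValueError("length must be positive")
    last_byte = addr + length - 1
    crosses = (addr // boundary) != (last_byte // boundary)
    ends_on_boundary = (last_byte + 1) % boundary == 0
    return crosses or ends_on_boundary


if __name__ == "__main__":
    # A 2-byte short jump at various offsets inside a 32-byte chunk.
    for addr in (0x10, 0x1E, 0x1F, 0x20):
        print(hex(addr), jcc_erratum_affected(addr, 2))
```

A 2-byte jump at offset 0x1E ends on the boundary and one at 0x1F crosses it, so both are flagged; the same jump at 0x10 or 0x20 is not.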

Rule of thumb: if a hot loop fits in ≤ ~500 bytes of code and avoids microcoded instructions, it will usually stream from the µop cache. On Intel CPUs you can confirm with perf stat -e idq.dsb_uops,idq.mite_uops: a healthy hot loop has DSB µops >> MITE (legacy decoder) µops, often a 20:1 ratio. If the MITE share is climbing, your loop has spilled out of the µop cache, usually because it grew past the capacity or hit a microcoded instruction.
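
If you want to track that ratio across builds, a small script can read it from perf's machine-readable output. The following is a rough sketch that assumes the counters were captured with perf stat -x, (comma-separated output, which perf writes to stderr) and that the count is in the first field and the event name in the third. The function name and the sample numbers are made up for illustration, and event naming can differ on other CPUs (hybrid parts, for instance, prefix the PMU name).

```python
import csv
import io


def dsb_mite_ratio(perf_csv_text):
    """Return DSB µops divided by MITE µops from `perf stat -x,` text,
    or None if either counter is missing or was not counted."""
    counts = {}
    for row in csv.reader(io.StringIO(perf_csv_text)):
        if len(row) < 3 or row[0].startswith("#"):
            continue  # blank lines, comments, or unexpected rows
        value = row[0].strip()
        event = row[2].strip().split(":")[0]  # drop modifiers such as ":u"
        if value.isdigit():  # skips "<not counted>" / "<not supported>"
            counts[event] = int(value)
    dsb = counts.get("idq.dsb_uops")
    mite = counts.get("idq.mite_uops")
    if dsb is None or mite is None:
        return None
    return float("inf") if mite == 0 else dsb / mite


if __name__ == "__main__":
    # Invented sample in the assumed CSV layout, not a real measurement.
    sample = (
        "2000000,,idq.dsb_uops,1000000,100.00,,\n"
        "100000,,idq.mite_uops,1000000,100.00,,\n"
    )
    print(dsb_mite_ratio(sample))
```

A falling ratio between two builds of the same loop is a hint to look at code size and branch placement before blaming the algorithm.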

This is why compiler flags like -falign-loops=32 exist and why micro-benchmarks of identical algorithms can differ by 15% based on link order alone. The CPU isn't running your assembly — it's running a cached translation of it.

See it in action: Check out The Fetch-Execute Cycle: What's Your Computer Actually Doing? by Tom Scott to see this theory applied.

Key Takeaway: Hot code runs from a post-decode µop cache, not the decoders — so code size, 32-byte alignment, and avoiding microcoded instructions determine front-end throughput more than instruction count does.