ArXiv Paper Digest: Enhancing Instruction Prefetching via Cache and TLB Management

Enhancing Instruction Prefetching via Cache and TLB Management

2026-05-13

Authors: Alexandre Valentin Jamet, Georgios Vavouliotis, Marti Torrents, Dimitrios Chasapis

If you've ever wondered why big server applications — databases, web services, microservices — sometimes feel sluggish even on monster hardware, a lot of the blame lands on something called the front-end of the CPU. Modern processors don't just blindly execute instructions one at a time; they aggressively fetch upcoming instructions ahead of time, hoping to keep the execution pipeline fed. That fetching process is called instruction prefetching, and when it works well, your CPU stays busy. When it doesn't, your CPU spends a shocking amount of time twiddling its thumbs.

This paper digs into why today's instruction prefetchers — even sophisticated ones — leave a lot of performance on the table for server workloads, which are notorious for having enormous instruction footprints (millions of unique instructions, not just tight inner loops).

The authors identify two specific bottlenecks:

Address translation gets in the way. CPUs use virtual memory, so before fetching an instruction the hardware needs to translate the virtual address to a physical one using something called a TLB (Translation Lookaside Buffer, essentially a cache of recent translations). When a prefetch crosses a page boundary into a page whose translation isn't already in the TLB, the prefetch has to wait for that translation, which kills the timing advantage prefetching is supposed to provide.
The L1 instruction cache wastes space on code that won't be reused soon. Not every prefetched instruction line gets used multiple times before being evicted. Treating all fetched lines as equally valuable means useful ones get kicked out by short-lived ones.
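
To make the first bottleneck concrete, here is a minimal sketch of a run of sequential prefetches walking across a page boundary. Everything in it is an assumption for illustration, not the paper's setup: the 4 KiB page size, the 64-byte line size, the two-entry LRU TLB, and the names TinyTLB and count_prefetch_tlb_misses are all hypothetical.

```python
from collections import OrderedDict

PAGE_SIZE = 4096   # assumed 4 KiB pages
LINE_SIZE = 64     # assumed 64-byte cache lines


class TinyTLB:
    """Toy fully associative TLB with LRU replacement (hypothetical model)."""

    def __init__(self, entries):
        self.entries = entries
        self.pages = OrderedDict()  # virtual page number -> True

    def lookup(self, vaddr):
        """Return True on a hit; on a miss, install the page and return False."""
        vpn = vaddr // PAGE_SIZE
        if vpn in self.pages:
            self.pages.move_to_end(vpn)      # mark as most recently used
            return True
        if len(self.pages) >= self.entries:
            self.pages.popitem(last=False)   # evict the least recently used page
        self.pages[vpn] = True
        return False


def count_prefetch_tlb_misses(prefetch_addrs, tlb):
    """Count prefetches whose page translation was not already in the TLB."""
    return sum(0 if tlb.lookup(addr) else 1 for addr in prefetch_addrs)


# Four sequential line prefetches that start near the end of page 0
# and continue into page 1.
start = PAGE_SIZE - 2 * LINE_SIZE
addrs = [start + i * LINE_SIZE for i in range(4)]
misses = count_prefetch_tlb_misses(addrs, TinyTLB(entries=2))
```

In this toy run the four addresses touch two pages, so two of the four prefetches miss in the TLB. In real hardware each such miss means waiting for a translation before the prefetch can proceed, which is exactly the lost head start described above.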

To fix this, the authors propose coordinating the instruction prefetcher more tightly with both the TLB and the L1 instruction cache. The prefetcher proactively warms up address translations before they're needed, so cross-page prefetches don't stall. And it informs the cache replacement policy about which prefetched lines are likely to be reused, so the cache can keep the valuable ones and evict the throwaway ones first.
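
The cache side of that idea can be sketched as a replacement policy that takes a reuse hint from the prefetcher. This is a toy model under stated assumptions, not the authors' design: the class HintedCacheSet, the boolean likely_reused hint, and the "low-reuse first, then oldest" victim rule are all hypothetical choices made for illustration.

```python
class HintedCacheSet:
    """Toy single cache set that evicts lines hinted as low-reuse first.

    Hypothetical model; the paper's actual policy may differ.
    """

    def __init__(self, ways):
        self.ways = ways
        self.lines = {}   # tag -> (likely_reused, last_use_time)
        self.clock = 0

    def access(self, tag, likely_reused=True):
        """Touch a line. Returns True on a hit, False on a miss (line is filled)."""
        self.clock += 1
        if tag in self.lines:
            hint, _ = self.lines[tag]
            self.lines[tag] = (hint, self.clock)   # keep the hint, refresh recency
            return True
        if len(self.lines) >= self.ways:
            # Victim choice: low-reuse lines first (False sorts before True),
            # then the least recently used among those.
            victim = min(self.lines, key=lambda t: self.lines[t])
            del self.lines[victim]
        self.lines[tag] = (likely_reused, self.clock)
        return False


# One line the prefetcher expects to be reused, then two throwaway lines.
s = HintedCacheSet(ways=2)
s.access("hot", likely_reused=True)
s.access("once_a", likely_reused=False)
s.access("once_b", likely_reused=False)   # set is full: "once_a" is evicted
hot_hit = s.access("hot")                 # "hot" survived, so this is a hit
```

Plain LRU would have evicted "hot" here, since it was the oldest line when "once_b" arrived. With the hint, the throwaway line goes first and the line that gets reused stays resident.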

The key insight is almost embarrassingly simple in hindsight: prefetching is a system-level problem, not just a prediction problem. You can perfectly predict which instructions will be needed, but if the supporting machinery — translation, cache replacement — isn't on the same page (sometimes literally), the prediction doesn't translate into speed.

Why it matters: Server workloads dominate datacenter spending, and front-end stalls are a major reason CPUs underperform on them — fixing the plumbing around the prefetcher, rather than just the prefetcher itself, is a pragmatic path to real-world speedups.
