2026-06-04
When a TLB miss happens, the page miss handler doesn't walk all four levels of an x86-64 page table from scratch. That would cost four dependent memory accesses, potentially 400+ cycles if each one goes to DRAM. Instead, modern CPUs keep a page walk cache (PWC), which Intel calls the paging-structure caches, that stores recently traversed intermediate page table entries: PML4, PDPT, and PD entries. The leaf PTE goes into the regular TLB, but the upper-level pointers live here.
The win is enormous because the upper levels of the page table have massive coverage. A single PML4 entry covers 512 GB of virtual address space. A single PDPT entry covers 1 GB. A single PD entry covers 2 MB. Even a tiny PWC with 32 entries per level reaches far: 32 PD entries span 64 MB, 32 PDPT entries span 32 GB, and 32 PML4 entries span 16 TB, so for most workloads the upper-level translations in active use stay cached.
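The coverage figures follow directly from the 4 KB page size and the 512 entries per table. A minimal sketch of that arithmetic, where the function names and the `LEVELS` table are illustrative and not taken from any real tool:

```python
PAGE_SIZE = 4 * 1024       # 4 KB base page
ENTRIES_PER_TABLE = 512    # 9 index bits per level

# Number of table levels beneath one entry of each kind.
LEVELS = {"PML4E": 3, "PDPTE": 2, "PDE": 1, "PTE": 0}


def coverage_bytes(level: str) -> int:
    """Virtual address space mapped by a single entry at `level`."""
    return PAGE_SIZE * ENTRIES_PER_TABLE ** LEVELS[level]


def reach_bytes(level: str, cached_entries: int) -> int:
    """Address space reachable through `cached_entries` cached entries."""
    return cached_entries * coverage_bytes(level)


if __name__ == "__main__":
    for name in ("PML4E", "PDPTE", "PDE"):
        print(name, coverage_bytes(name), reach_bytes(name, 32))
```

The 32-entry figure is only the illustrative size used in the text, not a measured capacity for any particular core.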
The structure on Intel Skylake-class cores:
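On a TLB miss, the walker conceptually probes the PWC starting with the deepest level: first for the PDE, then the PDPT entry, then the PML4 entry, and it resumes the walk from the deepest hit. If the PDE for the missing address is cached, the walker needs only one memory access (the leaf PTE) instead of four. When those accesses go to DRAM, that turns a ~400-cycle miss into a ~100-cycle miss.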
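The probe order can be illustrated with a toy model. This is a sketch under loud assumptions: `ToyPWC` and `split_indices` are hypothetical names, the cache is unbounded with no replacement policy, and real hardware may probe all levels in parallel and is organized differently. It only counts how many page-table reads remain after the deepest PWC hit.

```python
from typing import Set, Tuple


def split_indices(vaddr: int) -> Tuple[int, ...]:
    """Return the (PML4, PDPT, PD, PT) indices of a 48-bit virtual address.

    Each index is a 9-bit field; the low 12 bits are the page offset.
    """
    return tuple((vaddr >> shift) & 0x1FF for shift in (39, 30, 21, 12))


class ToyPWC:
    """Unbounded toy page walk cache (assumption: no capacity limit)."""

    def __init__(self) -> None:
        # Cached upper-level entries, keyed by index prefix:
        # length 1 = PML4 entry, length 2 = PDPT entry, length 3 = PD entry.
        self.cached: Set[Tuple[int, ...]] = set()

    def walk(self, vaddr: int) -> int:
        """Simulate a 4 KB-page walk; return the number of table reads."""
        idx = split_indices(vaddr)
        skipped = 0
        # Deepest hit wins: try the PD entry, then PDPT, then PML4.
        for depth in (3, 2, 1):
            if idx[:depth] in self.cached:
                skipped = depth
                break
        # The walk leaves every upper-level entry it used in the cache.
        for depth in (1, 2, 3):
            self.cached.add(idx[:depth])
        return 4 - skipped


if __name__ == "__main__":
    pwc = ToyPWC()
    base = 0x7F12_3456_7000
    print(pwc.walk(base))            # cold walk
    print(pwc.walk(base + 0x1000))   # neighbouring page, same PD entry
```

A neighbouring 4 KB page shares all three upper-level indices with the first address, so the second walk only has to read the leaf PTE.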
Real-world example: Redis serving a 40 GB key-value store with 4 KB pages. Without a PWC, every TLB miss on a cold key would chain through PML4 → PDPT → PD → PT, with each access potentially missing in L1/L2/L3. With the PWC, the PD entries covering hot regions stay resident, and most TLB misses resolve with a single PTE fetch. This is also part of why huge pages (2 MB) help so much: they not only reduce TLB pressure, they also make the PD entry the leaf, so the walk has three levels instead of four. With the PDPT entry cached in the PWC, a TLB miss then costs a single PDE fetch, which often hits in the data caches.
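To put rough numbers on the Redis example, the sketch below counts how many entries each level needs to map the dataset. It assumes a single contiguous, aligned 40 GB region (read as 40 GiB), which a real heap will not be; `entries_per_level` is an illustrative helper, not part of Redis or any kernel interface.

```python
def entries_per_level(region_bytes: int) -> dict:
    """Entries needed at each level to map a contiguous, aligned region."""

    def ceil_div(a: int, b: int) -> int:
        return -(-a // b)

    return {
        "PTE": ceil_div(region_bytes, 4 << 10),      # 4 KB per entry
        "PDE": ceil_div(region_bytes, 2 << 20),      # 2 MB per entry
        "PDPTE": ceil_div(region_bytes, 1 << 30),    # 1 GB per entry
        "PML4E": ceil_div(region_bytes, 512 << 30),  # 512 GB per entry
    }


if __name__ == "__main__":
    print(entries_per_level(40 << 30))
```

Under these assumptions the region needs about 10.5 million PTEs but only 20,480 PD entries and 40 PDPT entries. A small PWC therefore cannot hold every PD entry, only those for the hot regions, while the PDPT and PML4 levels fit easily. With 2 MB pages the PTE level disappears and the 20,480 PD entries become the leaf translations.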
Rule of thumb: A page walk costs roughly (number of levels not found in the PWC) × ~30 cycles when the page table entries themselves hit in the data caches. Cold walk = 4 levels × ~30 = ~120 cycles minimum, often 300+ with DRAM misses. Warm PWC walk = 1 level = ~30–80 cycles.
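The rule of thumb can be written as a one-line cost model. This is only a sketch of the estimate above: `walk_cost_cycles` is a hypothetical helper, and the per-level latency is a parameter because it depends on where each page table entry is found (about 30 cycles for a cache hit and about 100 for DRAM, using the rough figures from this post).

```python
def walk_cost_cycles(uncached_levels: int, per_level_cycles: int = 30) -> int:
    """Estimate page-walk latency as dependent accesses times latency each.

    `uncached_levels` is the number of levels the PWC did not cover (0 to 4).
    """
    if not 0 <= uncached_levels <= 4:
        raise ValueError("a 4-level walk has between 0 and 4 uncached levels")
    return uncached_levels * per_level_cycles


if __name__ == "__main__":
    print(walk_cost_cycles(4))        # cold walk, entries hit in cache
    print(walk_cost_cycles(4, 100))   # cold walk, every access goes to DRAM
    print(walk_cost_cycles(1))        # warm PWC, leaf PTE hits in cache
```

The model is linear because each level's address comes from the entry read at the previous level, so the accesses cannot overlap.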
The PWC is also why INVLPG is more expensive than it looks: it invalidates not just the TLB entry for the target address but also the paging-structure caches (on Intel, every entry for the current PCID, regardless of address). And it's why context switches that flush the TLB (without PCID) hurt so much: the PWC is flushed along with it.
