2026-04-22
Every memory access your program makes uses a virtual address. Before anything hits the cache or DRAM, the CPU must translate that virtual address to a physical one via the page table. The problem: page tables live in memory, and a full walk through a 4-level page table (as on x86-64) costs four sequential, dependent memory accesses. At ~100 cycles per DRAM access, that's ~400 cycles of latency for any translation that has to walk the table from DRAM. This is where the Translation Lookaside Buffer (TLB) saves you.
A TLB is a small, fully associative (or highly set-associative) cache that stores recent virtual-to-physical page mappings. Modern CPUs use a split, multi-level TLB hierarchy: separate first-level instruction and data TLBs (ITLB and DTLB), backed by a larger unified second-level TLB (STLB).
Concrete example: an Intel Alder Lake P-core has a 96-entry L1 DTLB and a 2,048-entry L2 STLB. With 4KB pages, the L1 DTLB covers 96 × 4KB = 384KB of address space; the L2 STLB covers 2,048 × 4KB = 8MB. If your randomly accessed working set exceeds 8MB, most accesses miss both TLB levels and pay for a full page walk.
Rule of thumb: A TLB miss with a 4-level page walk costs roughly 4× your L2/L3 latency (the page walk accesses are themselves cacheable in the data caches, which helps enormously). On a modern CPU, expect 20–50 cycles for a cached page walk, but 200+ cycles if the page table entries aren't in cache.
This is exactly why huge pages (2MB or 1GB) matter for performance. A 2MB huge page lets each TLB entry cover 512× more address space. That same 96-entry L1 DTLB now covers 192MB instead of 384KB. Databases like PostgreSQL and runtimes like the JVM use huge pages specifically to reduce TLB pressure.
Hardware also helps with a page walk cache (also called a Paging Structure Cache), which caches intermediate levels of the page table. Intel CPUs cache PML4, PDPT, and PD entries separately, so a "miss" often only needs to fetch the final PT level — turning a 4-access walk into a 1-access walk.
TLB shootdowns are the dirty secret of multicore: when one core updates a page table mapping, all other cores that might have cached that mapping must be interrupted via IPI (inter-processor interrupt) to invalidate their TLB entries. This is why frequent mmap/munmap in multithreaded code can silently kill performance — each unmap triggers a cross-core TLB shootdown that can stall every participating core for microseconds.
