2026-04-21
Every pointer you use in C is a virtual address. The CPU must translate it to a physical address before anything hits RAM. This translation happens through page tables — hierarchical lookup structures maintained by the OS and walked by hardware.
On x86-64, the standard is a 4-level page table (PML4 → PDPT → PD → PT). Each level is a 4KB table of 512 entries (each entry is 8 bytes, 512 × 8 = 4096), and a 9-bit index selects one entry. A 48-bit virtual address is thus sliced into four 9-bit indices (bits 47–39, 38–30, 29–21, 20–12), one per level, plus a 12-bit offset into the 4KB page (bits 11–0).
Rule of thumb: A single page table walk costs 4 memory accesses. At ~100ns per DRAM access, an uncached translation costs ~400ns — roughly 1000 CPU cycles on a 2.5GHz processor. That's catastrophic if it happens on every load/store.
This is why the TLB (Translation Lookaside Buffer) exists. It's a small, fast cache of recent virtual-to-physical mappings. A typical L1 dTLB holds 64 entries and hits in 1 cycle. L2 TLB might hold 1536 entries with ~7 cycle latency. A TLB miss triggers a hardware page walk.
Real-world consequence: Suppose you're iterating over a 256MB array. With 4KB pages, that's 65,536 pages — far exceeding TLB capacity. You'll suffer constant TLB misses. Switch to 2MB huge pages (via mmap with MAP_HUGETLB or transparent huge pages), and the same array needs only 128 entries — comfortably fitting in the L2 TLB. Database engines like PostgreSQL expose huge page configuration for exactly this reason.
Each page table entry isn't just an address. It contains permission bits: present, read/write, user/supervisor, no-execute (NX). This is how the OS enforces memory protection. Writing to a read-only page triggers a page fault (exception vector 14 on x86), which the kernel handles — either killing the process, performing copy-on-write, or loading a page from swap.
When the OS context-switches between processes, it loads a new PML4 base address into the CR3 register. Historically, this flushed the entire TLB. Modern CPUs support PCID (Process-Context Identifiers) — a 12-bit tag that lets TLB entries from different address spaces coexist, avoiding costly flushes. This became critical after the Meltdown mitigation (KPTI) doubled the frequency of CR3 switches.
You can observe TLB behavior directly with perf stat -e dTLB-loads,dTLB-load-misses,iTLB-load-misses on your binary. If dTLB misses are high relative to total loads, huge pages or restructuring your access patterns will likely buy you more than further algorithmic micro-optimization.
