2026-06-03
For thirty years, the boundary of the CPU's coherent world was the socket. Inside: cache lines, MESI, atomic ops. Outside (across PCIe): DMA, software-managed buffers, and "hope you flushed." CXL (Compute Express Link) erases that boundary by running cache-coherent protocols over PCIe physical layers.
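CXL multiplexes three protocols on the same wire: CXL.io, which is essentially PCIe semantics for discovery, configuration, and DMA; CXL.cache, which lets a device coherently cache host memory; and CXL.mem, which lets the host issue loads and stores to device-attached memory.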
The three CXL "types" are just which subset a device implements. Type 1 (smart NIC) uses .io + .cache. Type 2 (GPU/accelerator with local memory) uses all three. Type 3 (memory expander/pooling box) uses .io + .mem.
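The type-to-protocol mapping above is small enough to write down as a lookup table. The following is only an illustrative sketch: the names Proto, DEVICE_TYPES, and classify are made up for this post and do not correspond to any real CXL driver or library API.

```python
from enum import Flag, auto
from typing import Optional


class Proto(Flag):
    """The three CXL sub-protocols (names are illustrative)."""
    IO = auto()
    CACHE = auto()
    MEM = auto()


# Device type -> exact set of protocols it implements, as described above.
DEVICE_TYPES = {
    1: Proto.IO | Proto.CACHE,              # e.g. smart NIC
    2: Proto.IO | Proto.CACHE | Proto.MEM,  # e.g. accelerator with local memory
    3: Proto.IO | Proto.MEM,                # e.g. memory expander
}


def classify(protocols: Proto) -> Optional[int]:
    """Return the CXL device type for a protocol set, or None if no type matches."""
    for type_id, required in DEVICE_TYPES.items():
        if protocols == required:
            return type_id
    return None


if __name__ == "__main__":
    print(classify(Proto.IO | Proto.MEM))
```

Note that every type includes .io: it is the baseline used for enumeration and configuration, so a protocol set without it maps to no type in this sketch.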
Real example: a Type 3 CXL memory expander. You plug a card with 512 GB of DDR5 into a PCIe 5.0 x16 slot. Linux exposes it as a separate NUMA node with no CPUs, just memory. If that node happens to be node 2, you can run numactl --membind=2 ./app and the kernel allocates the app's memory from the expander. Loads and stores work normally; the CPU's home agent forwards requests over CXL.mem to the expander's controller, which reads its DRAM and ships the line back.
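To find which node number to pass to numactl, one option is to look for NUMA nodes whose CPU list is empty. The sketch below reads the standard Linux sysfs layout (/sys/devices/system/node/nodeN/cpulist); the function name cpuless_nodes and its sysfs_root parameter are hypothetical helpers, not part of any tool. A CPU-less node is not proof of CXL (other memory-only devices can show up the same way), so treat the result as a candidate list.

```python
from pathlib import Path


def cpuless_nodes(sysfs_root="/sys/devices/system/node"):
    """Return the sorted IDs of NUMA nodes that have memory but no CPUs.

    Assumes the usual sysfs layout: one directory per node named nodeN,
    each containing a 'cpulist' file that is empty for CPU-less nodes.
    """
    root = Path(sysfs_root)
    result = []
    for node_dir in root.glob("node[0-9]*"):
        cpulist_file = node_dir / "cpulist"
        if not cpulist_file.is_file():
            continue
        if cpulist_file.read_text().strip() == "":
            # Directory name is 'node' followed by the numeric ID.
            result.append(int(node_dir.name[4:]))
    return sorted(result)


if __name__ == "__main__":
    # On a machine without memory-only nodes this prints an empty list.
    print(cpuless_nodes())
```

The root path is a parameter mainly so the function can be exercised against a fake directory tree on machines that have no expander installed.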
Latency rule of thumb: local DDR5 is ~80 ns. Remote-socket NUMA is ~130 ns. CXL-attached memory on the same host is ~170–250 ns, roughly 2–3× local and comparable to a far-NUMA hop or somewhat worse. PCIe 5.0 x16 delivers ~64 GB/s per direction, which is between one and two channels' worth of DDR5 bandwidth (a single DDR5-4800 channel peaks at 38.4 GB/s), while a current server socket has eight or more channels. So CXL memory is a capacity tier, not a bandwidth tier: well suited to cold pages, in-memory database overflow, or pooled memory shared between hosts.
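The bandwidth comparison can be checked with a few lines of arithmetic. This sketch computes peak theoretical numbers only: PCIe 5.0 runs at 32 GT/s per lane with 128b/130b encoding, and a DDR5 channel moves 8 bytes per transfer. The function names are illustrative, and real CXL throughput is lower once protocol overhead is counted.

```python
def ddr5_channel_gbps(mt_per_s, bus_bytes=8):
    """Peak bandwidth of one DDR5 channel in GB/s (64-bit data bus)."""
    return mt_per_s * 1e6 * bus_bytes / 1e9


def pcie5_link_gbps(lanes=16, gt_per_s=32, encoding=128 / 130):
    """Peak per-direction bandwidth of a PCIe 5.0 link in GB/s."""
    # Each transfer carries one bit per lane; divide by 8 for bytes.
    return lanes * gt_per_s * encoding / 8


if __name__ == "__main__":
    link = pcie5_link_gbps()
    for speed in (4800, 5600, 6400):
        chan = ddr5_channel_gbps(speed)
        print(f"DDR5-{speed}: {chan:.1f} GB/s per channel, "
              f"x16 link = {link / chan:.2f} channels")
```

Against DDR5-4800 the x16 link works out to about 1.6 channels; against faster DIMMs the ratio drops closer to one.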
The killer trick is memory pooling: a rack-level CXL switch (introduced with CXL 2.0) lets multiple hosts carve slices out of a shared memory appliance. When host A's footprint shrinks, host B gets the capacity. Hyperscalers care about this because DRAM can account for up to roughly half of server cost by some published estimates, and stranded memory is pure waste.
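The bookkeeping behind pooling can be pictured with a toy model. This is a sketch of the idea only, under the assumption that capacity is handed out in whole gigabytes by a single manager; the MemoryPool class and its methods are invented for illustration and do not model the CXL fabric manager API or any real appliance.

```python
class MemoryPool:
    """Toy model of a shared memory appliance carved into per-host slices."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.assigned = {}  # host name -> GB currently assigned

    @property
    def free_gb(self):
        return self.capacity_gb - sum(self.assigned.values())

    def resize(self, host, new_gb):
        """Grow or shrink a host's slice; return False if the pool cannot cover growth."""
        current = self.assigned.get(host, 0)
        if new_gb - current > self.free_gb:
            return False
        if new_gb == 0:
            self.assigned.pop(host, None)
        else:
            self.assigned[host] = new_gb
        return True


if __name__ == "__main__":
    pool = MemoryPool(capacity_gb=1024)
    pool.resize("host-a", 768)
    print(pool.resize("host-b", 512))  # not enough free capacity yet
    pool.resize("host-a", 256)         # host A's footprint shrinks
    print(pool.resize("host-b", 512))  # host B can now take the capacity
```

The point of the model is the second resize for host B: capacity that would have been stranded inside host A's chassis becomes available to another machine.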
The coherence story matters even for Type 3: because the CPU treats CXL memory as cacheable write-back (WB) memory, you can run unmodified software. No DMA dance, no cudaMemcpy, no explicit flushes. That's the whole point: coherence outside the socket without rewriting the software.
