Daily Hardware Architecture: Write Combining Buffers: How CPUs Batch Stores to Uncached Memory

2026-05-08

When your CPU writes to write-back (WB) cacheable memory, the cache absorbs the store and life is good. But when you write to write-combining (WC) memory (typically GPU framebuffers, MMIO regions, or memory mapped via ioremap_wc()), there's no cache line to absorb the write. Naively, every store, however small, would become its own transaction on the memory bus or PCIe link, which is catastrophically slow.

Enter the write combining buffer (WCB). Each core has a small set of these: typically 4 to 10 buffers, with the exact count depending on the microarchitecture, each 64 bytes wide (one cache line). When you store to a WC region, the CPU allocates a WCB, accumulates subsequent stores to the same line into that buffer, and flushes it to the bus only when one of the following happens:

The buffer fills completely (a full 64-byte line write — the fast path)
An sfence, mfence, or serializing instruction executes
You write to a different cache line than any open WCB and all buffers are occupied (forced eviction)
An interrupt or context switch occurs
A locked instruction or I/O operation forces ordering
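
To make the flush conditions above concrete, here is a toy software model of a core's WCBs. It is a sketch under stated assumptions, not a description of any real microarchitecture: the names (wcb_model, wcb_store, wcb_fence) are hypothetical, eviction is assumed to pick the oldest-allocated buffer, and only three of the conditions are modeled (buffer full, fence, and forced eviction when all buffers are busy).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u
#define MAX_WCB    16u   /* upper bound for this toy model only */

struct wcb_model {
    unsigned num_buffers;      /* buffers available to the modeled core */
    uint64_t line[MAX_WCB];    /* line number held by each buffer */
    uint64_t mask[MAX_WCB];    /* bit i set = byte i written; 0 = buffer free */
    uint64_t born[MAX_WCB];    /* allocation timestamp, used to pick a victim */
    uint64_t clock;
    unsigned full_flushes;     /* complete 64-byte line writes */
    unsigned partial_flushes;  /* flushes of an incompletely filled buffer */
};

static void wcb_init(struct wcb_model *m, unsigned num_buffers)
{
    assert(num_buffers >= 1 && num_buffers <= MAX_WCB);
    memset(m, 0, sizeof *m);
    m->num_buffers = num_buffers;
}

/* Flush one buffer and classify the flush as full-line or partial. */
static void wcb_flush(struct wcb_model *m, unsigned i)
{
    if (m->mask[i] == 0)
        return;                            /* buffer was free */
    if (m->mask[i] == UINT64_MAX)
        m->full_flushes++;
    else
        m->partial_flushes++;
    m->mask[i] = 0;
}

/* Models sfence/mfence: every open buffer is flushed. */
static void wcb_fence(struct wcb_model *m)
{
    for (unsigned i = 0; i < m->num_buffers; i++)
        wcb_flush(m, i);
}

/* Models one store of `size` bytes that does not cross a line boundary. */
static void wcb_store(struct wcb_model *m, uint64_t addr, unsigned size)
{
    uint64_t line = addr / LINE_BYTES;
    unsigned off = (unsigned)(addr % LINE_BYTES);
    assert(size >= 1 && off + size <= LINE_BYTES);
    uint64_t bits = (size == 64 ? UINT64_MAX : (UINT64_C(1) << size) - 1) << off;

    unsigned slot = m->num_buffers;        /* sentinel: no buffer chosen yet */
    for (unsigned i = 0; i < m->num_buffers; i++)
        if (m->mask[i] != 0 && m->line[i] == line)
            slot = i;                      /* hit: line already has a buffer */

    if (slot == m->num_buffers) {
        unsigned oldest = 0;
        for (unsigned i = 0; i < m->num_buffers; i++) {
            if (m->mask[i] == 0) { slot = i; break; }   /* free buffer */
            if (m->born[i] < m->born[oldest])
                oldest = i;
        }
        if (slot == m->num_buffers) {      /* all busy: forced eviction */
            wcb_flush(m, oldest);
            slot = oldest;
        }
        m->line[slot] = line;
        m->born[slot] = ++m->clock;
    }

    m->mask[slot] |= bits;
    if (m->mask[slot] == UINT64_MAX)
        wcb_flush(m, slot);                /* buffer filled: the fast path */
}
```

With four modeled buffers, eight sequential 8-byte stores to one line produce a single full flush. Round-robin 8-byte stores across five lines, followed by a fence, produce 40 partial flushes and no full ones: every store misses, because the line it needs was evicted just before it came around again.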

The key property: WC memory is weakly ordered. Stores to the same line can be merged inside a buffer, and buffers can be flushed in a different order than the program wrote them. That is why it's fast, and why it requires explicit fences when ordering matters (e.g., before ringing a doorbell register).

Real-world example: GPU command buffers. When a Vulkan or DirectX driver writes commands to a GPU's command ring, that ring is typically mapped WC. The driver writes 64-byte command packets sequentially. If each packet is exactly 64 bytes and naturally aligned, the WCB fills with one packet and flushes as a single PCIe write: one TLP instead of 16 separate 4-byte writes. NVIDIA's drivers explicitly align command structures to 64 bytes for this reason. After writing the batch, the driver issues sfence, which drains any still-open WCBs, and only then writes the doorbell register, guaranteeing the GPU sees the commands before the doorbell.
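
The write-then-fence-then-doorbell sequence can be sketched in C roughly as follows. This is an illustration under assumptions, not any driver's actual code: struct cmd_ring, ring_submit, and the convention of writing the tail counter to the doorbell are all hypothetical, and the ring is assumed to be WC-mapped and 64-byte aligned. The only real API used is the _mm_sfence() intrinsic; on non-x86 targets a generic fence stands in so the sketch still compiles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <xmmintrin.h>                    /* _mm_sfence */
#define wc_store_fence() _mm_sfence()
#else
#include <stdatomic.h>
/* Stand-in for non-x86 builds; not a real WC flush primitive. */
#define wc_store_fence() atomic_thread_fence(memory_order_seq_cst)
#endif

#define PACKET_BYTES     64u
#define WORDS_PER_PACKET (PACKET_BYTES / sizeof(uint64_t))

/* Hypothetical ring descriptor; field names are illustrative only. */
struct cmd_ring {
    volatile uint64_t *slots;     /* ring base: assumed WC-mapped, 64-byte aligned */
    volatile uint32_t *doorbell;  /* assumed device doorbell register */
    uint32_t num_slots;           /* number of 64-byte slots in the ring */
    uint32_t tail;                /* packets submitted so far */
};

/* Write one 64-byte packet into the next slot, then ring the doorbell. */
static void ring_submit(struct cmd_ring *r, const void *packet)
{
    uint64_t words[WORDS_PER_PACKET];
    memcpy(words, packet, PACKET_BYTES);

    volatile uint64_t *dst =
        r->slots + (size_t)(r->tail % r->num_slots) * WORDS_PER_PACKET;
    assert(((uintptr_t)dst % PACKET_BYTES) == 0);   /* one packet = one line */

    /* Fill the whole line with ascending stores so that a single WCB
     * can collect all 64 bytes before it is flushed. */
    for (size_t i = 0; i < WORDS_PER_PACKET; i++)
        dst[i] = words[i];

    r->tail++;

    /* Order the packet stores ahead of the doorbell store. */
    wc_store_fence();
    *r->doorbell = r->tail;
}
```

In user space this can be exercised against ordinary aligned memory to check the copy and doorbell logic; the combining behavior itself only shows up when slots really points at a WC mapping.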

Rule of thumb: a partial WCB flush (say, 16 bytes) costs roughly the same PCIe overhead as a full 64-byte flush, so a stream of 16-byte partial flushes delivers only about a quarter of the bandwidth that full-line flushes would. If you're streaming to MMIO and seeing one-quarter of expected throughput, you're almost certainly evicting WCBs early, usually because you've interleaved stores to too many different cache lines and exhausted the available buffers.
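
The arithmetic behind that rule of thumb is small enough to write down. The helper below is a hypothetical back-of-the-envelope model, assuming (as the rule of thumb does) that every flush costs the same regardless of how many bytes it carries; real links also have per-packet header overhead that this ignores.

```c
#include <assert.h>

#define LINE_BYTES 64.0

/* Fraction of full-line throughput achieved when each flush carries
 * `bytes_per_flush` useful bytes, assuming all flushes cost the same. */
static double wc_throughput_fraction(double bytes_per_flush)
{
    assert(bytes_per_flush > 0.0 && bytes_per_flush <= LINE_BYTES);
    return bytes_per_flush / LINE_BYTES;
}
```

For 16-byte flushes this gives 16 / 64 = 0.25, which is the one-quarter-of-expected-throughput symptom described above.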

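Intel exposes a more direct mechanism with MOVDIR64B (available on Tremont and Tiger Lake, and on servers from Sapphire Rapids onward), which performs a 64-byte direct store with guaranteed 64-byte write atomicity: the write is not split up or combined with other stores the way ordinary WC stores can be. It is used, for example, to submit work descriptors to Intel DSA, and is useful whenever a 64-byte command or doorbell payload absolutely must arrive as one transaction.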

Key Takeaway: Write combining buffers turn dozens of tiny MMIO stores into single 64-byte bus transactions, but only if you fill each line completely before moving on and don't exhaust the per-core buffer count.
