Stack Overflow Unanswered: Two threads communicating across weak and strong memory models (non-cache-coherent PCIe), why a read barrier is needed?

2026-05-11

Tags: c, x86, volatile, memory-barriers, pci-e

Score: 10 | Views: 237

The asker has two threads coordinating through a shared memory-mapped flag: a RISC-V core embedded behind a PCIe link, and an x86-64 host. Each side writes 1 to signal readiness and spins reading the other's word. The puzzle is why the x86 side appears to need an explicit read barrier despite x86's famously strong TSO model, where loads are not reordered with other loads.
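The handshake as described can be sketched roughly as follows. This is only an illustration of the naive, volatile-only version that the question is about: the struct layout, the field names, and the bounded spin count are assumptions, since the question's actual offsets and code are not reproduced here.

```c
#include <stdint.h>

/* Hypothetical layout: one flag word per side, both living in the
 * shared memory-mapped window. Real offsets are not given in the source. */
struct handshake {
    volatile uint32_t host_ready;  /* written by the x86-64 host  */
    volatile uint32_t dev_ready;   /* written by the RISC-V core  */
};

/* Host side: signal readiness, then poll the device's word.
 * Returns 1 once the device flag reads as 1, or 0 if max_spins polls
 * pass without seeing it. volatile only stops the compiler from
 * eliding or merging the loads; it adds no hardware ordering. */
static int host_signal_and_wait(struct handshake *hs, unsigned long max_spins)
{
    hs->host_ready = 1;
    while (max_spins--) {
        if (hs->dev_ready == 1)
            return 1;
    }
    return 0;
}
```

On real hardware `hs` would point into the mapped BAR; the device side would run the mirror image, writing `dev_ready` and polling `host_ready`.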

The interesting part is that the usual mental model ("x86 is strongly ordered, so I only need volatile for MMIO") quietly breaks down once you cross a PCIe boundary into a region that is not part of the coherence fabric. The CPU's memory-ordering guarantees describe how its own core observes operations relative to other coherent agents. They say nothing about how a PCIe endpoint's writes propagate through posted-write buffers and the root complex before they become visible to the host.

Coherence and ordering are different problems. x86 TSO orders accesses against the cache-coherent memory subsystem. A non-coherent PCIe BAR is typically mapped as UC or WC memory, each of which has its own ordering rules, and, crucially, writes from the device may be buffered in the PCIe hierarchy.
The "barrier" on x86 isn't really fencing the CPU pipeline in the usual sense. Its practical effect is that each poll is issued as a fresh load that goes all the way to the device, rather than being satisfied from a stale value or a write-combining buffer, so that device writes which have already reached the host are reflected in the next read.
Posted writes are the silent culprit. A PCIe memory write from the RISC-V side is posted: it completes locally long before the host sees it. The reverse (a host read) is non-posted, and its completion is ordered behind earlier posted writes travelling in the same direction, which is why the host-side read is often the synchronizing event, but only if the compiler and CPU don't elide or reorder it.

The direction toward a clean answer:

Check the MTRR/PAT type of the BAR mapping on the x86 side. UC gives strict ordering per access; WC does not. The memory type matters more than any other factor here.
On x86, mfence (or a locked op) before the spin-load forces a drain of the store buffer and write-combining buffers, and keeps the following load from being satisfied early with stale data.
On RISC-V, the device side genuinely needs fence ow,ow (or similar) between data writes and the flag write, because RVWMO will happily reorder them. volatile alone is insufficient: it only stops the compiler from eliding or reordering the volatile accesses themselves, and says nothing about atomicity or about what the hardware does.
Beware that volatile is a compiler contract only; neither side gets any cross-agent ordering from it.
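The barrier placement suggested above might look like the following sketch. It assumes GCC or Clang inline assembly; the function names are hypothetical, and the `__atomic_thread_fence` fallbacks exist only so the snippet compiles on other architectures, not because they are equivalent for MMIO.

```c
#include <stdint.h>

/* Host side: full fence before each poll. mfence on x86-64;
 * the builtin is a stand-in elsewhere (assumption, not MMIO-safe). */
static inline void host_fence(void)
{
#if defined(__x86_64__)
    __asm__ __volatile__("mfence" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

/* Device side: order earlier device-output/memory writes before
 * later ones. fence ow,ow on RISC-V; a release fence stand-in elsewhere. */
static inline void dev_write_fence(void)
{
#if defined(__riscv)
    __asm__ __volatile__("fence ow,ow" ::: "memory");
#else
    __atomic_thread_fence(__ATOMIC_RELEASE);
#endif
}

/* Device side: write the payload, fence, then raise the flag, so the
 * flag write cannot be reordered ahead of the data write. */
static void dev_publish(volatile uint32_t *data, volatile uint32_t *flag,
                        uint32_t value)
{
    *data = value;
    dev_write_fence();
    *flag = 1;
}

/* Host side: fence, then load the flag; repeat up to max_spins times.
 * Returns 1 if the flag was observed, 0 on timeout. */
static int host_wait(volatile uint32_t *flag, unsigned long max_spins)
{
    while (max_spins--) {
        host_fence();
        if (*flag == 1)
            return 1;
    }
    return 0;
}
```

Whether the host-side fence is strictly required still depends on the memory type of the mapping; with a UC mapping the per-access ordering is already strict, so the fence matters most when the BAR is mapped WC.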

Gotcha worth flagging: even with correct barriers, a write from the device may sit in an intermediate switch's posted-write queue. The canonical "flush" trick is for the writer to issue a read back from the same region after the flag write — a non-posted transaction that cannot complete until the prior posted write has drained.
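The read-back flush can be sketched in a few lines. The function name is hypothetical, and the sketch assumes the flag word is readable by the writer through the same mapping; depending on the platform's I/O ordering rules, a fence between the store and the load may also be needed and is omitted here.

```c
#include <stdint.h>

/* Write the flag, then read it back through the same mapping.
 * Across PCIe the store is a posted write and the load is a
 * non-posted read that cannot overtake it, so when the load
 * returns, the write has reached its destination. */
static uint32_t write_flag_and_flush(volatile uint32_t *flag)
{
    *flag = 1;        /* posted write */
    return *flag;     /* read-back: completes only after the write drains */
}
```

The returned value is usually discarded; the point of the load is its ordering side effect, not its result.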

The challenge: x86's strong memory model is a per-core, cache-coherent guarantee — it doesn't extend across a PCIe boundary, so reasoning that works for SMP threads silently fails when one "thread" is actually a device behind a posted-write fabric.
