2026-05-11
The asker has two threads of execution coordinating through a pair of shared memory-mapped flag words: one runs on a RISC-V core embedded behind a PCIe link, the other on an x86-64 host. Each side writes 1 to its own word to signal readiness and spins reading the other's word. The puzzle is why the x86 side appears to need an explicit read barrier despite x86's famously strong TSO model, where loads are not reordered with other loads.
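For concreteness, the handshake as described might look roughly like the sketch below. The struct layout and the names mailbox, signal_ready, and wait_for_peer are hypothetical, not taken from the asker's code, and the bounded spin count is only there so the loop can be exercised on ordinary memory. This is the volatile-only version, with no hardware barriers at all.

```c
#include <stdint.h>

/* Hypothetical mailbox layout: one 32-bit word per side. In the real setup
 * this would live in a region mapped across the PCIe link. */
struct mailbox {
    volatile uint32_t host_ready;   /* written by the x86-64 host   */
    volatile uint32_t device_ready; /* written by the RISC-V core   */
};

/* Signal readiness by writing 1 to our own word. */
static void signal_ready(volatile uint32_t *mine) {
    *mine = 1;
}

/* Spin on the peer's word. Returns 1 once it reads 1, or 0 if max_spins
 * iterations pass without seeing it. volatile forces a fresh load on each
 * iteration from the compiler's point of view, and nothing more. */
static int wait_for_peer(const volatile uint32_t *theirs, unsigned max_spins) {
    while (max_spins--) {
        if (*theirs == 1)
            return 1;
    }
    return 0;
}
```

The host would call signal_ready(&mb->host_ready) and then wait_for_peer(&mb->device_ready, ...); the device does the mirror image.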
The interesting part is that the usual mental model — "x86 is strongly ordered, so I only need volatile for MMIO" — quietly breaks down once you cross a PCIe boundary into a region that is not part of the coherence fabric. The CPU's memory-ordering guarantees are about how its own core observes operations relative to other coherent agents. They say nothing about how a PCIe endpoint's writes propagate up through the root complex, posted-write buffers, and into the host's view.
The direction toward a clean answer:
- On the x86 side, mfence (or a locked op) before the spin-load forces the store buffer and write-combining buffers to drain and prevents speculative reads from being satisfied with stale data.
- On the RISC-V side, fence ow,ow (or similar) belongs between the data writes and the flag write, because RVWMO will happily reorder them. volatile alone is insufficient: it only stops the compiler from eliding or merging the accesses and says nothing about the hardware.
- volatile is a compiler contract only; neither side gets any cross-agent ordering from it.
Gotcha worth flagging: even with correct barriers, a write from the device may sit in an intermediate switch's posted-write queue. The canonical "flush" trick is for the writer to issue a read back from the same region after the flag write. That read is a non-posted transaction and cannot complete until the prior posted write has drained.
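A minimal sketch of how the barriers and the read-back fit together, assuming GCC or Clang inline assembly. The names shared_region, publish_fence, spin_fence, publish_and_flush, and wait_for_flag_fenced are made up for illustration, and the region here is an ordinary struct standing in for a mapped PCIe window. On targets other than RISC-V or x86-64 the fences fall back to a C11 seq_cst fence, which is only a placeholder: it is not guaranteed to order device I/O.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared region: a small payload plus the flag word. */
struct shared_region {
    volatile uint32_t payload[4];
    volatile uint32_t flag;
};

/* Fence for the publishing side: orders earlier writes before later ones. */
static inline void publish_fence(void) {
#if defined(__riscv)
    __asm__ volatile("fence ow,ow" ::: "memory"); /* device output + memory writes */
#else
    atomic_thread_fence(memory_order_seq_cst);    /* placeholder on other targets */
#endif
}

/* Full fence for the spinning side. */
static inline void spin_fence(void) {
#if defined(__x86_64__)
    __asm__ volatile("mfence" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst);    /* placeholder on other targets */
#endif
}

/* Writer: payload writes, fence, flag write, then a read back. */
static uint32_t publish_and_flush(struct shared_region *r, const uint32_t data[4]) {
    for (int i = 0; i < 4; i++)
        r->payload[i] = data[i];

    publish_fence();  /* keep the payload writes ahead of the flag write */
    r->flag = 1;

    /* Read back from the same region. Across a real PCIe link this is a
     * non-posted read that cannot complete until the earlier posted writes
     * have drained; on plain memory it is just a load. */
    return r->flag;
}

/* Reader: fence before each load of the flag. Returns 1 once the flag is
 * seen, or 0 after max_spins attempts. */
static int wait_for_flag_fenced(const struct shared_region *r, unsigned max_spins) {
    while (max_spins--) {
        spin_fence();
        if (r->flag == 1)
            return 1;
    }
    return 0;
}
```

Fencing on every spin iteration is the conservative choice for a sketch. Whether one fence before the loop is enough depends on how the region is mapped (uncached versus write-combining), which the question does not state.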
