2026-05-26
Out-of-order CPUs execute loads as early as possible, often before older loads to other addresses have even issued. This is a gamble: the memory consistency model (x86's TSO, ARM's weaker model) promises certain ordering guarantees to other cores. If another core's store lands in the window between your speculative load and the completion of an older load on your core, the two loads can appear to have executed out of program order, which a model like TSO forbids. The Memory Order Buffer (MOB) is the hardware that catches this and forces a rewind.
The MOB sits alongside the load and store queues. Every speculatively executed load records its address and the value it observed. When another core's coherence message invalidates a cache line (a MESI invalidation), the MOB snoops the invalidation against every in-flight load. If a load has already obtained its value from that line but has not yet retired, and an older load (in program order) hasn't completed yet, that's a potential ordering violation. On x86, where loads must appear to execute in program order, the offending younger load and everything after it get squashed and re-executed.
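The snoop check described above can be sketched as a toy software model. This is only an illustration under stated assumptions: real load queues are associative hardware structures and may squash more conservatively, and the names LoadEntry and find_squash_point are hypothetical, not taken from any vendor documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Toy model of one in-flight load (hypothetical structure, not a real design).
struct LoadEntry {
    std::uint64_t seq;   // program-order sequence number (smaller = older)
    std::uint64_t line;  // cache-line address the load touches
    bool has_value;      // true once the load has speculatively read its data
};

// `queue` holds the in-flight (not yet retired) loads, oldest first.
// Returns the index of the first load that must be squashed when another
// core invalidates `invalidated_line`, or nullopt if no ordering violation
// is possible. Everything from that index onward would be re-executed.
std::optional<std::size_t> find_squash_point(const std::vector<LoadEntry>& queue,
                                             std::uint64_t invalidated_line) {
    bool older_load_pending = false;
    for (std::size_t i = 0; i < queue.size(); ++i) {
        const LoadEntry& e = queue[i];
        assert((i == 0 || queue[i - 1].seq < e.seq) && "queue must be age-ordered");
        // A younger load already holds data from the invalidated line while
        // an older load is still outstanding: possible load-load reordering.
        if (e.has_value && e.line == invalidated_line && older_load_pending) {
            return i;
        }
        if (!e.has_value) {
            older_load_pending = true;
        }
    }
    return std::nullopt;
}
```

In this model, an invalidation that hits a line no in-flight load has read yet, or a line read only by loads with no older load outstanding, produces no squash.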
Concrete example: Intel cores from Nehalem onward expose this as a performance event, named machine_clears.memory_ordering on recent microarchitectures (the exact event name varies by generation). A classic trigger: thread A spins on while (!flag) {} while thread B writes data and then sets flag. If thread A's core speculatively loads data before its load of flag completes, and thread B's store to data arrives in between, the MOB sees the invalidation hit the already-loaded line and nukes the pipeline. You can see this counter spike on producer/consumer code where the consumer polls aggressively, sometimes thousands of machine clears per second, each one flushing the pipeline and costing tens of cycles of recovery.
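For reference, here is a minimal C++ sketch of the flag/data handoff from this example. The Mailbox type and the produce and consume functions are hypothetical names for illustration, and the 64-byte alignment assumes a 64-byte cache line.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical single-slot mailbox; not from any library.
struct Mailbox {
    alignas(64) std::atomic<bool> flag{false};  // polled by the consumer
    alignas(64) int data = 0;                    // payload, kept on its own line
};

// Thread B: write the payload, then publish it by setting the flag.
void produce(Mailbox& box, int value) {
    assert(!box.flag.load(std::memory_order_relaxed) && "slot already full");
    box.data = value;
    box.flag.store(true, std::memory_order_release);
}

// Thread A: spin until the flag is set, then read the payload.
int consume(Mailbox& box) {
    while (!box.flag.load(std::memory_order_acquire)) {
        // Busy-wait. While this load is outstanding, the core may already
        // have speculatively issued the load of box.data below.
    }
    return box.data;
}
```

Running produce and consume on separate threads reproduces the pattern. On Intel systems where the event is available under that name, a tool such as perf stat -e machine_clears.memory_ordering should show whether the spin loop is triggering clears.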
Rule of thumb: A memory ordering machine clear costs at least roughly 20–40 cycles (flush + refetch + redecode), but the real cost is the lost speculative work, often 100+ cycles of useful execution thrown away. If machine_clears.memory_ordering exceeds ~1 per 10,000 cycles, your hot loop is probably reading shared data without proper isolation (cache-line padding, or restructuring to read-mostly patterns).
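The arithmetic behind this rule of thumb is simple enough to script. The helper names below are hypothetical, and the per-clear cost is a parameter you supply (for example the 20–40 cycle figure above), not a measured value.

```cpp
#include <cassert>
#include <cstdint>

// Machine clears per 10,000 cycles, from two raw counter readings.
double clears_per_10k_cycles(std::uint64_t clears, std::uint64_t cycles) {
    assert(cycles > 0 && "need a non-zero cycle count");
    return static_cast<double>(clears) * 10000.0 / static_cast<double>(cycles);
}

// True when the rate exceeds the ~1 per 10,000 cycles threshold.
bool exceeds_rule_of_thumb(std::uint64_t clears, std::uint64_t cycles) {
    return clears_per_10k_cycles(clears, cycles) > 1.0;
}

// Lower bound on the fraction of cycles spent in recovery, given an assumed
// per-clear cost. It ignores the discarded speculative work, so the true
// overhead is higher.
double min_recovery_fraction(std::uint64_t clears, std::uint64_t cycles,
                             double cycles_per_clear) {
    assert(cycles > 0 && "need a non-zero cycle count");
    return static_cast<double>(clears) * cycles_per_clear /
           static_cast<double>(cycles);
}
```

For instance, 500 clears over 1,000,000 cycles is 5 per 10,000 cycles, well above the threshold, and at an assumed 40 cycles per clear it accounts for at least 2% of the measured cycles.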
ARM CPUs have a similar structure but a much easier job: their weaker memory model permits loads to different addresses to be reordered, so the MOB only needs to enforce explicit ordering (barriers such as DMB and DSB, and acquire loads) and same-address ordering. This is one reason ARM cores can get away with simpler, smaller MOBs than x86. TSO is genuinely expensive to enforce in hardware, and every x86 core pays for it on every speculative load.
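From the software side, the difference shows up in where ordering has to be requested explicitly. The sketch below uses a hypothetical Channel type and shows two ways to order a flag load before a payload load in portable C++. On AArch64, compilers typically lower the acquire load to a load-acquire instruction and the acquire fence to a DMB variant. On x86, both readers typically compile to plain loads, because TSO already provides the ordering and the hardware enforces it speculatively.

```cpp
#include <atomic>
#include <optional>

// Hypothetical shared channel, for illustration only.
struct Channel {
    std::atomic<bool> ready{false};
    std::atomic<int> payload{0};
};

// Variant 1: the acquire load of `ready` orders the later payload load.
std::optional<int> try_read_acquire(const Channel& ch) {
    if (!ch.ready.load(std::memory_order_acquire)) return std::nullopt;
    return ch.payload.load(std::memory_order_relaxed);
}

// Variant 2: relaxed load followed by an explicit acquire fence.
// Without the fence, a weakly ordered core would be free to satisfy the
// payload load before the flag load.
std::optional<int> try_read_fenced(const Channel& ch) {
    if (!ch.ready.load(std::memory_order_relaxed)) return std::nullopt;
    std::atomic_thread_fence(std::memory_order_acquire);
    return ch.payload.load(std::memory_order_relaxed);
}
```

Both variants assume the writer stores payload first and then sets ready with release ordering.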
