2026-05-15
The asker poses the classic load buffering (LB) litmus test: two threads each load one shared variable and then store to the other. Under sequential consistency, whichever load comes first in the global order precedes both stores and therefore reads the initial value, so reading tmp0 == tmp1 == 1 is impossible. The ARMv7 memory model architecturally permits this outcome (a store can be reordered past an earlier load to a different address), and ARMv8 retains that permission when no dependency links the load to the store. The question is whether any real ARMv8 implementation actually exhibits LB in practice.
Why this is interesting: ARMv8 tightened the architectural memory model relative to ARMv7. While loads can still be hoisted past earlier loads/stores in the abstract machine, the requirement to forbid "out-of-thin-air" values and the introduction of multi-copy atomicity (a write becomes visible to all other observers at the same time) constrain the implementation space significantly. Even where the architecture permits LB, vendors may not exploit that latitude because:
A solution approach: The asker should look at empirical surveys rather than hope for a definitive vendor answer. The canonical resources are:
litmus7 directly on a target. The LB test typically needs hundreds of millions of iterations with pinned threads on separate cores and carefully placed memory to surface.
The empirical answer, last I saw it: plain LB (independent loads and stores, no dependencies) is occasionally observed on some out-of-order ARMv8 cores, while dependency-carrying variants (LB+data+data, LB+ctrl+ctrl) are not. The Apple M-series and large Cortex-X cores are the most likely candidates because their reorder windows are huge.
Gotchas: Don't conflate "ARMv8 forbids out-of-thin-air" with "LB is forbidden"; OOTA refers to value fabrication through cyclic dependencies, not to LB itself. Also, compiler reordering can produce LB even if the hardware wouldn't, so use inline assembly when running litmus tests. Finally, multi-copy atomicity (added in ARMv8) rules out IRIW once each reader's two loads are kept in order by a dependency or barrier; it says nothing about LB.
