Stack Overflow Unanswered: Do any ARMv8 processors exhibit load buffering?

2026-05-15

Tags: arm, memory-model, armv8, relaxed-atomics

Score: 0 | Views: 74

The asker poses the classic load buffering (LB) litmus test: two threads each load one shared variable and then store 1 to the other, with each thread loading the variable the other thread stores to. Under sequential consistency, whichever load comes first in the global order precedes both stores and must read the initial value, so reading tmp0 == tmp1 == 1 is impossible. The ARMv8 memory model, like ARMv7 before it, architecturally permits this outcome when the load and store in each thread are independent (a store may become visible before a program-order-earlier load to a different address is satisfied). The question is whether any real ARMv8 implementation actually exhibits LB in practice.
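The reasoning above can be checked by brute force. The following is a minimal sketch, not a model of ARMv8: it enumerates every interleaving of the two threads against a single shared memory, first with program order kept (sequential consistency) and then with each thread's store also allowed to run before its load, a crude stand-in for the reordering the architecture permits. The variable names x and y and the helper names are assumptions made for illustration; only tmp0 and tmp1 come from the question.

```python
# Each thread loads one variable, then stores 1 to the other.
# An operation is (thread, kind, variable). x and y are illustrative names.
T0 = [(0, "load", "x"), (0, "store", "y")]
T1 = [(1, "load", "y"), (1, "store", "x")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute a schedule against one shared memory; return (tmp0, tmp1)."""
    mem = {"x": 0, "y": 0}
    tmp = {}
    for thread, kind, var in schedule:
        if kind == "load":
            tmp[thread] = mem[var]
        else:
            mem[var] = 1
    return tmp[0], tmp[1]


def outcomes(allow_reorder):
    """Set of reachable (tmp0, tmp1) results.

    allow_reorder=False keeps program order (sequential consistency).
    allow_reorder=True also lets each thread's store execute before its load.
    """
    orders0 = [T0, T0[::-1]] if allow_reorder else [T0]
    orders1 = [T1, T1[::-1]] if allow_reorder else [T1]
    results = set()
    for o0 in orders0:
        for o1 in orders1:
            for schedule in interleavings(o0, o1):
                results.add(run(schedule))
    return results


sc = outcomes(allow_reorder=False)
relaxed = outcomes(allow_reorder=True)
print("SC:     ", sorted(sc))
print("relaxed:", sorted(relaxed))
```

Under program order the (1, 1) result never appears; it becomes reachable only once a store can move ahead of the load before it. Real hardware is far subtler than this two-switch model, so treat it as an aid to intuition only.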

Why this is interesting: ARMv8 tightened the architectural memory model relative to ARMv7. While a store can still be reordered before a program-order-earlier independent load in the abstract machine, the requirement to forbid "out-of-thin-air" values and the revised model's other-multi-copy atomicity (a write becomes visible to all other observers at the same time) constrain the implementation space significantly. Even where the architecture permits LB, vendors may not exploit that latitude because:

It requires aggressive speculative store-forwarding across threads or speculative early commit of stores past unresolved loads.
Most cores retire instructions in program order and release stores to the memory system only after retirement, so a thread's store can't become globally visible until the load before it resolves.
The dependency-ordering rules (address/data/control dependencies preserve ordering) further restrict where reordering can be observed.

A solution approach: The asker should look at empirical surveys rather than hope for a definitive vendor answer. The canonical resources are:

The herd7 / litmus7 / diy7 tools from the Cambridge/INRIA memory-model group. They include catalogued results across A53, A55, A57, A72, Cortex-X, Apple M1/M2, Graviton, etc.
The papers by Pulte, Flur, Deacon et al. ("Simplifying ARM concurrency", 2018) which classify which litmus tests are observed versus merely allowed.
Run litmus7 directly on a target. The LB test typically needs hundreds of millions of iterations with pinned threads on separate cores and carefully placed memory to surface.
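As a starting point for the tooling above, here is a hedged sketch of a script that emits an AArch64 LB test in the litmus format used by herd7 and litmus7. The function name lb_litmus and its parameter are hypothetical, and the register choices follow the usual shape of generated tests rather than anything in the question. Check the output against the herdtools7 documentation before relying on it.

```python
def lb_litmus(with_data_dependency=False):
    """Return the text of an AArch64 load-buffering litmus test.

    with_data_dependency=True makes each stored value depend on the loaded
    one (EOR of a register with itself gives 0, then ADD 1), which turns
    the test into the dependency-carrying variant.
    """
    if with_data_dependency:
        name = "LB+datas"
        compute = ["EOR W2,W0,W0", "ADD W2,W2,#1"]
    else:
        name = "LB"
        compute = ["MOV W2,#1"]

    # Both threads run the same instructions; only the initial register
    # values differ, so P0 reads x and writes y while P1 reads y and writes x.
    body = ["LDR W0,[X1]"] + compute + ["STR W2,[X3]"]
    width = max(len(instr) for instr in body)

    rows = [f" {'P0':<{width}} | {'P1':<{width}} ;"]
    rows += [f" {instr:<{width}} | {instr:<{width}} ;" for instr in body]

    header = [
        f"AArch64 {name}",
        "{",
        "0:X1=x; 0:X3=y;",
        "1:X1=y; 1:X3=x;",
        "}",
    ]
    # The final condition asks whether both loads can read 1.
    footer = ["exists (0:X0=1 /\\ 1:X0=1)"]
    return "\n".join(header + rows + footer) + "\n"


print(lb_litmus())
```

Saving the output as, say, LB.litmus and passing that file to herd7 should report whether the model allows the final condition, while litmus7 turns the same file into a harness to run on the target. Exact command-line options vary by version, so take them from the tools' own documentation.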

The empirical answer, last I saw it: plain LB (independent loads and stores, no dependencies) is occasionally observed on some out-of-order ARMv8 cores, while dependency-carrying variants (LB+data+data, LB+ctrl+ctrl) are forbidden by the architecture and are not observed. The Apple M-series and large Cortex-X cores are the most likely candidates because their reorder windows are huge.
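"Not observed" is only as strong as the number of iterations behind it. The sketch below works through the arithmetic under a loudly stated assumption: iterations are independent and the outcome has a fixed per-iteration probability, which real litmus runs only approximate. The function names and the example rate of 1e-8 are hypothetical, not measurements.

```python
import math


def runs_needed(rate, confidence=0.95):
    """Iterations needed to see at least one hit with the given confidence.

    Assumes independent iterations, each hitting with probability `rate`.
    Solves 1 - (1 - rate)**n >= confidence for n.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-rate))


def rate_upper_bound(runs, confidence=0.95):
    """Upper bound on the per-iteration rate after `runs` iterations, zero hits.

    Solves (1 - p)**runs = 1 - confidence for p.
    """
    return -math.expm1(math.log(1.0 - confidence) / runs)


# Hypothetical rate: one LB outcome per hundred million iterations.
print(runs_needed(1e-8))
print(rate_upper_bound(300_000_000))
```

With these assumptions, an outcome that occurs about once per hundred million iterations needs on the order of 3e8 iterations for a 95% chance of appearing once, and a clean run of that length bounds the rate only to roughly 1e-8. It does not show the outcome is impossible.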

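Gotchas: Don't conflate "ARMv8 forbids out-of-thin-air" with "LB is forbidden": OOTA refers to value fabrication through cyclic dependencies, whereas plain LB involves no dependencies at all. Also, compiler reordering can produce LB even if the hardware wouldn't, so use inline assembly when running litmus tests. Finally, multi-copy atomicity (required by the revised ARMv8 model) rules out IRIW when the reader loads are ordered (for example IRIW+addrs), not LB; IRIW with unordered reader loads remains allowed because the loads themselves can be reordered.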

The challenge: Bridging the gap between what the ARMv8 architecture permits and what real silicon actually exhibits requires empirical litmus testing across a fleet of microarchitectures — there's no clean "yes/no" answer in the spec.
