2026-05-28
The asker has a single-producer / single-consumer ring buffer in C that runs cleanly on x86 (Intel i7) but intermittently corrupts data on a Cortex-A53. They've added memory barriers and the problem persists. This is the classic "my lock-free code works on x86 but explodes on ARM" trap.
Why it's hard. x86 has a strong memory model (TSO): stores from a single core become visible to other cores in program order, and loads are not reordered with earlier loads. So a lot of "lock-free" code that's quietly wrong appears correct on x86 because the hardware papers over missing synchronization. ARMv8 has a weak memory model — independent loads and stores can be reordered, and writes can become visible to other cores in different orders unless you tell the CPU otherwise.
Most likely root causes.
The producer needs release semantics when it writes head (so the payload writes are visible before the index update), and the consumer needs acquire semantics when it reads head (so payload reads happen after seeing the new index). A single __sync_synchronize() on one side isn't enough.
The index accesses should go through atomic_load_explicit/atomic_store_explicit with memory_order_acquire and memory_order_release.
Direction to a fix. Rewrite with C11 atomics:
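// producer: copy the payload first, then publish the new index with release
// (head is the producer's local copy of g_head; this memcpy assumes the
// n bytes do not wrap past the end of buf)
memcpy(&buf[head], src, n);
atomic_store_explicit(&g_head, (head + n) & mask, memory_order_release);
// consumer: the acquire load keeps the payload reads after the index read
size_t h = atomic_load_explicit(&g_head, memory_order_acquire);
if (h != tail) { memcpy(dst, &buf[tail], ...); ... }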
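On ARMv8 the acquire load maps to LDAR and the release store to STLR, which is exactly what you need, and it's generally cheaper than a full DMB ISH. The same pairing applies in the other direction: the consumer should publish tail with a release store and the producer should read it with an acquire load, so a slot is never overwritten while the consumer is still reading it.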
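To make the pairing concrete, here is a minimal sketch of a complete single-producer / single-consumer ring using C11 atomics. It is not the asker's code: the names ring_t, ring_push, ring_pop and RING_CAP are hypothetical, and it assumes fixed-size int slots, a power-of-two capacity, and free-running indices that are masked only when indexing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAP 8u /* hypothetical capacity; must be a power of two */
static_assert((RING_CAP & (RING_CAP - 1u)) == 0, "RING_CAP must be a power of two");

typedef struct {
    int slots[RING_CAP];
    _Atomic size_t head; /* next slot to write; stored only by the producer */
    _Atomic size_t tail; /* next slot to read; stored only by the consumer */
} ring_t;

static void ring_init(ring_t *r) {
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false when the ring is full. */
static bool ring_push(ring_t *r, int value) {
    /* Only the producer stores head, so a relaxed load of it is enough. */
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    /* Acquire pairs with the consumer's release store of tail, so a slot is
       not overwritten before the consumer has finished reading it. */
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_CAP)
        return false;
    r->slots[head & (RING_CAP - 1u)] = value;
    /* Release: the slot write above is visible before the new head is. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the ring is empty. */
static bool ring_pop(ring_t *r, int *out) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    /* Acquire pairs with the producer's release store of head. */
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[tail & (RING_CAP - 1u)];
    /* Release: the slot read above completes before the slot is handed back. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Each index has exactly one writer, which is what lets the owner read its own index with memory_order_relaxed. The sketch is only valid for one producer thread and one consumer thread; it should build with any C11 compiler (for example with -std=c11).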
Gotchas. Cortex-A53 is in-order but still reorders memory operations (it's the memory subsystem, not the pipeline, that matters). If the ring buffer is shared with a peripheral or DMA engine, you also need DMB OSH / DSB and possibly non-cacheable mappings — acquire/release only synchronize between CPU cores. And don't trust testing: a working run proves nothing on weak-memory hardware. Run TSan, or stress with perf and pin producer/consumer to different cores.
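For the peripheral/DMA case mentioned above, a rough sketch of what a device-facing barrier could look like. Everything here is an assumption for illustration: the descriptor layout, the doorbell register and the function names are hypothetical, and the descriptor memory is assumed to be DMA-coherent or mapped non-cacheable (otherwise cache maintenance is needed as well). On AArch64 it uses DMB OSHST; on other targets it falls back to a C11 fence so the sketch still compiles, but that fallback only orders between CPU threads.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Order earlier stores before later stores for all observers in the
   outer-shareable domain, which is what covers devices on AArch64. */
static inline void device_write_barrier(void) {
#if defined(__aarch64__)
    __asm__ volatile("dmb oshst" ::: "memory");
#else
    /* Fallback for other targets: CPU-to-CPU ordering only. */
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Hypothetical descriptor that a DMA engine would read. */
struct dma_desc {
    uint32_t addr;
    uint32_t len;
};

/* Fill in the descriptor, then ring the (hypothetical) doorbell register.
   The barrier keeps the descriptor stores ahead of the doorbell store. */
static void dma_submit(volatile struct dma_desc *desc,
                       volatile uint32_t *doorbell,
                       uint32_t addr, uint32_t len) {
    assert(desc != 0 && doorbell != 0);
    desc->addr = addr;
    desc->len = len;
    device_write_barrier();
    *doorbell = 1;
}
```

The shape is the same as the CPU-to-CPU case (write the payload, fence, then publish), only with a barrier whose scope reaches the device rather than just the other cores.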
