2026-05-03
When you write std::atomic<int>::compare_exchange_strong() or use LOCK CMPXCHG in x86 assembly, the CPU must guarantee that the read-modify-write sequence on that memory location appears indivisible to every other core in the system. This is not just a software convention; it requires dedicated hardware support.
How x86 implements atomics: The LOCK prefix on x86 instructions (such as LOCK ADD or LOCK CMPXCHG) originally asserted a bus lock signal (LOCK#), literally preventing other processors from accessing memory for the duration of the instruction. Modern processors are smarter. If the operand lies within a single cache line of ordinary cacheable memory, the core obtains that line in Modified or Exclusive state (in MESI terms) and simply holds it for the duration of the instruction, so no bus lock is needed. This optimization, called cache locking, is why aligned atomic operations on x86 are far cheaper than you might expect. An aligned LOCK ADD on a line already owned by the local L1 costs roughly 20 cycles on modern Intel cores, versus 50 to 200 or more cycles if the line must first be fetched from another core. An operand that straddles two cache lines (a split lock) still falls back to the much slower bus-lock path.
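As a minimal sketch of what a LOCK-prefixed instruction buys you, the C++ below has several threads increment one shared counter with fetch_add. The helper names count_with_fetch_add and counter_example are hypothetical and exist only for this illustration. On x86-64, mainstream compilers typically lower this fetch_add to a LOCK ADD or LOCK XADD, though the exact instruction depends on the compiler and flags.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper: `threads` workers each add 1 to a shared counter
// `iterations` times; the final total is returned.
std::uint64_t count_with_fetch_add(unsigned threads, std::uint64_t iterations) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> workers;
    workers.reserve(threads);
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&counter, iterations] {
            for (std::uint64_t i = 0; i < iterations; ++i) {
                // One indivisible read-modify-write. Relaxed ordering is
                // enough here because only the final total matters.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();  // join() makes every increment visible to this thread
    }
    return counter.load(std::memory_order_relaxed);
}

// Usage example: no increment is lost, however the threads interleave.
void counter_example() {
    assert(count_with_fetch_add(4, 10000) == 40000);
}
```

With a plain (non-atomic) integer the same loop would be a data race and could lose updates. All four threads hit the same cache line here, so this is also the contended case rather than the cheap one.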
How ARM/RISC-V differ (LL/SC): ARM and RISC-V use a fundamentally different primitive: Load-Linked / Store-Conditional (LDXR/STXR on AArch64, LR/SC on RISC-V). Load-Linked reads a value and sets an invisible hardware reservation covering that address, typically at cache-line granularity. Store-Conditional writes back only if the reservation is still intact. If another core wrote to the reserved region in between, or the reservation was lost for another reason such as an interrupt or context switch, the store fails and your code retries in a loop. The hardware tracks reservations using a small per-core reservation register (the local exclusive monitor, in ARM terms), typically just an address tag and a valid bit.
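The retry loop described above is visible from portable C++ through compare_exchange_weak, which is allowed to fail spuriously precisely so that it can map onto a single LL/SC attempt. The sketch below builds an atomic "maximum" on top of it; fetch_max and fetch_max_example are hypothetical names chosen for this example, not standard library functions. On AArch64 targets without LSE, compilers typically lower the loop to a load-exclusive / store-exclusive pair, but that is a code-generation detail rather than a guarantee.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: atomically raise `target` to at least `candidate`
// and return the value that was observed before any update.
int fetch_max(std::atomic<int>& target, int candidate) {
    int observed = target.load(std::memory_order_relaxed);
    // compare_exchange_weak may fail even when the value matches (for
    // example when an LL/SC reservation is lost). On failure it stores the
    // current value into `observed`, so the loop re-checks and retries.
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
        // Nothing to do here: `observed` was refreshed by the failed attempt.
    }
    return observed;
}

// Usage example (single-threaded, just to show the contract).
void fetch_max_example() {
    std::atomic<int> high_water{3};
    assert(fetch_max(high_water, 7) == 3);  // raised from 3 to 7
    assert(high_water.load() == 7);
    assert(fetch_max(high_water, 5) == 7);  // 5 is not larger, no store
    assert(high_water.load() == 7);
}
```

Using compare_exchange_strong inside a loop would also be correct, but on LL/SC machines it tends to add an inner retry loop that the outer loop already provides.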
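The tradeoff: x86's LOCK approach guarantees forward progress at the instruction level: the locked instruction always completes, even if a CMPXCHG reports that the comparison failed. LL/SC can theoretically livelock if two cores keep invalidating each other's reservations. Real hardware mitigates this, for example with randomized backoff in some implementations, and RISC-V guarantees eventual success for short, constrained LR/SC loops, but pathological cases exist. The reservation granularity is typically a full cache line (often 64 bytes), so a write to unrelated data on the same line can also break a reservation. ARM added LSE (Large System Extensions) in ARMv8.1 with CAS/LDADD instructions that behave more like x86 atomics, a concession that LL/SC alone does not always scale to large core counts.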
Rule of thumb: An atomic operation on a cache line already owned by your core costs ~20 cycles. A contended atomic where the line bounces between cores costs 50–200+ cycles, dominated by the coherence round-trip across the interconnect. At 4 GHz, a 200-cycle penalty is 50 nanoseconds, enough time for about 800 simple ALU operations on a core that sustains four per cycle.
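The arithmetic behind that rule of thumb can be checked at compile time. The helper names cycles_to_ns and alu_ops_in are hypothetical, and the four-operations-per-cycle throughput is an assumption used to reproduce the 800 figure, not a measured value.

```cpp
// Nanoseconds taken by `cycles` clock cycles at `ghz` GHz
// (1 GHz is one cycle per nanosecond).
constexpr double cycles_to_ns(double cycles, double ghz) {
    return cycles / ghz;
}

// Simple ALU operations that fit into `cycles`, assuming the core sustains
// `ops_per_cycle` of them (an illustrative assumption).
constexpr double alu_ops_in(double cycles, double ops_per_cycle) {
    return cycles * ops_per_cycle;
}

static_assert(cycles_to_ns(20.0, 4.0) == 5.0, "uncontended atomic: ~5 ns");
static_assert(cycles_to_ns(200.0, 4.0) == 50.0, "contended atomic: ~50 ns");
static_assert(alu_ops_in(200.0, 4.0) == 800.0, "work forgone while waiting");
```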
Real-world impact: This is why concurrent data structures use padding to prevent false sharing. If two independent atomics share a 64-byte cache line, every write to one invalidates the other core's copy of the line (and, on LL/SC machines, clears any reservation it holds there), inflating a 20-cycle operation to 200 cycles. Java's @Contended annotation exists specifically because of this hardware reality, and C++ code commonly uses alignas(64) or std::hardware_destructive_interference_size for the same reason.
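A minimal sketch of that padding in C++ follows. The 64-byte line size is an assumption that holds on most current x86 and many ARM cores but is not universal, and the names kCacheLine, PaddedCounter, CounterPair, byte_distance, and padding_example are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Assumed cache-line size; adjust for the target CPU.
constexpr std::size_t kCacheLine = 64;

// alignas rounds both the alignment and the size of the struct up to one
// full line, so no other object can share the line with `value`.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

// Two independent counters, e.g. each updated by a different thread.
struct CounterPair {
    PaddedCounter hits;
    PaddedCounter misses;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");
static_assert(alignof(PaddedCounter) == kCacheLine, "starts on a line boundary");

// Distance in bytes between the two counters inside one CounterPair.
std::size_t byte_distance(const CounterPair& pair) {
    const char* first = reinterpret_cast<const char*>(&pair.hits);
    const char* second = reinterpret_cast<const char*>(&pair.misses);
    return static_cast<std::size_t>(second - first);
}

// Usage example: the counters sit exactly one cache line apart.
void padding_example() {
    CounterPair pair;
    assert(byte_distance(pair) == kCacheLine);
}
```

The cost is memory: each 8-byte counter now occupies 64 bytes, which is usually acceptable for a handful of hot counters but not for large arrays of them.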
