Daily Hardware Architecture: Hardware Transactional Memory: When CPUs Pretend Multiple Things Happened at Once

2026-05-09

Hardware Transactional Memory (HTM) lets a thread mark a region of code as a transaction — the CPU executes it speculatively, tracking every line read and written. If no other core touches those lines before commit, the transaction succeeds atomically. If anyone conflicts, the CPU rolls back all changes and you fall through to a software path. It's optimistic concurrency baked into silicon.

How the hardware actually does it:

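Read set & write set tracking: Cache lines touched during the transaction are tagged with transactional bits in the cache. On Intel TSX, the write set lives in the L1D, which is why your transactional stores must fit in L1 (~32KB, and less in practice because of set associativity) or the transaction aborts. Read-set lines evicted from L1 may still be tracked by a coarser secondary structure, so the read set can be somewhat larger.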
Conflict detection rides MESI: If another core's coherence request of any kind hits a line in your write set, or a request for ownership (RFO, invalidate) hits a line in your read set, the cache controller signals an abort. No special protocol is needed: it reuses the cache coherence you already paid for.
Buffered writes: Stores during the transaction stay in the cache marked "speculative." On commit, the bits flip and writes become globally visible atomically. On abort, the lines are invalidated — your writes vanish.
Abort triggers beyond conflicts: Context switches, page faults, syscalls, interrupts, and even certain instructions (CPUID, PAUSE) abort a transaction. Transactions are inherently best-effort: the hardware never guarantees that any given transaction will eventually commit.
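
The tracking and conflict rules above can be sketched as a small software model. This is a toy illustration, not how any real cache controller is built: the names txn_model, txn_access, and txn_snoop are hypothetical, and the 16-line capacity is an arbitrary stand-in for the real L1D limit.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_LINE_BYTES 64u /* assumed cache-line size */
#define MODEL_MAX_LINES  16u /* hypothetical capacity, far smaller than a real L1D */

/* Toy model of one core's transactional state (all names are illustrative). */
typedef struct {
    uintptr_t line[MODEL_MAX_LINES];    /* tracked line numbers (address / 64) */
    bool      written[MODEL_MAX_LINES]; /* true if the line is in the write set */
    size_t    count;
    bool      aborted;
} txn_model;

static void txn_begin(txn_model *t) { memset(t, 0, sizeof *t); }

/* Returns the index of a tracked line, or t->count if it is not tracked. */
static size_t txn_find(const txn_model *t, uintptr_t line) {
    for (size_t i = 0; i < t->count; i++)
        if (t->line[i] == line) return i;
    return t->count;
}

/* A load or store by the transaction itself: tag the line, abort on overflow. */
static void txn_access(txn_model *t, uintptr_t addr, bool is_write) {
    if (t->aborted) return;
    uintptr_t line = addr / MODEL_LINE_BYTES;
    size_t i = txn_find(t, line);
    if (i == t->count) {
        if (t->count == MODEL_MAX_LINES) { t->aborted = true; return; } /* capacity abort */
        t->line[t->count++] = line; /* i now indexes the new entry */
    }
    if (is_write) t->written[i] = true;
}

/* A coherence request from another core. Any request to a write-set line
 * conflicts; a read-set line conflicts only if the remote wants ownership. */
static void txn_snoop(txn_model *t, uintptr_t addr, bool wants_exclusive) {
    size_t i = txn_find(t, addr / MODEL_LINE_BYTES);
    if (i == t->count) return; /* line not tracked: no conflict */
    if (t->written[i] || wants_exclusive) t->aborted = true;
}
```

Note that tracking happens per cache line, not per byte, so two threads touching different variables on the same line still conflict in this model, just as false sharing causes aborts on real hardware.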

The two flavors Intel shipped (TSX):

HLE (Hardware Lock Elision): Put the XACQUIRE prefix on the instruction that takes the lock and XRELEASE on the one that frees it. The CPU skips writing the lock and runs the critical section transactionally. If it commits, the lock was never taken, so multiple threads run "in" the same lock simultaneously. On abort, the CPU re-executes the region with the lock actually acquired.
RTM (Restricted Transactional Memory): Explicit XBEGIN/XEND/XABORT. You write the fallback path yourself. More flexible, more work.
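
A minimal sketch of the RTM pattern with a hand-written fallback, assuming GCC or Clang on x86. The _xbegin, _xend, and _xabort intrinsics and the CPUID feature bit are real; elided_lock, run_critical, add_one, and the retry limit of 3 are hypothetical choices made for illustration. On CPUs without RTM (or on other architectures) the sketch simply takes the lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#include <immintrin.h>
#define HTM_SKETCH_X86 1
#endif

#define HTM_SKETCH_RETRIES 3 /* arbitrary retry budget (assumption) */

typedef struct { atomic_int held; } elided_lock; /* hypothetical fallback spinlock */

/* CPUID leaf 7, subleaf 0, EBX bit 11 reports RTM support. */
static bool cpu_has_rtm(void) {
#ifdef HTM_SKETCH_X86
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
    return ((ebx >> 11) & 1u) != 0;
#else
    return false;
#endif
}

#ifdef HTM_SKETCH_X86
/* target("rtm") lets this one function use the intrinsics without -mrtm. */
static __attribute__((target("rtm")))
bool try_transactional(elided_lock *l, void (*body)(void *), void *arg) {
    for (int attempt = 0; attempt < HTM_SKETCH_RETRIES; attempt++) {
        unsigned int status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            /* Reading the lock adds it to the read set, so a thread that
             * takes the real lock later aborts this transaction. */
            if (atomic_load_explicit(&l->held, memory_order_relaxed) != 0)
                _xabort(0xff);
            body(arg);
            _xend(); /* commit: all buffered writes become visible at once */
            return true;
        }
        if (!(status & _XABORT_RETRY)) break; /* hardware hints a retry is futile */
    }
    return false;
}
#endif

/* Runs body under the lock's protection; returns true if it committed as a
 * transaction, false if it ran on the fallback path with the lock held. */
static bool run_critical(elided_lock *l, void (*body)(void *), void *arg) {
#ifdef HTM_SKETCH_X86
    if (cpu_has_rtm() && try_transactional(l, body, arg)) return true;
#endif
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(&l->held, &expected, 1,
               memory_order_acquire, memory_order_relaxed))
        expected = 0; /* spin until the lock is free */
    body(arg);
    atomic_store_explicit(&l->held, 0, memory_order_release);
    return false;
}

/* Example critical section: increment a shared counter. */
static void add_one(void *p) { ++*(long *)p; }
```

A caller would write something like run_critical(&lock, add_one, &counter). The key design point is that the transaction reads the lock word before doing any work: that is what keeps transactional threads and lock-holding threads from running the critical section at the same time.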

Concrete example: A hash table with a single lock. Under HLE, ten threads doing inserts to different buckets all elide the lock and run in parallel, with zero contention as long as their write sets don't overlap. The moment two threads hit the same bucket, one aborts and retries under the real lock. glibc's optional lock elision for pthread_mutex targeted exactly this pattern on Haswell-era CPUs, although it was implemented with RTM rather than the HLE prefixes.

Rule of thumb: Keep transactions under ~8KB of touched data and under ~10,000 cycles. Beyond that, abort rates from cache evictions and timer interrupts make the fallback path dominate, and you're slower than just taking the lock.
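
To put those budgets in hardware terms, here is a small worked calculation. It assumes 64-byte cache lines; the helper names and the 3 GHz clock used in the usage note are illustrative assumptions, not figures from any vendor.

```c
#include <stddef.h>

enum {
    BUDGET_LINE_BYTES = 64,       /* assumed cache-line size */
    BUDGET_MAX_BYTES  = 8 * 1024  /* the ~8KB rule-of-thumb footprint */
};

/* Number of cache lines needed to cover `bytes` of touched data (rounded up). */
static size_t lines_touched(size_t bytes) {
    return (bytes + BUDGET_LINE_BYTES - 1) / BUDGET_LINE_BYTES;
}

/* Converts a cycle count to microseconds at a given clock rate in GHz. */
static double cycles_to_us(double cycles, double ghz) {
    return cycles / (ghz * 1000.0);
}

/* Returns 1 if a transaction stays within both rule-of-thumb limits. */
static int within_budget(size_t bytes, double cycles) {
    return bytes <= BUDGET_MAX_BYTES && cycles <= 10000.0;
}
```

For example, lines_touched(8 * 1024) is 128 lines, a quarter of a 32KB L1D, and cycles_to_us(10000, 3.0) is roughly 3.3 microseconds, a window short enough that timer interrupts rarely land inside it.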

The cautionary tale: Intel disabled TSX via microcode on many CPUs starting in 2021, after it proved to be the substrate for the TAA (TSX Asynchronous Abort) side channel, disclosed in 2019, which leaks data across security boundaries during aborts. IBM's z/Architecture still ships HTM; POWER8 and POWER9 supported it, but IBM dropped it from POWER10. The idea isn't dead, but x86's commercial run was cut short by its own speculation leaks.

Key Takeaway: HTM piggybacks on cache coherence to make optimistic critical sections nearly free under low contention — but transactions are best-effort, bounded by L1, and on x86 became collateral damage of the speculative-execution security era.
