Daily Hardware Architecture: SMT Resource Partitioning: How Hyperthreading Splits a Core Between Two Programs


2026-06-04

Simultaneous Multithreading (Intel brands it Hyper-Threading) lets one physical core present itself as two logical cores. The trick: a single thread rarely keeps all of a core's execution resources busy, because it spends many cycles waiting on cache misses and dependencies. A second thread can use the slack. But "sharing" hides a brutal question: which structures get split, and how?

There are three partitioning strategies, and every structure in the core picks one:

Statically partitioned: each thread gets exactly half while both threads are active. Used for the ROB, store buffer, and load buffer on Intel. Why? These are in-order queues, so an entry waiting on a long-latency miss blocks everything behind it. If they were shared instead, one stalled thread could fill the whole ROB and the other thread would stall completely, and the OS scheduler couldn't tell.
Competitively shared: first come, first served. Used for the scheduler/issue queue, physical register file, and caches. One thread can dominate if it has more ready instructions. Great for throughput, terrible for fairness.
Replicated: each thread gets its own copy. Used for architectural register state, the return address stack, and instruction pointers. There's no way to share these; they define the thread.
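
To make the three strategies concrete, here is a minimal Python sketch that tags a few structures with a policy and reports how many entries one thread can count on. The entry counts are made-up round numbers rather than figures for any real CPU, and the names `Policy`, `STRUCTURES`, and `per_thread_entries` are hypothetical. It also assumes a competitively shared structure guarantees a thread nothing, which is a simplification.

```python
from enum import Enum


class Policy(Enum):
    STATIC = "statically partitioned"
    SHARED = "competitively shared"
    REPLICATED = "replicated"


# Entry counts are illustrative placeholders, not data for a specific core.
STRUCTURES = {
    "rob": (Policy.STATIC, 256),
    "load_buffer": (Policy.STATIC, 64),
    "store_buffer": (Policy.STATIC, 48),
    "scheduler": (Policy.SHARED, 96),
    "physical_regs": (Policy.SHARED, 192),
    "return_stack": (Policy.REPLICATED, 16),
}


def per_thread_entries(name, active_threads=2):
    """Return (guaranteed, maximum) entries one thread can hold."""
    policy, size = STRUCTURES[name]
    if policy is Policy.STATIC:
        share = size // active_threads  # fixed, equal split
        return share, share
    if policy is Policy.SHARED:
        return 0, size  # no guarantee, but one thread may take everything
    return size, size  # replicated: each thread owns a full copy


for name in STRUCTURES:
    print(name, per_thread_entries(name))
```

The gap between the guaranteed and maximum columns is the fairness story in miniature: static partitioning caps a thread at half, while competitive sharing lets one thread take everything.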

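With both threads active, the front-end alternates fetch between them cycle-by-cycle (or by ICOUNT, picking whichever thread has fewer in-flight instructions). When the sibling logical core is halted, the active thread gets every fetch cycle, and Intel's documented designs recombine the statically partitioned queues for it. A single-threaded benchmark can still run slightly slower with SMT enabled than with it disabled, because the sibling is rarely truly idle: any background work the OS schedules there takes fetch slots, re-splits the partitioned queues, and pushes a second working set into the competitively shared caches.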
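
The two fetch policies named above can be sketched in a few lines of Python. This is a toy model for a two-thread core, not a description of any shipping front-end: the function names are hypothetical, and breaking ICOUNT ties with round-robin is an assumption made for the sketch.

```python
def round_robin_pick(last_thread, num_threads=2):
    """Strict alternation: fetch from the thread after the last one."""
    return (last_thread + 1) % num_threads


def icount_pick(in_flight, last_thread):
    """ICOUNT-style choice: fetch for the thread with the fewest
    in-flight instructions; ties fall back to round-robin (assumed)."""
    fewest = min(in_flight)
    candidates = [t for t, n in enumerate(in_flight) if n == fewest]
    if len(candidates) == 1:
        return candidates[0]
    return round_robin_pick(last_thread, len(in_flight))


# Thread 0 is clogged with 12 in-flight instructions, thread 1 has 3:
# ICOUNT keeps feeding thread 1 regardless of who fetched last.
print(icount_pick([12, 3], last_thread=1))
# Equal counts: behave like plain alternation.
print(icount_pick([5, 5], last_thread=0))
```

The design point of ICOUNT is that a stalled thread accumulates in-flight instructions, so it automatically loses fetch priority to the thread that is making progress.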

Concrete example: Run two memory-bound threads on one SMT core with a 32 KB L1 data cache. Each thread is tuned as if it had the full L1, but the two share those 32 KB. With similar access rates, the effective per-thread cache is ~16 KB. If each working set exceeds 16 KB but would fit in 32 KB on its own, SMT can make both threads slower than running them sequentially. This is one reason HPC shops routinely disable SMT: their codes are tuned to use the whole cache.
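
The cache effect is easy to reproduce in a toy simulation. The sketch below models the L1 as a fully associative LRU cache with 64-byte lines (real L1 caches are set-associative, so this is a simplification) and has each thread scan a 24 KB working set in a loop. The helper names are hypothetical, and perfectly interleaved accesses are an assumption.

```python
from collections import OrderedDict

LINE = 64  # bytes per cache line (assumed)


def hit_rate(trace, capacity_lines):
    """Fraction of hits for a fully associative LRU cache."""
    cache = OrderedDict()
    hits = 0
    for tag in trace:
        if tag in cache:
            hits += 1
            cache.move_to_end(tag)  # mark as most recently used
        else:
            cache[tag] = True
            if len(cache) > capacity_lines:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace)


def scan(thread_id, working_set_bytes, passes):
    """Repeated sequential sweep over one thread's working set."""
    lines = working_set_bytes // LINE
    return [(thread_id, i) for _ in range(passes) for i in range(lines)]


def interleave(a, b):
    """Alternate accesses from two equal-length traces."""
    return [x for pair in zip(a, b) for x in pair]


L1_LINES = 32 * 1024 // LINE  # 512 lines
a = scan("A", 24 * 1024, passes=10)
b = scan("B", 24 * 1024, passes=10)

alone = hit_rate(a, L1_LINES)
shared = hit_rate(interleave(a, b), L1_LINES)
print(f"alone: {alone:.2f}  shared: {shared:.2f}")
```

Alone, the thread misses only on its first pass. Interleaved, the combined 48 KB footprint cycles through a 32 KB LRU cache, so every line is evicted before it is reused. Real workloads degrade less abruptly than this worst case, but the direction is the same.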

Rule of thumb: SMT gives a 15–30% throughput boost when threads are diverse (one memory-bound, one compute-bound) and they don't fight for the same execution ports. It gives 0% or negative when threads are identical and already saturating one resource: two AVX-512 threads compete for the same 512-bit FMA units (some cores have only one) and fight every cycle.

The killer detail: a thread that takes an L3 miss holds its ROB slots for ~200 cycles doing nothing. Without SMT, the core is idle. With SMT, the other thread keeps the back-end busy. SMT's real win isn't parallelism — it's latency hiding.
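
A back-of-the-envelope model shows the size of the latency-hiding effect. The sketch below assumes each thread alternates a fixed burst of useful work with a fixed stall, and that one thread's stalls overlap perfectly with the other's work, with no contention for shared structures. Both assumptions are idealizations, and the function name and the 50-cycle work burst are hypothetical.

```python
def utilization(work_cycles, stall_cycles, threads=1):
    """Fraction of cycles the back-end is busy in an idealized model:
    each thread does `work_cycles` of work, then stalls `stall_cycles`."""
    busy = threads * work_cycles
    period = work_cycles + stall_cycles
    return min(1.0, busy / period)


# 50 cycles of work between ~200-cycle L3 misses (illustrative numbers).
print(utilization(50, 200, threads=1))
print(utilization(50, 200, threads=2))
```

With these numbers the back-end is busy 20% of the time with one thread and 40% with two. The second thread doubles utilization without adding any execution hardware, which is the latency-hiding argument in numeric form. Any contention between the threads eats into that gain.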

See it in action: Check out "[2024] CPU Cores & Threads Explained in 6 Minutes" by Indigo Software to see this theory applied.

Key Takeaway: SMT is a bet that two threads' stalls will overlap with each other's work — when their resource demands collide instead, you pay the partitioning cost without gaining the latency-hiding benefit.
