2026-06-04
Simultaneous multithreading (SMT; Intel's brand name is Hyper-Threading) lets one physical core present itself as multiple logical cores, typically two. The trick: on typical code, a large share of a core's execution resources sits idle in any given cycle because of cache misses, branch mispredictions, and dependency stalls. A second thread can use the slack. But "sharing" hides a hard question: which structures get split, and how?
There are three partitioning strategies, and every structure in the core picks one:
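The front-end alternates fetch between threads cycle-by-cycle (or by ICOUNT, picking whichever thread has the fewest instructions waiting in the front-end and issue queues). This is part of why a single-threaded benchmark on an SMT-enabled core can run slightly slower than with SMT disabled: whenever the sibling logical core wakes up, even briefly for OS housekeeping or an interrupt, it takes fetch slots and its share of any partitioned structures, and competitively-shared caches get split between two working sets. When the sibling is truly halted, most designs hand its fetch slots and partitioned entries back to the running thread, so the penalty comes from that background activity rather than from alternation with an idle thread.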
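The two fetch policies can be sketched in a few lines. This is a toy model, not any real core's arbiter: `round_robin`, `icount_pick`, `simulate_icount`, and the per-thread `drain_rates` are hypothetical names and parameters invented for illustration.

```python
from itertools import cycle

def round_robin(num_threads):
    """Yield thread IDs in strict alternation: 0, 1, 0, 1, ..."""
    return cycle(range(num_threads))

def icount_pick(in_flight):
    """Return the thread with the fewest in-flight instructions.

    in_flight: per-thread instruction counts (hypothetical counters).
    Ties go to the lowest thread ID.
    """
    return min(range(len(in_flight)), key=lambda t: in_flight[t])

def simulate_icount(cycles, fetch_width, drain_rates):
    """Toy loop: each cycle, fetch for the ICOUNT winner, then let every
    thread retire up to its drain rate. Returns instructions fetched per thread."""
    in_flight = [0] * len(drain_rates)
    fetched = [0] * len(drain_rates)
    for _ in range(cycles):
        t = icount_pick(in_flight)
        in_flight[t] += fetch_width
        fetched[t] += fetch_width
        for i, rate in enumerate(drain_rates):
            in_flight[i] = max(0, in_flight[i] - rate)
    return fetched

# Thread 0 drains 3 instructions per cycle, thread 1 only 1 (say, it is
# stuck on cache misses). ICOUNT steers more fetch slots to thread 0.
fetched = simulate_icount(cycles=100, fetch_width=4, drain_rates=[3, 1])
```

The point of ICOUNT in this sketch is that a thread clogging the queues stops being fed, whereas strict alternation would keep handing it half the fetch slots.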
Concrete example: Run two memory-bound threads on one SMT core with a 32 KB L1 data cache. Each thread behaves as if it had the full L1, but the two share those 32 KB. With similar access patterns, the effective per-thread cache is ~16 KB. If each working set exceeds 16 KB but would fit in 32 KB on its own, the threads evict each other's lines, and SMT can make both slower than running them sequentially. This is one reason HPC shops often disable SMT: their codes are tuned to use the whole cache.
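The eviction effect is easy to reproduce in a toy simulation. The sketch below assumes a fully associative LRU cache with 64-byte lines and threads that sweep their working sets cyclically. Real L1 caches are set-associative with approximate LRU, so treat the numbers as an illustration of the mechanism, not a prediction. `hit_rate` and `working_set` are hypothetical helpers.

```python
from collections import OrderedDict

LINE_BYTES = 64            # assumed cache-line size
CACHE_BYTES = 32 * 1024    # the shared 32 KB L1 from the example

def hit_rate(streams, passes=20):
    """Interleave the access streams through one fully associative LRU
    cache and return the hit rate, ignoring the cold first pass.

    streams: equal-length lists of line addresses, one list per thread,
    each walked cyclically.
    """
    capacity = CACHE_BYTES // LINE_BYTES
    cache = OrderedDict()
    hits = total = 0
    for p in range(passes):
        for accesses in zip(*streams):        # one access per thread per step
            for line in accesses:
                hit = line in cache
                if hit:
                    cache.move_to_end(line)   # mark as most recently used
                else:
                    cache[line] = None
                    if len(cache) > capacity:
                        cache.popitem(last=False)  # evict least recently used
                if p > 0:                     # skip the warm-up pass
                    total += 1
                    hits += hit
    return hits / total

def working_set(base, kib):
    """Line addresses for a contiguous working set of `kib` KiB."""
    return [base + i for i in range(kib * 1024 // LINE_BYTES)]

a = working_set(0, 24)             # thread A: 24 KB
b = working_set(1_000_000, 24)     # thread B: 24 KB, disjoint lines

alone = hit_rate([a])              # A has the whole cache to itself
shared = hit_rate([a, b])          # A and B interleave on the same cache
small = hit_rate([working_set(0, 12), working_set(1_000_000, 12)])
```

In this model `alone` and `small` come out at 1.0 while `shared` drops to 0.0: a cyclic sweep over more lines than the cache holds is the worst case for LRU, so every access misses. Real workloads degrade less sharply, but the direction is the same.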
Rule of thumb: SMT typically gives a 15–30% throughput boost when threads are diverse (one memory-bound, one compute-bound) and don't fight for the same execution ports. It gives roughly zero or negative gain when threads are identical and already saturate one resource: two AVX-512 threads compete for the same 512-bit FMA units (one or two per core, depending on the part) and contend every cycle.
The killer detail: a thread that misses in L3 holds its ROB slots for roughly 200 cycles while the load goes out to DRAM. Without SMT, the core stalls once the ROB fills up behind the miss. With SMT, the other thread keeps the back-end busy. SMT's real win isn't parallelism; it's latency hiding.
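A back-of-the-envelope model makes the latency-hiding argument concrete. It assumes each thread alternates a burst of useful work with a full-length miss stall, that stalls from different threads overlap perfectly with the other thread's work, and that there is no contention for anything else. That makes it an upper bound, not a measurement. The 100-cycle compute burst is an invented figure; the 200-cycle stall is the one from the text.

```python
def backend_utilization(compute_cycles, miss_cycles, threads):
    """Upper bound on the fraction of cycles the back-end does useful work.

    Each thread repeats: `compute_cycles` of work, then a `miss_cycles`
    stall. The back-end serves one thread's work per cycle, so
    utilization is capped at 1.0.
    """
    period = compute_cycles + miss_cycles
    return min(1.0, threads * compute_cycles / period)

# 100 cycles of work per ~200-cycle L3 miss (the 100 is an assumption).
one = backend_utilization(100, 200, threads=1)   # a third of the cycles busy
two = backend_utilization(100, 200, threads=2)   # two thirds busy
```

Under these assumptions the second thread doubles back-end utilization without any extra execution hardware, purely by filling cycles the first thread would have spent waiting.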
