2026-05-24
The TSC (time stamp counter) is a 64-bit per-core register incremented on every reference clock tick. You read it with RDTSC (or RDTSCP, which also returns the contents of IA32_TSC_AUX, where Linux stores the CPU and node number). It's the foundation of clock_gettime(CLOCK_MONOTONIC) on Linux via the vDSO: no syscall, and the underlying RDTSC costs only a few tens of cycles. Everyone reaches for it when they need "real" timing. It will quietly betray you in at least four ways.
1. It doesn't tick at your CPU frequency. On Intel CPUs since Nehalem (~2008), and on comparable AMD parts, the TSC runs at a fixed reference frequency, not the dynamic core frequency. So a TSC delta of 3,000,000 ticks on a 3 GHz nominal CPU is 1 ms regardless of whether the core was boosting to 5 GHz or parked at 800 MHz. This is the "invariant TSC", advertised by CPUID.80000007H:EDX[8]. Good for wall-clock, useless if you wanted cycle counts.
2. RDTSC is not a serializing instruction. The CPU can, and will, execute RDTSC before earlier instructions have completed, or after later ones have started. Wrap a 50 ns memcpy in bare RDTSC calls and you may measure anything from 0 to 200 ns. Fence both ends: LFENCE; RDTSC (or CPUID; RDTSC) at the start, and RDTSCP; LFENCE at the end. RDTSCP is only partially serializing: it waits for prior instructions to execute but does not stop later ones from starting early, hence the trailing LFENCE. Intel publishes the recommended sequence in their benchmarking whitepaper.
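3. Cross-core skew. Every core has its own TSC, and on multi-socket systems the counters in different sockets are only as well aligned as firmware left them at boot. Linux checks for skew (dmesg | grep "tsc:"); if the check fails, the kernel marks the TSC unstable and falls back to another clocksource such as HPET, and your clock_gettime just got at least 10x slower, because it typically turns into a real syscall that reads a slow device register. On older systems with SMI handlers that touch the TSC, you can read time going backwards across cores.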
4. Virtualization. Under KVM/VMware, RDTSC may trap into the hypervisor (adding microseconds) or be offset per-VM. kvm-clock exists precisely to paper over this.
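The checks behind points 1 and 3 can be sketched in C. This is a minimal illustration, not a definitive implementation: it assumes GCC or Clang on x86 Linux, the function names has_invariant_tsc and current_clocksource are made up for this post, and the sysfs path is the standard location of the active clocksource on Linux.

```c
#include <assert.h>
#include <cpuid.h>     /* __get_cpuid (GCC/Clang, x86 only) */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if CPUID.80000007H:EDX[8] (invariant TSC) is set.
 * Hypothetical helper name, for illustration only. */
bool has_invariant_tsc(void)
{
    unsigned int eax, ebx, ecx, edx;
    /* __get_cpuid returns 0 if the CPU does not support this leaf. */
    if (!__get_cpuid(0x80000007u, &eax, &ebx, &ecx, &edx))
        return false;
    return ((edx >> 8) & 1u) != 0;
}

/* Copy the active Linux clocksource name (e.g. "tsc", "hpet") into buf.
 * Returns 0 on success, -1 if the sysfs file cannot be read.
 * Hypothetical helper name, for illustration only. */
int current_clocksource(char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    FILE *f = fopen("/sys/devices/system/clocksource/clocksource0/"
                    "current_clocksource", "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(buf, (int)len, f);
    fclose(f);
    if (ok == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';  /* strip trailing newline */
    return 0;
}
```

If the invariant bit is clear, or the clocksource is something other than tsc on bare metal, treat raw TSC deltas with suspicion. Inside a VM the hypervisor may hide the bit or select kvm-clock, so a negative result there says more about the guest configuration than about the hardware.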
Real-world example: A trading shop measures order-book update latency with naked RDTSC in a tight loop. They see "12 ns" updates and ship. In production, the real latency turns out to be closer to 400 ns: the closing RDTSC had been executing ahead of the cache miss it was supposed to be timing, so the miss never showed up in the delta. Adding LFENCE; RDTSC; LFENCE reveals reality.
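A fenced measurement along the lines of point 2 might look like the sketch below. It assumes GCC or Clang on x86-64 and uses the compiler intrinsics __rdtsc, __rdtscp and _mm_lfence from x86intrin.h; the names tsc_begin, tsc_end and time_copy are invented for this example. It only addresses CPU reordering: the compiler may still move pure computation across the timestamps, which this sketch does not try to prevent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence (x86 only) */

/* Start timestamp: LFENCE; RDTSC; LFENCE.
 * The first fence waits for earlier instructions to complete; the
 * second keeps the timed code from starting before RDTSC has run. */
static inline uint64_t tsc_begin(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* End timestamp: RDTSCP; LFENCE.
 * RDTSCP waits for the timed code to execute; the trailing fence
 * keeps later instructions from starting before the read. */
static inline uint64_t tsc_end(void)
{
    unsigned int aux;  /* receives IA32_TSC_AUX; unused here */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Example: TSC ticks (reference ticks, not core cycles) for one memcpy. */
uint64_t time_copy(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    uint64_t t0 = tsc_begin();
    memcpy(dst, src, n);
    uint64_t t1 = tsc_end();
    return t1 - t0;
}
```

A single sample still includes the cost of the timestamp reads themselves, so in practice you would take many samples and subtract the delta measured around an empty region.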
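Rule of thumb: TSC reference frequency ≈ CPU base (non-turbo) frequency. To convert ticks to nanoseconds: ns = ticks * 1,000,000 / tsc_khz, where tsc_khz is the kernel's calibrated TSC frequency in kHz. Don't take it from /proc/cpuinfo: the "cpu MHz" field there tracks the current core frequency on recent kernels. Use dmesg | grep "tsc: Refined" instead, which reports the calibrated value in MHz. Budget roughly 20 to 40 cycles per RDTSC read; never measure anything shorter than ~50 cycles without fencing both ends.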
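The tick-to-nanosecond conversion is simple arithmetic, but the naive ticks * 1,000,000 overflows 64 bits once the delta reaches a couple of hours at 3 GHz. A small sketch in C that avoids this by splitting the division; the function name tsc_ticks_to_ns is made up here, and tsc_khz is whatever value you obtained from the kernel log (the MHz figure times 1000):

```c
#include <assert.h>
#include <stdint.h>

/* Convert a TSC delta to nanoseconds: ns = ticks * 1,000,000 / tsc_khz.
 * tsc_khz is supplied by the caller (assumed to be the calibrated TSC
 * frequency in kHz). Splitting into quotient and remainder keeps the
 * intermediate product from overflowing for large deltas. */
uint64_t tsc_ticks_to_ns(uint64_t ticks, uint64_t tsc_khz)
{
    assert(tsc_khz != 0);
    uint64_t q = ticks / tsc_khz;  /* whole milliseconds */
    uint64_t r = ticks % tsc_khz;  /* leftover ticks, always < tsc_khz */
    return q * 1000000u + (r * 1000000u) / tsc_khz;
}
```

For instance, tsc_ticks_to_ns(3000000, 3000000) gives 1,000,000 ns, which matches the 1 ms figure from point 1 for a 3 GHz TSC.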
