2026-06-08
You've used RDTSC to measure cycles. It reads the 64-bit Time Stamp Counter into EDX:EAX in a few tens of cycles, far cheaper than any syscall-based clock. But there's a problem: RDTSC is not a serializing instruction. The out-of-order engine is free to execute it earlier or later than where you wrote it. Your "before" timestamp can be read after the instructions you intended to measure have already started, and your "after" timestamp can be read before the work finishes. For short regions, the measurement becomes noise.
The traditional fix was to bracket RDTSC with CPUID, which fully serializes the pipeline. It works but is brutal: CPUID can cost 200+ cycles, its cost varies by leaf, and inside a virtual machine it usually traps to the hypervisor and costs far more. All of that pollutes the very thing you're trying to measure.
RDTSCP (added with Nehalem/Barcelona) is a partial fix. It waits until all prior instructions in program order have executed before the TSC is read. It does not prevent later instructions from starting before the read. It also returns the IA32_TSC_AUX MSR in ECX, which Linux populates with the CPU number (plus the NUMA node in the upper bits), so you can detect whether you were migrated mid-measurement.
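As a minimal sketch of the migration check described above, assuming GCC or Clang on x86-64 Linux: the `__rdtscp` intrinsic from `<x86intrin.h>` returns the TSC and stores IA32_TSC_AUX through its pointer argument. The `tsc_sample` struct and the `read_tscp` and `same_cpu` helpers are hypothetical names introduced only for this illustration.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp (GCC/Clang, x86 only) */

/* Hypothetical helper type: one TSC reading plus the IA32_TSC_AUX
 * value that RDTSCP returned alongside it. */
struct tsc_sample {
    uint64_t tsc;
    uint32_t aux;
};

/* Read the TSC with RDTSCP. Prior instructions have executed by the
 * time the counter is read; later ones may still start early. */
static inline struct tsc_sample read_tscp(void)
{
    struct tsc_sample s;
    unsigned int aux;
    s.tsc = __rdtscp(&aux);
    s.aux = aux;
    return s;
}

/* Returns 1 if both samples carry the same IA32_TSC_AUX value.
 * Assumption: the OS writes a per-CPU value there, as Linux does. */
static inline int same_cpu(struct tsc_sample a, struct tsc_sample b)
{
    return a.aux == b.aux;
}
```

A caller would take one sample before the work and one after, and discard the measurement if `same_cpu` returns 0.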
The canonical recipe for measuring a region:
1. CPUID (serialize), then RDTSC (read start). The CPUID makes sure earlier instructions have finished, so their latency cannot leak into the measured region and the start read cannot be hoisted above them.
2. RDTSCP (waits for the measured work to finish, then reads the end), then CPUID (so later instructions cannot start before the end read).

This is the sequence prescribed by Intel's whitepaper "How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures". The Linux kernel takes a lighter approach: its rdtsc_ordered() helper (historically in arch/x86/include/asm/msr.h) orders the read with LFENCE, or uses RDTSCP where available, rather than paying for CPUID. DPDK likewise reads the TSC directly (rte_rdtsc() and rte_rdtsc_precise()) for sub-microsecond timing where syscall overhead would dwarf the measurement.
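The bracket described above can be sketched with GCC/Clang inline assembly on x86-64. This follows the CPUID/RDTSC and RDTSCP/CPUID ordering from the recipe; the function names `tsc_begin` and `tsc_end` are hypothetical, and zeroing EAX before CPUID is a choice made here so that the same leaf is queried every time.

```c
#include <stdint.h>

/* Start of the measured region: CPUID serializes, then RDTSC reads.
 * Hypothetical helper name; x86-64, GCC/Clang inline asm. */
static inline uint64_t tsc_begin(void)
{
    uint32_t hi, lo;
    __asm__ __volatile__(
        "xor %%eax, %%eax\n\t"   /* CPUID leaf 0, so the cost is consistent */
        "cpuid\n\t"              /* serialize: earlier work has finished */
        "rdtsc\n\t"              /* start timestamp in EDX:EAX */
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        : "=r"(hi), "=r"(lo)
        :
        : "rax", "rbx", "rcx", "rdx", "cc", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* End of the measured region: RDTSCP waits for prior instructions,
 * then CPUID stops later instructions from starting before the read. */
static inline uint64_t tsc_end(void)
{
    uint32_t hi, lo;
    __asm__ __volatile__(
        "rdtscp\n\t"             /* end timestamp in EDX:EAX, aux in ECX */
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        "xor %%eax, %%eax\n\t"
        "cpuid\n\t"              /* serialize after the read */
        : "=r"(hi), "=r"(lo)
        :
        : "rax", "rbx", "rcx", "rdx", "cc", "memory");
    return ((uint64_t)hi << 32) | lo;
}
```

Typical use is `uint64_t t0 = tsc_begin(); work(); uint64_t dt = tsc_end() - t0;`. The bracket has its own cost, so it is worth timing an empty region first and subtracting that baseline from later results.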
Rule of thumb for choosing your timer:
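- Short regions, where tens of nanoseconds of timer overhead matter: RDTSCP + CPUID bracket. Pin the thread (sched_setaffinity) and check that the ECX value returned by RDTSCP matches at start and end.
- Everything else: clock_gettime(CLOCK_MONOTONIC) via the vDSO is fine; the ~20 ns of overhead is negligible.

The trap: the TSC frequency is not the CPU frequency on modern parts. With an invariant TSC it ticks at a fixed rate, close to the nominal "base" clock, regardless of turbo or P-state. Convert using the TSC frequency itself: CPUID leaf 0x15 where the CPU enumerates it, the kernel's own calibration (logged in dmesg as "tsc: Refined TSC clocksource calibration"), or /sys/devices/system/cpu/cpu0/tsc_freq_khz on kernels that export it (mainline Linux does not). Do not use the MHz reported by /proc/cpuinfo.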
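To make the conversion concrete, here is a hedged sketch, assuming GCC or Clang on x86-64, that derives the TSC frequency from CPUID leaf 0x15 and converts a cycle count to nanoseconds. The helper names `tsc_hz_from_cpuid` and `cycles_to_ns` are hypothetical. Many CPUs leave part of leaf 0x15 as zero or do not implement the leaf at all, in which case the function returns 0 and another source is needed.

```c
#include <assert.h>
#include <cpuid.h>    /* __get_cpuid_count (GCC/Clang, x86 only) */
#include <stdint.h>

/* TSC frequency in Hz from CPUID leaf 0x15, or 0 if the CPU does not
 * enumerate it. EAX = ratio denominator, EBX = ratio numerator,
 * ECX = core crystal clock in Hz; any of them may read as 0. */
static inline uint64_t tsc_hz_from_cpuid(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(0x15, 0, &eax, &ebx, &ecx, &edx))
        return 0;                       /* leaf not supported */
    if (eax == 0 || ebx == 0 || ecx == 0)
        return 0;                       /* leaf present but incomplete */
    return (uint64_t)ecx * ebx / eax;
}

/* Convert TSC ticks to nanoseconds at the given TSC frequency.
 * Whole seconds and the remainder are handled separately so the
 * multiplication does not overflow for frequencies below ~18 GHz. */
static inline uint64_t cycles_to_ns(uint64_t cycles, uint64_t tsc_hz)
{
    assert(tsc_hz > 0);
    uint64_t secs = cycles / tsc_hz;
    uint64_t rem  = cycles % tsc_hz;
    return secs * 1000000000ull + rem * 1000000000ull / tsc_hz;
}
```

When `tsc_hz_from_cpuid` returns 0, one fallback is to calibrate once at startup by reading the TSC around a known interval measured with clock_gettime(CLOCK_MONOTONIC).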
