Daily Low-Level Programming: The FS and GS Segment Registers in Long Mode: How One "Useless" Register Powers Every Thread-Local Variable

2026-06-08

x86-64 dropped segmentation. Mostly. In 64-bit mode the CPU treats the bases of CS, DS, ES, and SS as 0 and skips limit checks; a descriptor with a non-zero base still loads, but the base is silently ignored. Long mode is flat. Except for two registers: FS and GS. These two kept their base address behavior, and they became the foundation of thread-local storage and per-CPU kernel data.

The trick: when an instruction uses an FS or GS segment prefix, the CPU adds the segment's base address (a full 64-bit value held in the hidden part of the segment register and exposed through an MSR) to the address calculation. So mov rax, fs:[0x10] doesn't read address 0x10; it reads the 8 bytes at FS_BASE + 0x10. This is a single-instruction indirection through a per-thread or per-CPU pointer, and it ties up no general-purpose register to hold that pointer.
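To make the address arithmetic concrete, here is a minimal sketch for x86-64 Linux with GCC or Clang inline assembly. It relies on the x86-64 TLS ABI convention that the word at fs:[0] holds the thread pointer (the value of FS_BASE itself). The function names are made up for this illustration.

```c
#include <stdint.h>

/* Thread pointer: the x86-64 Linux ABI stores FS_BASE itself at fs:[0]. */
uintptr_t fs_self(void) {
    uintptr_t tp;
    __asm__ volatile ("movq %%fs:0, %0" : "=r"(tp));
    return tp;
}

/* The same load as `mov rax, fs:[0x10]`, written in AT&T syntax. */
uint64_t load_fs_0x10(void) {
    uint64_t value;
    __asm__ volatile ("movq %%fs:0x10, %0" : "=r"(value));
    return value;
}

/* What the CPU computes for that instruction, spelled out by hand:
 * take FS_BASE, add 0x10, do an ordinary flat-address load. */
uint64_t load_flat_0x10(void) {
    return *(const volatile uint64_t *)(fs_self() + 0x10);
}
```

Calling load_fs_0x10() and load_flat_0x10() from the same thread should return the same value; the first does it in one instruction, the second needs an extra load and a register to hold the base.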

The bases live in MSRs: IA32_FS_BASE (0xC0000100) and IA32_GS_BASE (0xC0000101). GS has a third: IA32_KERNEL_GS_BASE (0xC0000102), which the SWAPGS instruction exchanges with the active GS_BASE; the kernel executes SWAPGS when it is entered from user mode and again before it returns. WRMSR is privileged, so changing a base from user space used to require a syscall (arch_prctl on Linux). Newer CPUs (Intel since Ivy Bridge) added RDFSBASE/WRFSBASE and RDGSBASE/WRGSBASE, so user code can do it in one instruction if the kernel sets the FSGSBASE bit in CR4, which Linux does since 5.9.
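The two user-space routes to FS_BASE can be sketched as follows, assuming x86-64 Linux with kernel headers installed. The HWCAP2_FSGSBASE check follows the kernel's own documentation for detecting whether the FSGSBASE instructions are enabled; the function names are hypothetical.

```c
#define _GNU_SOURCE
#include <asm/prctl.h>    /* ARCH_GET_FS */
#include <elf.h>          /* AT_HWCAP2 */
#include <stdint.h>
#include <sys/auxv.h>     /* getauxval */
#include <sys/syscall.h>  /* SYS_arch_prctl */
#include <unistd.h>       /* syscall */

#ifndef HWCAP2_FSGSBASE
#define HWCAP2_FSGSBASE (1 << 1)  /* value from the kernel's x86 hwcap2.h */
#endif

/* Old route: ask the kernel to read the base for us. Returns 0 on failure. */
uint64_t fs_base_via_syscall(void) {
    unsigned long base = 0;
    if (syscall(SYS_arch_prctl, ARCH_GET_FS, &base) != 0)
        return 0;
    return base;
}

/* Nonzero if the kernel has enabled CR4.FSGSBASE for user space. */
int fsgsbase_enabled(void) {
    return (getauxval(AT_HWCAP2) & HWCAP2_FSGSBASE) != 0;
}

/* New route: one unprivileged instruction.
 * Only call this when fsgsbase_enabled() is true; otherwise it faults. */
uint64_t fs_base_via_rdfsbase(void) {
    uint64_t base;
    __asm__ volatile ("rdfsbase %0" : "=r"(base));
    return base;
}
```

On a kernel that enables FSGSBASE, both routes should report the same address. The sketch only reads the base: overwriting FS_BASE in a running glibc process would break every subsequent TLS access.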

Concrete example, pthread TLS on Linux: When glibc creates a thread, it allocates a Thread Control Block (TCB) and sets FS_BASE to point at it. Given a declaration like __thread int errno;, a read of the variable compiles to something like:

mov eax, dword ptr fs:[errno@tpoff]

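where errno@tpoff is an offset from FS_BASE into the thread's TLS block, fixed at link time. On x86-64 the TLS data sits just below the TCB, so the offset is negative. No table lookup, no atomic, no syscall: just FS_BASE + offset. This holds for code in the main executable (the local-exec TLS model); shared libraries first fetch the offset from the GOT, and dlopen'ed libraries may have to call __tls_get_addr. The kernel uses GS the same way: mov rax, gs:[per_cpu_offset] reads the current CPU's per-CPU variable in one instruction, which is why this_cpu_read() in the Linux kernel is essentially free.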
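The "FS_BASE + offset" claim can be checked from C. This is a sketch for x86-64 Linux with GCC or Clang, assuming the variable lives in the main executable's static TLS block; tls_value and the function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* A thread-local variable; each thread gets its own copy, initialized to 42. */
static __thread int tls_value = 42;

/* Thread pointer (FS_BASE), which the ABI stores at fs:[0]. */
static uintptr_t thread_pointer(void) {
    uintptr_t tp;
    __asm__ volatile ("movq %%fs:0, %0" : "=r"(tp));
    return tp;
}

/* Signed distance from FS_BASE to this thread's copy of tls_value.
 * This plays the role of the @tpoff constant the linker computes. */
ptrdiff_t tls_value_offset(void) {
    return (ptrdiff_t)((uintptr_t)&tls_value - thread_pointer());
}

/* Load an int from FS_BASE + offset with a single fs-prefixed instruction. */
int load_int_fs(ptrdiff_t offset) {
    int value;
    __asm__ volatile ("movl %%fs:(%1), %0"
                      : "=r"(value)
                      : "r"(offset)
                      : "memory");
    return value;
}
```

load_int_fs(tls_value_offset()) reads the same memory as a plain use of tls_value. The compiler's version is shorter still, because it bakes the offset into the instruction as an immediate instead of passing it in a register.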

Rule of thumb: Any access of the form fs:[constant] or gs:[constant] costs about the same as a normal memory load, roughly 4 to 5 cycles on an L1 hit. An access through a TLS pointer you first loaded into a register costs the same plus the register pressure. So __thread variables are close to free, while pthread_getspecific(), which goes through a function call and an array lookup, can be roughly 20x slower in a tight loop; the exact ratio depends on the libc and the CPU.
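For comparison, here are the two access paths side by side, as a sketch assuming a POSIX threads implementation such as glibc (older glibc versions need -pthread at link time). The counter names and helper functions are hypothetical, and the code makes no timing claim of its own.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Path 1: compiler-managed TLS, accessed fs-relative. */
static __thread uint64_t tls_hits;

/* Path 2: a pthread key, resolved by a library call on every access. */
static pthread_key_t hits_key;
static pthread_once_t hits_once = PTHREAD_ONCE_INIT;

static void make_key(void) {
    /* free() runs on the per-thread slot when the thread exits. */
    pthread_key_create(&hits_key, free);
}

/* Increment the __thread counter n times and return its value. */
uint64_t bump_tls(int n) {
    for (int i = 0; i < n; i++)
        tls_hits++;
    return tls_hits;
}

/* Increment the key-based counter n times and return its value. */
uint64_t bump_key(int n) {
    pthread_once(&hits_once, make_key);
    if (pthread_getspecific(hits_key) == NULL) {
        uint64_t *fresh = calloc(1, sizeof *fresh);
        if (fresh == NULL)
            return 0;  /* out of memory: report zero hits */
        pthread_setspecific(hits_key, fresh);
    }
    for (int i = 0; i < n; i++) {
        /* Function call plus slot lookup on every iteration. */
        uint64_t *slot = pthread_getspecific(hits_key);
        (*slot)++;
    }
    return *(uint64_t *)pthread_getspecific(hits_key);
}
```

Wrapping each call in clock_gettime(CLOCK_MONOTONIC, ...) with a large n gives a rough ratio for your own machine. Note that an optimizing compiler may collapse the bump_tls loop into a single addition, which flatters the __thread side even further.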

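The kernel's SWAPGS on entry from and exit to user mode also opened a Spectre-class hole (CVE-2019-1125, a Spectre v1 variant rather than Meltdown). Some entry paths run SWAPGS only behind a conditional branch, so a mispredicted branch can speculatively skip it or execute it when it should not run. The kernel then speculatively performs a gs-relative per-CPU load against a user-controlled GS base, and the result can leak through a cache side channel.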

Key Takeaway: FS and GS survived x86-64's flat-memory purge specifically because their base-address behavior gives you a free per-thread or per-CPU pointer in a single instruction — every __thread variable and every kernel per-CPU variable rides on this.
