2026-05-01
When you write __thread int errno; or thread_local int counter = 0;, each thread gets its own independent copy of that variable. But how? The variable lives at a fixed offset, yet each thread sees a different value. The trick is a dedicated segment register pointing to per-thread memory.
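The Mechanism on x86-64 Linux: The base address of the FS segment register points to each thread's Thread Control Block (TCB), and the thread's TLS data sits at fixed offsets from that base. The kernel does not choose this address itself. When a new thread is created, the threading library allocates the TLS block and passes its address to clone() with the CLONE_SETTLS flag, and the kernel installs it as that thread's FS base. For the initial thread, the C runtime sets the FS base during startup with arch_prctl(ARCH_SET_FS, addr). Every TLS access compiles down to something like: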
mov eax, dword ptr fs:[variable_offset]
Same instruction in every thread, same offset — but FS points somewhere different, so each thread reads its own copy. On ARM64, the equivalent is the TPIDR_EL0 register, accessed via mrs x0, tpidr_el0.
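To make the "same offset, different base" idea concrete, here is a minimal sketch using POSIX threads. The names counter, tls_report, worker and run_two_workers are invented for this illustration. Each worker increments the same TLS variable and reports the address and final value of its own copy.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* One logical variable; every thread gets its own instance. */
static __thread int counter = 0;

/* What each worker reports back (hypothetical helper type). */
struct tls_report {
    int increments;      /* input: how many times to bump the counter */
    uintptr_t address;   /* output: address of this thread's copy */
    int final_value;     /* output: value of this thread's copy */
};

static void *worker(void *arg)
{
    struct tls_report *r = arg;
    assert(r->increments >= 0);
    for (int i = 0; i < r->increments; i++)
        counter++;                       /* touches only this thread's copy */
    r->address = (uintptr_t)&counter;
    r->final_value = counter;
    return NULL;
}

/* Runs two workers to completion and fills in both reports.
   Returns 0 on success, -1 if a thread could not be created or joined. */
int run_two_workers(struct tls_report *a, struct tls_report *b)
{
    pthread_t ta, tb;

    if (pthread_create(&ta, NULL, worker, a) != 0)
        return -1;
    if (pthread_create(&tb, NULL, worker, b) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    /* Both threads are created before either is joined, so the
       resources of the first are still held when the second starts. */
    if (pthread_join(ta, NULL) != 0 || pthread_join(tb, NULL) != 0)
        return -1;
    return 0;
}
```

Calling run_two_workers() with increments of 3 and 5 should leave final values of 3 and 5 and two different addresses, even though both threads executed the same counter++ code.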
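The Four TLS Models: The compiler and linker choose an access model based on where the TLS variable is defined and how the code that uses it is built (executable or shared library):
Local Exec: the variable is defined in the executable itself, so its offset from the thread pointer is a link-time constant. Fastest.
Initial Exec: the variable lives in a library loaded at startup. Its offset is fixed at load time, read from the GOT, and applied to the thread pointer.
Local Dynamic: one call to __tls_get_addr() per module, then offsets from that base.
General Dynamic: a call to __tls_get_addr() for every access. Slowest.
You can force a model with -ftls-model=initial-exec, but use this carefully: a library built this way can fail to load if it is dlopen()'d rather than linked at startup, because its TLS has to fit into the limited static TLS space reserved when the process started.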
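Besides the command-line flag, GCC and Clang also accept a per-variable tls_model attribute, which limits the choice to the variables that need it. The sketch below assumes a GCC-compatible compiler; the variable and function names are hypothetical.

```c
#include <assert.h>

/* No annotation: the toolchain picks the cheapest model that is valid for
   the way this translation unit is compiled and linked. */
static __thread int default_model_hits;

/* Per-variable request for Initial Exec (GCC/Clang extension). The same
   dlopen() caveat applies as with -ftls-model=initial-exec. */
static __thread int initial_exec_hits
    __attribute__((tls_model("initial-exec")));

/* Adds step to both counters in the calling thread; returns their sum. */
int bump_both(int step)
{
    assert(step > 0);
    default_model_hits += step;
    initial_exec_hits += step;
    return default_model_hits + initial_exec_hits;
}
```

Comparing the disassembly of bump_both() when compiled with and without -fPIC is a quick way to see which model was actually chosen for each variable.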
Real-world example: glibc's errno is TLS. System call wrappers store the error code in errno on failure without taking a lock, because each thread has its own copy. If errno were a plain global, multithreaded programs would be fundamentally broken: one thread's failed open() could overwrite errno just before another thread inspected it after its own failed read().
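The per-thread nature of errno can be observed directly. In this sketch one thread clears its errno and then busy-waits (deliberately making no library calls) while a second thread performs a failing open(). The path and all helper names are made up for the example, and the path is assumed not to exist.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <unistd.h>

struct errno_demo {
    atomic_int observer_ready;  /* observer has cleared its errno */
    atomic_int failer_done;     /* failer has finished its open() */
    int observer_errno;         /* errno seen by the observer afterwards */
    int failer_errno;           /* errno seen by the failing thread */
};

static void *observer(void *arg)
{
    struct errno_demo *d = arg;
    errno = 0;
    atomic_store(&d->observer_ready, 1);
    while (!atomic_load(&d->failer_done))
        ;                       /* spin: nothing here can touch errno */
    d->observer_errno = errno;
    return NULL;
}

static void *failer(void *arg)
{
    struct errno_demo *d = arg;
    while (!atomic_load(&d->observer_ready))
        ;
    /* Hypothetical path, assumed not to exist on the test machine. */
    int fd = open("/nonexistent-dir/tls-demo", O_RDONLY);
    d->failer_errno = (fd == -1) ? errno : 0;
    if (fd != -1)
        close(fd);
    atomic_store(&d->failer_done, 1);
    return NULL;
}

/* Returns 0 once both threads have run, -1 on thread creation failure. */
int run_errno_demo(struct errno_demo *d)
{
    pthread_t a, b;

    assert(d != NULL);
    atomic_init(&d->observer_ready, 0);
    atomic_init(&d->failer_done, 0);
    d->observer_errno = d->failer_errno = -1;

    if (pthread_create(&a, NULL, observer, d) != 0)
        return -1;
    if (pthread_create(&b, NULL, failer, d) != 0) {
        atomic_store(&d->failer_done, 1);   /* release the observer */
        pthread_join(a, NULL);
        return -1;
    }
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```

After run_errno_demo() returns, failer_errno should hold ENOENT while observer_errno is still 0: the failure in one thread never reached the other thread's copy.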
Performance rule of thumb: Local Exec TLS access typically costs a single instruction (a segment-relative load), essentially the same as a regular global. General Dynamic adds a function call to __tls_get_addr() on each access, roughly 10–20ns of overhead. If you're accessing TLS in a hot loop from a dlopen()'d library, consider caching the pointer locally:
int *p = &my_tls_var; for (...) { *p += ...; }
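The one-line sketch above can be expanded into a compilable function. This is only an illustration: running_total and accumulate are hypothetical names, and an optimizing compiler will often hoist the address computation on its own, so the explicit pointer mainly makes the intent certain.

```c
#include <assert.h>
#include <stddef.h>

/* Per-thread accumulator (hypothetical example variable). */
static __thread long running_total;

/* Adds n values to the calling thread's running_total and returns the
   new total. The TLS address is resolved once, before the loop. */
long accumulate(const int *values, size_t n)
{
    assert(values != NULL || n == 0);

    long *total = &running_total;   /* at most one TLS lookup */
    for (size_t i = 0; i < n; i++)
        *total += values[i];        /* plain pointer access in the loop */
    return *total;
}
```

The cached pointer refers to the calling thread's copy only. It should not be handed to another thread with the expectation that it reaches that thread's copy, and it should not be kept after the owning thread has exited.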
Gotcha: Each thread's TLS block for the executable and for libraries loaded at startup is allocated during thread creation. If you dlopen() a library with TLS variables after threads already exist, the runtime must lazily allocate that library's TLS block for each of those threads, typically on first access. This is one reason __tls_get_addr() is non-trivial and may call malloc() internally.
