Daily Low-Level Programming: Thread-Local Storage: How Each Thread Gets Its Own Globals

2026-05-01

When you write __thread int hits; (a GCC/Clang extension) or thread_local int counter = 0; (C++11, or C11 via <threads.h>), each thread gets its own independent copy of that variable. But how? Every thread runs the same code and uses the same offset, yet each sees its own value. The trick is a dedicated register (a segment register on x86-64) that points to per-thread memory.
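The behavior is easy to observe before looking at the mechanism. The following is a minimal sketch using POSIX threads; the names report, worker, and run_two_workers are made up for illustration. Each worker increments what looks like one shared global, yet neither sees the other's updates.

```c
#include <pthread.h>
#include <stddef.h>

/* One logical variable; the runtime gives every thread its own instance. */
static _Thread_local int counter = 0;

/* Hypothetical per-thread report, filled in by each worker. */
struct report {
    int increments;   /* input: how many times to bump counter */
    int final_value;  /* output: this thread's counter when it finished */
};

static void *worker(void *arg)
{
    struct report *r = arg;
    for (int i = 0; i < r->increments; i++)
        counter++;                /* no lock: no other thread can see this copy */
    r->final_value = counter;
    return NULL;
}

/* Runs two workers concurrently.
   Returns 0 on success, -1 if a thread could not be started. */
static int run_two_workers(struct report *a, struct report *b)
{
    pthread_t ta, tb;
    if (pthread_create(&ta, NULL, worker, a) != 0)
        return -1;
    if (pthread_create(&tb, NULL, worker, b) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}
```

Called with increments of 1000 and 7, the two reports come back as 1000 and 7, and the calling thread's own counter is still 0. With a plain global and no lock, the result would be a data race.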

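The Mechanism on x86-64 Linux: The FS segment base points to each thread's Thread Control Block (TCB), with that thread's TLS data laid out next to it. For the initial thread, the dynamic loader or C runtime sets the base with arch_prctl(ARCH_SET_FS, addr). For every later thread, the threading library allocates the TLS block and TCB, then passes that address to clone() together with the CLONE_SETTLS flag, and the kernel installs it as the new thread's FS base. Every TLS access then compiles down to something like: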

mov eax, dword ptr fs:[variable_offset]

Same instruction in every thread, same offset — but FS points somewhere different, so each thread reads its own copy. On ARM64, the equivalent is the TPIDR_EL0 register, accessed via mrs x0, tpidr_el0.
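To make "same offset, different base" concrete, here is a rough sketch that reads the thread pointer directly and measures how far a TLS variable sits from it. It assumes Linux on x86-64 or AArch64 with GCC or Clang inline assembly, and a variable defined in the main executable; thread_pointer, probe, probe_offset, and store_offset are illustrative names, not part of any library.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the calling thread's thread pointer,
   or 0 on targets this sketch does not cover. */
static uintptr_t thread_pointer(void)
{
#if defined(__linux__) && defined(__x86_64__)
    uintptr_t tp;
    /* The x86-64 TLS ABI keeps a copy of the thread pointer at fs:0. */
    __asm__ volatile ("mov %%fs:0, %0" : "=r"(tp));
    return tp;
#elif defined(__linux__) && defined(__aarch64__)
    uintptr_t tp;
    __asm__ volatile ("mrs %0, tpidr_el0" : "=r"(tp));
    return tp;
#else
    return 0;
#endif
}

static _Thread_local int probe = 42;   /* hypothetical TLS variable */

/* Signed distance from the thread pointer to this thread's copy of probe. */
static intptr_t probe_offset(void)
{
    return (intptr_t)((uintptr_t)&probe - thread_pointer());
}

/* Thread entry point: stores the offset seen by the new thread into *out. */
static void *store_offset(void *out)
{
    *(intptr_t *)out = probe_offset();
    return NULL;
}
```

Running store_offset in a second thread via pthread_create() and comparing its result with probe_offset() in the main thread should give the same number, even though &probe itself differs between the two threads. On x86-64 the offset is negative because the TLS data sits below the thread pointer; on AArch64 it is positive.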

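The Four TLS Models: The compiler and linker choose an access model based on where the TLS variable is defined and on how the accessing code is built (as part of an executable, or as position-independent code for a shared library):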

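Local Exec: variable defined in the main executable and accessed from it, so its offset from the thread pointer is a link-time constant. Single FS-relative access. Fastest.
Initial Exec: variable defined in a shared library that is loaded at startup. The offset is fixed at load time and read from the GOT, then used FS-relative. Fast.
Local Dynamic: variable defined in the same module as the code that accesses it, where that module may be loaded with dlopen(). One call to __tls_get_addr() per function fetches the module's TLS base; each variable is then a constant offset from that base.
General Dynamic: the most general model, works everywhere. Calls __tls_get_addr() for each access unless the compiler or linker can optimize the call away. Slowest.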

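You can force a model with -ftls-model=initial-exec, but use this carefully: a library built this way needs room in the static TLS area, which is sized at program startup. If the library is dlopen()'d rather than linked at startup, loading can fail once the small reserve of spare static TLS space is used up (glibc reports "cannot allocate memory in static TLS block").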
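The flag applies to a whole translation unit. GCC and Clang also accept a per-variable tls_model attribute, which limits the risk to the variables that really need it. A minimal sketch, with default_hits, fast_hits, and record_hit as made-up names:

```c
/* Model left to the compiler: depends on -fPIC, visibility and -ftls-model. */
_Thread_local unsigned long default_hits;

/* Pinned to initial-exec for this variable only (GCC/Clang extension). */
_Thread_local unsigned long fast_hits __attribute__((tls_model("initial-exec")));

/* Bumps both per-thread counters and returns the new value of fast_hits. */
unsigned long record_hit(void)
{
    default_hits++;
    return ++fast_hits;
}
```

Compiling this with gcc -O2 -fPIC -S and reading the assembly should show a call to __tls_get_addr (or a TLS descriptor sequence) for default_hits and a GOT load plus thread-pointer-relative access for fast_hits. The dlopen() caveat still applies to any variable marked this way.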

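Real-world example: glibc's errno is TLS. (It is a macro that expands to *__errno_location(), which returns the address of the calling thread's copy.) System call wrappers store the error code in errno on failure without taking a lock, because each thread has its own copy. If errno were a plain global, multithreaded programs would be fundamentally broken: one thread's failed open() could overwrite errno between another thread's failed read() and that thread's check of the error code.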
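A small sketch makes the per-thread errno visible. A worker thread triggers a failing open() and records both the error code and the address of its errno; errno_probe, fail_to_open, and probe_worker_errno are illustrative names, and the path is assumed not to exist.

```c
#include <errno.h>
#include <fcntl.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

struct errno_probe {
    int *location;  /* address of errno as seen by the worker */
    int  value;     /* errno right after the worker's failed open() */
};

static void *fail_to_open(void *arg)
{
    struct errno_probe *p = arg;
    /* Hypothetical path, assumed not to exist on this system. */
    int fd = open("/nonexistent/tls-demo", O_RDONLY);
    p->value = (fd < 0) ? errno : 0;
    p->location = &errno;
    if (fd >= 0)
        close(fd);
    return NULL;
}

/* Runs the worker to completion.
   Returns 0 on success, or the pthread error code. */
static int probe_worker_errno(struct errno_probe *out)
{
    pthread_t t;
    int rc = pthread_create(&t, NULL, fail_to_open, out);
    if (rc != 0)
        return rc;
    return pthread_join(t, NULL);
}
```

After probe_worker_errno() returns, value holds ENOENT and location differs from &errno evaluated in the calling thread: the worker's failure was written to the worker's own copy.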

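Performance rule of thumb: Local Exec TLS access is a single instruction (a segment-relative load), essentially the same cost as a regular global. General Dynamic adds a call to __tls_get_addr() on each access, which costs on the order of nanoseconds to tens of nanoseconds depending on the CPU and on whether the fast path is taken. If you're accessing TLS in a hot loop from a dlopen()'d library, consider caching the pointer in a local variable. The pointer is valid only in the thread that took it, and only while that thread is alive: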

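int *p = &my_tls_var;            /* resolve the address once, in this thread */
for (size_t i = 0; i < n; i++)   /* n and data are placeholders */
    *p += data[i];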

Gotcha: Each thread's static TLS block is allocated during thread creation. If you dlopen() a library with TLS variables after threads already exist, the runtime must lazily allocate TLS blocks for those threads on first access. This is one reason __tls_get_addr() is non-trivial: its slow path may call malloc() internally.

Key Takeaway: Thread-local storage works by giving each thread a private memory region accessed through a dedicated CPU register (FS on x86-64, TPIDR_EL0 on ARM64), turning what looks like a global variable into per-thread state with zero locking overhead.
