2026-04-22
When the OS switches from one process to another, something mechanical and precise happens at the CPU level. This is the context switch — and understanding its internals explains why threads are cheaper than processes, why syscalls have overhead, and why certain real-time guarantees are hard to meet.
What gets saved and restored: A context switch must preserve the full architectural state of the outgoing task. On x86-64, this means saving at minimum: the general-purpose registers (RAX through R15), the instruction pointer (RIP), the stack pointer (RSP), the flags register (RFLAGS), the segment selectors, and, if the task has used them, the FPU/SIMD state (x87, SSE, AVX).
The kernel stores this state in a per-task structure. In Linux, this is the task_struct, and the architecture-specific register state lives in thread_struct embedded within it. The actual swap happens in architecture-specific assembly — on x86-64 Linux, look at __switch_to_asm in arch/x86/entry/entry_64.S. It pushes callee-saved registers onto the old kernel stack, switches the stack pointer to the new task's kernel stack, and pops the new task's registers.
Thread vs. process switch: When switching between threads in the same process, the kernel skips the expensive parts: it does not need to flush the TLB or swap page tables (CR3 on x86), because threads share an address space. A process switch must write a new value to CR3, which on older hardware implicitly flushes the entire TLB. Modern CPUs support PCID (Process Context Identifiers) — tagging TLB entries with an address-space ID so they survive a CR3 write. This is a significant optimization; TLB misses after a context switch can cost 100+ cycles each.
The real-world cost: A pure register-save/restore context switch takes roughly 1–2 microseconds on modern hardware. But the indirect cost dominates: cold caches. After switching, the new process finds its working set evicted from L1/L2. A rule of thumb: budget 5–10 microseconds of effective overhead per context switch when accounting for cache warm-up on a typical server workload. If your process has a 32 KB L1d working set, that is 512 cache lines at 64 bytes each. Refilling those at ~4 ns per L2 hit adds ~2 µs just for L1 warm-up.
Lazy FPU switching: Because saving/restoring AVX-512 state is expensive, Linux historically used lazy FPU restore — setting CR0.TS so the first FPU instruction traps, and only then restoring FPU state. Modern kernels have mostly moved to eager FPU switching using XSAVEOPT/XRSTOR, because the trap-based approach creates timing side channels (the basis of the LazyFP vulnerability, CVE-2018-3665).
