2026-04-22
When the OS switches from one process to another, something mechanical and precise happens at the CPU level. This is the context switch — and understanding its internals explains why threads are cheaper than processes, why syscalls have overhead, and why certain real-time guarantees are hard to meet.
What gets saved and restored: A context switch must preserve the full architectural state of the outgoing task. On x86-64, this means saving at minimum: the general-purpose registers (RAX through R15), the instruction pointer (RIP), the stack pointer (RSP), the flags register (RFLAGS), the segment selectors, and, if the task has used them, the FPU/SIMD state (x87, SSE, AVX).
The kernel stores this state in a per-task structure. In Linux, this is the task_struct, and the architecture-specific register state lives in thread_struct embedded within it. The actual swap happens in architecture-specific assembly — on x86-64 Linux, look at __switch_to_asm in arch/x86/entry/entry_64.S. It pushes callee-saved registers onto the old kernel stack, switches the stack pointer to the new task's kernel stack, and pops the new task's registers.
Thread vs. process switch: When switching between threads in the same process, the kernel skips the expensive parts: it does not need to flush the TLB or swap page tables (CR3 on x86), because threads share an address space. A process switch must write a new value to CR3, which on older hardware implicitly flushes the entire TLB. Modern CPUs support PCID (Process Context Identifiers) — tagging TLB entries with an address-space ID so they survive a CR3 write. This is a significant optimization; TLB misses after a context switch can cost 100+ cycles each.
The real-world cost: A pure register-save/restore context switch takes roughly 1–2 microseconds on modern hardware. But the indirect cost dominates: cold caches. After switching, the new process finds its working set evicted from L1/L2. A rule of thumb: budget 5–10 microseconds of effective overhead per context switch when accounting for cache warm-up on a typical server workload. If your process has a 32 KB L1d working set, that is 512 cache lines at 64 bytes each. Refilling those at ~4 ns per L2 hit adds ~2 µs just for L1 warm-up.
Lazy FPU switching: Because saving/restoring AVX-512 state is expensive, Linux historically used lazy FPU restore — setting CR0.TS so the first FPU instruction traps, and only then restoring FPU state. Modern kernels have mostly moved to eager FPU switching using XSAVEOPT/XRSTOR, because the trap-based approach creates timing side channels (the basis of the LazyFP vulnerability, CVE-2018-3665).
