Daily Low-Level Programming: The Task State Segment in Long Mode: Why x86-64 Still Needs a 1980s Data Structure

2026-05-31

The Task State Segment (TSS) dates back to the 80286 (1982) and took its 32-bit form with the 80386 in 1985. It was designed for hardware task switching: on a task switch the CPU would save all registers into the outgoing task's TSS, load the new TSS selector, and restore the new task's state. Long mode threw all of that away. Hardware task switching doesn't exist in 64-bit mode. Yet x86-64 still requires the kernel to set up a valid TSS and load it into the task register, and getting it wrong typically means a triple fault on the first interrupt that needs a stack switch.

What survived is the part the kernel actually needs: stack pointers for privilege transitions. The 64-bit TSS is 104 bytes containing three things that matter:

RSP0, RSP1, RSP2: the stack pointers the CPU loads when an interrupt or exception raises the privilege level to ring 0, 1, or 2 respectively. In practice only RSP0 matters, since mainstream kernels run in ring 0 and userspace in ring 3. On an interrupt or exception from userspace (including a legacy int 0x80 system call), the CPU reads RSP0 from the TSS and switches to it before pushing the trap frame.
IST1–IST7: the Interrupt Stack Table. Seven "known good" stack pointers that specific interrupt vectors can be configured to use unconditionally, regardless of current privilege. The IDT entry for a vector contains a 3-bit IST index; if non-zero, the CPU loads that IST entry as RSP instead of using RSP0 or staying on the current stack.
I/O permission bitmap offset: a 16-bit offset from the TSS base to the I/O permission bitmap. Largely vestigial and rarely used.
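
To make the layout concrete, here is a minimal sketch of the 104-byte structure in C. The struct name, field names, and the tss64_init helper are this sketch's own (not taken from any particular kernel), and the packed attribute assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-bit TSS layout. Names are illustrative, not from a specific kernel. */
struct tss64 {
    uint32_t reserved0;
    uint64_t rsp[3];      /* RSP0..RSP2: stacks for transitions to ring 0..2 */
    uint64_t reserved1;
    uint64_t ist[7];      /* IST1..IST7 (ist[0] holds IST1) */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t iomap_base;  /* offset of the I/O permission bitmap from the TSS base */
} __attribute__((packed));

/* Compile-time checks that the layout matches the hardware format. */
_Static_assert(sizeof(struct tss64) == 104, "64-bit TSS must be 104 bytes");
_Static_assert(offsetof(struct tss64, rsp) == 4, "RSP0 lives at byte 4");
_Static_assert(offsetof(struct tss64, ist) == 36, "IST1 lives at byte 36");
_Static_assert(offsetof(struct tss64, iomap_base) == 102, "I/O map base at byte 102");

/* Hypothetical helper: zero the TSS, set the ring-0 stack, and point the
 * I/O bitmap offset past the end of the structure so no bitmap is present. */
static inline void tss64_init(struct tss64 *tss, uint64_t kernel_stack_top)
{
    memset(tss, 0, sizeof *tss);
    tss->rsp[0] = kernel_stack_top;
    tss->iomap_base = (uint16_t)sizeof *tss;
}
```

The static assertions are the useful part: a TSS with a misplaced field still loads without complaint, and the mistake only surfaces when the CPU reads a garbage stack pointer.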

The IST is the critical feature. Consider a double fault (#DF, vector 8). If the kernel stack is corrupted or unmapped, delivering the double fault on that same stack would fault again while pushing the trap frame. That is a triple fault, which resets the CPU. Linux assigns #DF its own IST entry pointing to a dedicated stack, so the double fault handler runs on guaranteed-good memory.
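
As a sketch of how a vector gets tied to an IST entry, the following encodes a 16-byte long-mode interrupt gate with the 3-bit IST index. The struct name, the make_interrupt_gate helper, and the handler address, selector, and IST slot used in the usage note are all hypothetical; only the bit layout is architectural. The packed attribute assumes GCC or Clang.

```c
#include <stdint.h>

/* Long-mode IDT gate descriptor (16 bytes). Names are illustrative. */
struct idt_gate64 {
    uint16_t offset_low;   /* handler address bits 15:0 */
    uint16_t selector;     /* code segment selector */
    uint8_t  ist;          /* bits 2:0 = IST index, 0 means "no IST" */
    uint8_t  type_attr;    /* present bit, DPL, gate type */
    uint16_t offset_mid;   /* handler address bits 31:16 */
    uint32_t offset_high;  /* handler address bits 63:32 */
    uint32_t reserved;
} __attribute__((packed));

_Static_assert(sizeof(struct idt_gate64) == 16, "IDT gate must be 16 bytes");

/* Build a present, DPL-0 interrupt gate (type 0xE) that uses the given
 * IST slot. ist_index 1..7 selects IST1..IST7; 0 disables the IST. */
static inline struct idt_gate64 make_interrupt_gate(uint64_t handler,
                                                    uint16_t cs,
                                                    unsigned ist_index)
{
    struct idt_gate64 g = {0};
    g.offset_low  = (uint16_t)(handler & 0xFFFF);
    g.selector    = cs;
    g.ist         = (uint8_t)(ist_index & 0x7);
    g.type_attr   = 0x8E;  /* P=1, DPL=0, 64-bit interrupt gate */
    g.offset_mid  = (uint16_t)((handler >> 16) & 0xFFFF);
    g.offset_high = (uint32_t)(handler >> 32);
    return g;
}
```

A kernel following this scheme would install something like make_interrupt_gate(df_handler_addr, kernel_cs, 1) at vector 8, after storing the top of a dedicated stack in IST1 of the TSS.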

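The same applies to NMI (non-maskable interrupt) and #MC (machine check). These can arrive at any instruction boundary, including the window right after a SYSCALL instruction, when the CPU is already in ring 0 but RSP still holds the user-controlled value because the entry code has not yet switched stacks. No privilege change happens at that point, so the CPU would not consult RSP0. Without IST, an NMI in that window would push its frame onto whatever address userspace left in RSP.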
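
The stack-selection rule can be modeled in a few lines of C. This is a simplified software model for illustration, not how any kernel or CPU is implemented; the struct and function names are invented here, and it ignores the 16-byte alignment the CPU applies to RSP before pushing the frame.

```c
#include <stdint.h>

/* Only the TSS fields that influence stack selection. Illustrative names. */
struct tss_stacks {
    uint64_t rsp[3];  /* RSP0..RSP2 */
    uint64_t ist[7];  /* IST1..IST7 */
};

/* Model of which stack pointer long-mode interrupt delivery starts from.
 * ist_index comes from the IDT gate (0 = none), old_cpl is the privilege
 * level that was interrupted, new_cpl is the handler's privilege level. */
static inline uint64_t select_interrupt_stack(const struct tss_stacks *tss,
                                              unsigned ist_index,
                                              unsigned old_cpl,
                                              unsigned new_cpl,
                                              uint64_t current_rsp)
{
    ist_index &= 0x7;
    if (ist_index != 0)
        return tss->ist[ist_index - 1];  /* IST wins unconditionally */
    if (new_cpl < old_cpl)
        return tss->rsp[new_cpl];        /* privilege raised: use RSPn */
    return current_rsp;                  /* same privilege: stay put */
}
```

Feeding the model the SYSCALL-window case (old_cpl and new_cpl both 0, current_rsp still a userspace address) returns the userspace address when ist_index is 0 and the dedicated stack when it is non-zero, which is the hazard the paragraph above describes.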

Rule of thumb: Linux uses at least 4 IST stacks per CPU (DF, NMI, MCE, DEBUG). Exact stack sizes depend on kernel version and configuration; taking 16 KB per stack as a round figure, a 256-core machine reserves 256 × 4 × 16 KB = 16 MB of always-resident kernel memory just for "if everything else breaks, we still have a stack."
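
The same estimate as a tiny helper, handy for plugging in other core counts or stack sizes. The function name is made up for this sketch, and the inputs are the article's round figures rather than values read from a kernel.

```c
#include <stdint.h>

/* Total bytes reserved for IST stacks across all CPUs. */
static inline uint64_t ist_stack_bytes(uint64_t cpus,
                                       uint64_t stacks_per_cpu,
                                       uint64_t stack_kib)
{
    return cpus * stacks_per_cpu * stack_kib * 1024;
}
```

With the figures above, ist_stack_bytes(256, 4, 16) comes to 16,777,216 bytes, that is, 16 MB.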

Real-world example: The Meltdown KPTI mitigation complicated this. With page-table isolation, the regular per-task kernel stack isn't mapped in the userspace page table, yet the CPU loads RSP0 from the TSS and pushes the trap frame before the entry code can switch CR3. So the TSS itself, and the small trampoline stack that RSP0 points to, must live in the part of kernel memory that stays mapped in the user page table. That's why cpu_entry_area exists (see arch/x86/mm/cpu_entry_area.c): a per-CPU, page-aligned region containing the TSS and trampoline stacks, mapped in both page tables.

Key Takeaway: The TSS in long mode is no longer about task switching. Its remaining job is to tell the CPU which stack to use on privilege transitions and which "safe" stacks to use for interrupts whose handlers must always have a valid stack.
