2026-05-02
Every time you spin up a cloud VM, a WSL 2 distro, or Docker Desktop on macOS or Windows, you're relying on hardware virtualization extensions baked into your CPU. (Plain Docker containers on Linux share the host kernel and don't need them.) Before these extensions existed, virtualization on x86 was a painful software trick. Let's look at what the silicon actually does.
The problem: x86 wasn't virtualizable. In 1974, Popek and Goldberg formalized that a CPU is virtualizable if all sensitive instructions trap when executed in unprivileged mode. x86 failed this test: 17 instructions behaved differently in user mode instead of trapping. For example, POPF silently ignores changes to the interrupt flag when run in Ring 3. A guest kernel deprivileged to Ring 1 or Ring 3 would execute such an instruction, get the wrong result, and the hypervisor would never get a chance to step in. VMware's original solution was binary translation: scanning guest code and rewriting these instructions at runtime. It worked, but added ~15-25% overhead on privileged code paths.
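The trap condition above boils down to a subset check, sketched here in Python. The instruction sets are a small illustrative slice chosen for this example, not the full list of 17, and the function names are made up for the sketch.
```python
from __future__ import annotations


def is_virtualizable(sensitive: set[str], privileged: set[str]) -> bool:
    """Popek-Goldberg condition: every sensitive instruction must also be
    privileged, i.e. it must trap when executed in unprivileged mode."""
    return sensitive <= privileged


def non_trapping(sensitive: set[str], privileged: set[str]) -> list[str]:
    """Sensitive instructions that run in user mode without trapping."""
    return sorted(sensitive - privileged)


# Illustrative slice of pre-VT-x x86 (an assumption for this sketch).
# These fault when executed in Ring 3:
X86_PRIVILEGED = {"LGDT", "LIDT", "HLT", "MOV_TO_CR3"}
# These touch or reveal privileged state; POPF, SGDT and SIDT do not fault:
X86_SENSITIVE = X86_PRIVILEGED | {"POPF", "SGDT", "SIDT"}

print(is_virtualizable(X86_SENSITIVE, X86_PRIVILEGED))  # False
print(non_trapping(X86_SENSITIVE, X86_PRIVILEGED))      # ['POPF', 'SGDT', 'SIDT']
```
Anything left in the second list is an instruction a trap-and-emulate hypervisor never sees, which is why binary translation had to find it ahead of time.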
Intel VT-x (2005) and AMD SVM (2006) fixed this in hardware. Both added a new, more privileged mode that sits beneath Ring 0 (often nicknamed Ring -1). Intel calls it VMX root mode; AMD calls it host mode. The hypervisor runs here. Guest OSes run in Ring 0 as usual, but inside a non-root (guest) context. When the guest executes a sensitive instruction that the hypervisor has chosen to intercept, the CPU triggers a VM exit: it saves key guest state into a memory structure (Intel's VMCS or AMD's VMCB), loads the hypervisor's state, and transfers control. The hypervisor saves whatever the hardware doesn't (most general-purpose registers), handles the event, then executes VMRESUME (Intel) or VMRUN (AMD) to re-enter the guest.
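The enter, exit, handle, resume cycle can be sketched as a toy simulation. This is a minimal model, not real hypervisor code: the ExitReason values and the run_guest function are hypothetical names for this example, and the "guest" is just a list of events it would trigger.
```python
from enum import Enum, auto


class ExitReason(Enum):
    # A hypothetical subset of exit reasons, for illustration only.
    IO_INSTRUCTION = auto()
    MSR_READ = auto()
    HLT = auto()


def run_guest(events):
    """Simulate a hypervisor run loop.

    `events` is the sequence of intercepted events the guest triggers.
    Returns a log of what the hypervisor did.
    """
    log = []
    for reason in events:
        # VM entry happened; the guest ran until it hit an intercepted event.
        # VM exit: the CPU saved guest state, loaded host state, and reported why.
        log.append(f"exit:{reason.name}")
        if reason is ExitReason.HLT:
            # In this toy model a HLT stops the guest for good.
            log.append("guest halted")
            break
        # The hypervisor emulates the instruction on the guest's behalf...
        log.append(f"emulate:{reason.name}")
        # ...then re-enters the guest (VMRESUME on Intel, VMRUN on AMD).
        log.append("resume")
    return log


for step in run_guest([ExitReason.IO_INSTRUCTION, ExitReason.MSR_READ, ExitReason.HLT]):
    print(step)
```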
The VMCS (Virtual Machine Control Structure) is the key data structure: a region of up to 4 KB containing guest register state, control fields that specify which events cause exits, and host state for the hypervisor to resume into. You configure it to trap on I/O port access, MSR reads, control register writes, or specific interrupts. This is how the hypervisor controls what the guest can and can't touch without binary translation.
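To make the three areas concrete, here is a toy model of a VMCS-like structure. It is a sketch under loud assumptions: ToyVmcs and its field names are invented for this example, and real hardware encodes the I/O and MSR choices as bitmaps rather than Python sets.
```python
from dataclasses import dataclass, field


@dataclass
class ToyVmcs:
    """Hypothetical stand-in for a VMCS; not the real layout."""
    guest_state: dict = field(default_factory=dict)  # e.g. RIP, RSP, CR3
    host_state: dict = field(default_factory=dict)   # where the hypervisor resumes
    # Control fields: which guest actions cause a VM exit.
    exiting_io_ports: set = field(default_factory=set)
    exiting_msrs: set = field(default_factory=set)
    exit_on_cr3_write: bool = False

    def causes_exit(self, event: str, operand: int = 0) -> bool:
        """Decide whether a guest action is intercepted."""
        if event == "io":
            return operand in self.exiting_io_ports
        if event == "rdmsr":
            return operand in self.exiting_msrs
        if event == "mov_to_cr3":
            return self.exit_on_cr3_write
        return False


# Intercept the COM1 serial port (0x3F8) and reads of IA32_APIC_BASE (MSR 0x1B).
vmcs = ToyVmcs(exiting_io_ports={0x3F8}, exiting_msrs={0x1B})
print(vmcs.causes_exit("io", 0x3F8))   # True: the hypervisor emulates the UART
print(vmcs.causes_exit("io", 0x80))    # False: the guest touches the port directly
print(vmcs.causes_exit("mov_to_cr3"))  # False: no exit on CR3 writes here
```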
Nested page tables (EPT/NPT) added a second critical piece. Without them, every guest page table update required a VM exit so the hypervisor could maintain shadow page tables, which meant thousands of exits per second during boot. EPT (Intel) and NPT (AMD) add a second level of address translation in hardware: guest-virtual → guest-physical → host-physical. The hardware page walker handles both levels and the TLB caches the combined result. This eliminated shadow page tables and cut VM exit rates dramatically.
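The two-stage translation can be sketched with two lookup tables. This is a deliberately flattened model: real page tables are multi-level trees, and the table contents and function names below are assumptions made for the example.
```python
PAGE_SIZE = 4096


def translate(addr: int, table: dict) -> int:
    """Map an address through a flat {page number: frame number} table."""
    page, offset = divmod(addr, PAGE_SIZE)
    if page not in table:
        # On real hardware this is a page fault (guest table)
        # or an EPT violation / nested page fault (second-level table).
        raise KeyError(f"no mapping for page {page:#x}")
    return table[page] * PAGE_SIZE + offset


def guest_virtual_to_host_physical(gva: int, guest_page_table: dict, ept: dict) -> int:
    gpa = translate(gva, guest_page_table)  # stage 1: owned by the guest OS
    return translate(gpa, ept)              # stage 2: owned by the hypervisor


guest_page_table = {0x10: 0x2}  # guest-virtual page 0x10 -> guest-physical page 0x2
ept = {0x2: 0x80}               # guest-physical page 0x2 -> host-physical page 0x80

print(hex(guest_virtual_to_host_physical(0x10123, guest_page_table, ept)))  # 0x80123
```
The guest can rewrite guest_page_table as often as it likes without the hypervisor noticing; only a missing entry in the second table needs the hypervisor's attention.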
Rule of thumb: a VM exit costs roughly 500–1500 cycles on modern hardware (save state, switch context, restore state), before counting the work the handler itself does. If your workload triggers 100,000 exits/second on a 4 GHz core, that's ~50–150 million cycles lost per second, or about 1.25–3.75% of the core spent on transitions alone. This is why passthrough I/O (VT-d/SR-IOV) matters for network-heavy VMs: the guest talks to the device directly, so the hypervisor no longer has to take an exit to emulate device I/O for every packet.
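The arithmetic behind that rule of thumb, as a small helper you can plug your own numbers into. The function name is made up for this sketch, and the model only counts transition cycles, not time spent in the exit handler.
```python
def exit_overhead(exits_per_sec: float, cycles_per_exit: float, core_hz: float) -> float:
    """Fraction of one core's cycles spent on VM exit/entry transitions."""
    return exits_per_sec * cycles_per_exit / core_hz


CORE_HZ = 4_000_000_000  # 4 GHz core, as in the text
low = exit_overhead(100_000, 500, CORE_HZ)
high = exit_overhead(100_000, 1500, CORE_HZ)

print(f"{low:.2%} to {high:.2%}")  # 1.25% to 3.75%
```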
Real-world example: KVM on Linux uses VT-x directly. When you launch a QEMU/KVM guest on an Intel host, the KVM_RUN ioctl enters the guest with VMLAUNCH the first time and VMRESUME on later entries. Every time the guest hits an intercepted event, the CPU exits to KVM's handler in the host kernel. Run perf kvm stat live on a Linux host and you'll see exactly which exit reasons dominate, typically EPT violations, I/O instructions, and MSR accesses.
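Before any of that can work, the host CPU has to advertise the extension. On x86 Linux it shows up as vmx (Intel) or svm (AMD) in the flags line of /proc/cpuinfo. A minimal sketch of that check follows; the function name and sample text are made up for this example, and a missing flag can also mean the feature is disabled in firmware or that you are inside a VM without nested virtualization.
```python
import os


def virt_extension(cpuinfo_text: str):
    """Return which hardware virtualization extension the first CPU reports."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags = set(value.split())
            if "vmx" in flags:
                return "Intel VT-x"
            if "svm" in flags:
                return "AMD SVM"
            return None
    return None


# Hypothetical excerpt in the shape of /proc/cpuinfo, for illustration.
SAMPLE = "processor\t: 0\nflags\t\t: fpu vme de pse msr vmx ept\n"
print(virt_extension(SAMPLE))  # Intel VT-x

# On a real Linux host, check the actual file.
if os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        print(virt_extension(f.read()))
```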
