2026-04-26
Every device hanging off your CPU — GPUs, NVMe drives, NICs, USB controllers — communicates through PCIe (Peripheral Component Interconnect Express). Understanding PCIe architecture explains why your NVMe is fast, why GPU slot placement matters, and why "bandwidth" isn't the whole story.
Lanes and Links. PCIe is a serial, point-to-point protocol. A single lane is one pair of differential signal wires in each direction (TX and RX) — full duplex from the start. Lanes are bundled into links: x1, x2, x4, x8, or x16. A GPU typically gets x16; an NVMe SSD gets x4. Each lane in PCIe 4.0 delivers roughly 2 GB/s in each direction (16 GT/s with 128b/130b encoding), so a x16 Gen4 link provides about 32 GB/s each way. Quick rule of thumb: PCIe Gen N doubles the per-lane bandwidth of Gen N-1. Gen3 = ~1 GB/s/lane, Gen4 = ~2, Gen5 = ~4, Gen6 = ~8.
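The per-lane numbers above fall straight out of the transfer rate and the line encoding, and it is worth seeing the arithmetic once. The sketch below is a rough calculator, not a measurement tool: the function names are made up for this post, it covers Gen1 through Gen5 only (Gen6 changes the signaling and framing, so the same formula is not applied to it here), and it ignores packet header overhead.

```python
# Raw PCIe bandwidth per direction, from transfer rate and line encoding.
# Illustrative helper names; packet (TLP/DLLP) overhead is NOT modeled.

# generation -> (GT/s per lane, payload bits, total bits on the wire)
GENERATIONS = {
    1: (2.5, 8, 10),      # 8b/10b encoding
    2: (5.0, 8, 10),      # 8b/10b encoding
    3: (8.0, 128, 130),   # 128b/130b encoding
    4: (16.0, 128, 130),
    5: (32.0, 128, 130),
}


def lane_bandwidth_gbps(gen):
    """GB/s for one lane, one direction, before protocol overhead."""
    gt_per_s, payload_bits, total_bits = GENERATIONS[gen]
    # Each transfer carries one bit; scale by encoding efficiency, then bits -> bytes.
    return gt_per_s * payload_bits / total_bits / 8


def link_bandwidth_gbps(gen, lanes):
    """GB/s for a whole link (x1, x4, x16, ...), one direction."""
    return lane_bandwidth_gbps(gen) * lanes


if __name__ == "__main__":
    for gen in sorted(GENERATIONS):
        print(f"Gen{gen}: {lane_bandwidth_gbps(gen):.3f} GB/s per lane, "
              f"{link_bandwidth_gbps(gen, 16):.1f} GB/s at x16")
```

Running it for Gen4 gives a little under 2 GB/s per lane and about 31.5 GB/s for x16, which is where the rounded "32 GB/s" figure comes from.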
The Transaction Layer. PCIe uses a layered protocol: Physical → Data Link → Transaction. The transaction layer is where the interesting architecture lives. Devices communicate via Transaction Layer Packets (TLPs) — memory reads, memory writes, completions, and messages. A memory-mapped write from the CPU becomes a Posted TLP that flows directly to the device with no acknowledgment needed at the transaction layer. A memory read, however, is a Non-Posted transaction: the CPU sends a request TLP, then waits for a Completion TLP carrying the data. This asymmetry matters enormously for performance.
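To make the posted versus non-posted split concrete, here is a small lookup table of common TLP types and the category each falls into. The short type names follow the usual spec-style abbreviations, but the table and the helper function are just an illustration for this post and leave out less common types such as atomic operations.

```python
from enum import Enum


class TlpCategory(Enum):
    POSTED = "posted"            # requester does not wait for a reply
    NON_POSTED = "non-posted"    # requester expects a Completion TLP
    COMPLETION = "completion"    # the reply to a non-posted request


# Common TLP types only; not an exhaustive list.
TLP_CATEGORY = {
    "MemWr": TlpCategory.POSTED,       # memory write
    "Msg": TlpCategory.POSTED,         # message
    "MemRd": TlpCategory.NON_POSTED,   # memory read
    "IORd": TlpCategory.NON_POSTED,    # I/O read
    "IOWr": TlpCategory.NON_POSTED,    # I/O write
    "CfgRd": TlpCategory.NON_POSTED,   # configuration read
    "CfgWr": TlpCategory.NON_POSTED,   # configuration write
    "Cpl": TlpCategory.COMPLETION,     # completion without data
    "CplD": TlpCategory.COMPLETION,    # completion with data
}


def requester_waits_for_completion(tlp_type):
    """True if the sender of this TLP must wait for a Completion to come back."""
    return TLP_CATEGORY[tlp_type] is TlpCategory.NON_POSTED


if __name__ == "__main__":
    for name in ("MemWr", "MemRd"):
        print(name, "waits:", requester_waits_for_completion(name))
```

The takeaway is the one from the paragraph above: memory writes are fire-and-forget at this layer, while memory reads always cost a round trip.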
Why Latency Hurts. A single PCIe round-trip (read) on a typical desktop system takes 500 ns–1 μs. Compare that to an L3 cache hit at ~10 ns. If your driver does many small MMIO reads from a device register, each one stalls for that full round trip. This is why high-performance drivers use DMA and MSI-X interrupts instead of polling device registers — you let the device push data into host memory (posted writes, no round-trip penalty) and signal completion via an interrupt.
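A back-of-the-envelope calculation shows how quickly those stalls add up. The sketch below assumes a 750 ns round trip (the midpoint of the range above) and treats every MMIO read as a full stall on one core; the function name and the example workload numbers are invented for illustration.

```python
def stall_fraction(reads_per_io, iops, round_trip_ns=750.0):
    """Fraction of one core's time spent waiting on MMIO reads.

    Assumes each read blocks the core for one full PCIe round trip.
    A result above 1.0 means a single core cannot sustain that rate.
    """
    return reads_per_io * iops * round_trip_ns * 1e-9


if __name__ == "__main__":
    # Hypothetical driver designs at 100k and 500k I/O operations per second.
    for reads in (0, 1, 4):
        for iops in (100_000, 500_000):
            pct = 100 * stall_fraction(reads, iops)
            print(f"{reads} reads/IO at {iops} IOPS: {pct:.0f}% of a core stalled")
```

With four register reads per I/O, 100k IOPS already burns about 30% of a core on waiting alone, and 500k IOPS is out of reach for a single core. With zero reads per I/O the stall cost disappears, which is the whole point of pushing data by DMA.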
Real-world example: An NVMe SSD on a PCIe 4.0 x4 link has ~8 GB/s of raw bandwidth in each direction. To submit I/O, the driver writes a command into a submission queue in host memory and then rings the queue's doorbell register with a single posted MMIO write. The drive fetches the command and writes the completion back into host memory by DMA, so the CPU never has to wait on a read from the device. The entire design minimizes round trips. This is a large part of why NVMe replaced AHCI/SATA: AHCI required multiple register reads per I/O operation, each paying that 500+ ns PCIe latency tax.
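The doorbell pattern is easier to appreciate with a toy model that counts what the CPU actually does on the bus. Everything below is a simplification invented for this post (class names, a list standing in for each queue, a `device_step` method playing the drive); it is not how a real NVMe driver is structured, and it only tracks CPU-initiated register accesses.

```python
from dataclasses import dataclass, field


@dataclass
class CpuMmioCounters:
    """Counts only CPU-initiated accesses to device registers."""
    posted_writes: int = 0      # fire-and-forget, CPU does not wait
    non_posted_reads: int = 0   # CPU would stall for a full round trip


@dataclass
class ToyQueuePair:
    depth: int = 8
    mmio: CpuMmioCounters = field(default_factory=CpuMmioCounters)
    sq: list = field(default_factory=list)   # submission queue, lives in host memory
    cq: list = field(default_factory=list)   # completion queue, lives in host memory

    def submit(self, command):
        if len(self.sq) >= self.depth:
            raise RuntimeError("submission queue full")
        self.sq.append(command)         # ordinary store to host memory
        self.mmio.posted_writes += 1    # ring the submission doorbell

    def device_step(self):
        """Stand-in for the drive: fetch commands, DMA completions to host memory."""
        while self.sq:
            self.cq.append(("ok", self.sq.pop(0)))

    def reap(self):
        """Host reads completions from its own memory, then updates the CQ doorbell."""
        done, self.cq = self.cq, []
        if done:
            self.mmio.posted_writes += 1    # one doorbell write for the whole batch
        return done


if __name__ == "__main__":
    qp = ToyQueuePair()
    for lba in range(4):
        qp.submit({"opcode": "read", "lba": lba})
    qp.device_step()
    print(len(qp.reap()), "completions;", qp.mmio)
```

Four commands cost five posted writes and zero register reads from the CPU's side. A real driver can do even better by batching several submissions behind one doorbell write.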
Root Complex and Switch Topology. The CPU's PCIe root complex is the top of a tree. PCIe switches fan out connections, but every hop through a switch adds latency (~100–200 ns). This is why plugging a GPU into a slot that routes through a chipset switch (e.g., Intel's PCH) rather than directly into CPU lanes can measurably hurt performance in latency-sensitive workloads.
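To get a feel for what a hop costs, here is a rough estimate of read latency as a function of tree depth. It assumes the per-hop figure applies to each crossing and that a read crosses every switch twice (request on the way down, completion on the way back); both are modeling assumptions for this sketch, and the function name and default values are illustrative.

```python
def read_round_trip_ns(base_ns, switch_hops, per_hop_ns=150.0, crossings_per_hop=2):
    """Estimated read round trip for a device behind `switch_hops` switches.

    base_ns: round trip for a device attached directly to CPU lanes.
    per_hop_ns: added latency per switch crossing (midpoint of 100-200 ns).
    crossings_per_hop: 2 assumes the request and the completion each cross once.
    """
    return base_ns + switch_hops * per_hop_ns * crossings_per_hop


if __name__ == "__main__":
    for hops in range(3):
        print(f"{hops} switch hop(s): ~{read_round_trip_ns(500.0, hops):.0f} ns")
```

Under these assumptions a single switch turns a 500 ns read into roughly 800 ns. Posted writes only cross each switch once and nobody waits on them, so the penalty falls mostly on reads.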
