DMA Engines: How Hardware Moves Data Without Bothering the CPU

2026-04-26

Every cycle your CPU spends copying bytes from a network card to memory is a cycle it's not running your code. Direct Memory Access (DMA) solves this by giving peripherals their own path to main memory, supervised but not micromanaged by the processor.

The problem DMA solves. Without DMA, the CPU must execute a load/store loop for every word transferred, a scheme called Programmed I/O (PIO). Copying a 1500-byte Ethernet frame at an idealized one 32-bit word per cycle costs 375 cycles of pure data shuffling, plus loop overhead. A busy 10 GbE link delivers ~800,000 full-size frames per second. That's roughly 300 million cycles/sec burned on memcpy, about 10% of one 3 GHz core doing nothing but hauling bytes. With DMA, the CPU's cost per transfer shrinks to descriptor setup and completion handling, which is small and largely independent of frame size.
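
The arithmetic behind those figures can be checked with a short sketch. The function name pio_cost and its parameter names are made up for illustration; the defaults are the article's own numbers, including the idealized one-word-per-cycle copy rate.

```python
def pio_cost(frame_bytes=1500, word_bytes=4,
             frames_per_sec=800_000, core_hz=3_000_000_000):
    """Return (cycles per frame, cycles per second, fraction of one core)."""
    # Idealized assumption from the text: one word is copied per cycle.
    cycles_per_frame = (frame_bytes + word_bytes - 1) // word_bytes
    cycles_per_sec = cycles_per_frame * frames_per_sec
    return cycles_per_frame, cycles_per_sec, cycles_per_sec / core_hz


per_frame, per_sec, fraction = pio_cost()
print(per_frame, per_sec, f"{fraction:.0%}")  # prints: 375 300000000 10%
```

Real PIO against device registers is usually far slower than one word per cycle, so this is best read as a lower bound on the cost.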

How it works. The CPU sets up a descriptor — a small struct in memory containing the source address, destination address, transfer length, and control flags — then writes the descriptor's address to the DMA engine's control register. From that point, the DMA engine becomes a bus master: it independently issues read and write transactions on the memory bus (or PCIe fabric) to move data. When finished, it raises an interrupt or sets a completion flag.
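
The setup-then-hand-off flow described above can be sketched as a toy simulation. Everything here is hypothetical: DmaDescriptor, ToyDmaEngine, the flag values, and the field layout are invented for illustration and do not match any particular device. A bytearray stands in for main memory, and a method call stands in for the write to the control register.

```python
import ctypes

FLAG_IRQ_ON_DONE = 0x1  # hypothetical control flag: raise an interrupt when done
FLAG_DONE = 0x2         # hypothetical completion flag set by the engine


class DmaDescriptor(ctypes.Structure):
    """Small fixed-layout struct, as a device might read it from memory."""
    _fields_ = [
        ("src", ctypes.c_uint64),     # source address
        ("dst", ctypes.c_uint64),     # destination address
        ("length", ctypes.c_uint32),  # transfer length in bytes
        ("flags", ctypes.c_uint32),   # control and status bits
    ]


class ToyDmaEngine:
    def __init__(self, memory):
        self.memory = memory  # bytearray standing in for RAM
        self.interrupts = 0   # count of interrupts "raised"

    def start(self, desc):
        """Stands in for writing the descriptor's address to the control register."""
        src_end = desc.src + desc.length
        dst_end = desc.dst + desc.length
        if src_end > len(self.memory) or dst_end > len(self.memory):
            raise ValueError("descriptor points outside memory")
        # The engine moves the data itself; the CPU does no copying.
        self.memory[desc.dst:dst_end] = self.memory[desc.src:src_end]
        desc.flags |= FLAG_DONE
        if desc.flags & FLAG_IRQ_ON_DONE:
            self.interrupts += 1


ram = bytearray(64)
ram[0:5] = b"hello"
engine = ToyDmaEngine(ram)
desc = DmaDescriptor(src=0, dst=32, length=5, flags=FLAG_IRQ_ON_DONE)
engine.start(desc)
print(bytes(ram[32:37]), bool(desc.flags & FLAG_DONE), engine.interrupts)
```

A real engine runs asynchronously, so the CPU would go on to other work after the register write and learn of completion later; the toy version completes immediately for simplicity.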

Scatter-gather. Real workloads rarely move one contiguous block. Network stacks assemble packets from headers in one buffer and payloads in another. Modern DMA engines accept a chain of descriptors (a scatter-gather list), typically a linked list or a ring, and process each entry in sequence without CPU intervention. Linux's struct scatterlist is the kernel-side counterpart: drivers map it for DMA and translate its entries into the hardware's descriptor format.
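
As a rough illustration of the header-plus-payload case, the sketch below walks a linked chain of descriptors and gathers the segments into one contiguous destination. SgDescriptor and gather are hypothetical names, and the chain is a plain Python linked list rather than any real hardware format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SgDescriptor:
    src: int                               # where this segment starts
    length: int                            # segment length in bytes
    next: Optional["SgDescriptor"] = None  # following descriptor, if any


def gather(memory, head, dst):
    """Copy every segment in the chain to consecutive bytes at dst.

    Returns the total number of bytes moved.
    """
    offset = 0
    desc = head
    while desc is not None:
        src_end = desc.src + desc.length
        dst_start = dst + offset
        dst_end = dst_start + desc.length
        if src_end > len(memory) or dst_end > len(memory):
            raise ValueError("segment falls outside memory")
        memory[dst_start:dst_end] = memory[desc.src:src_end]
        offset += desc.length
        desc = desc.next
    return offset


ram = bytearray(64)
ram[0:4] = b"HDR:"       # header lives in one buffer
ram[16:23] = b"payload"  # payload lives in another
chain = SgDescriptor(src=0, length=4, next=SgDescriptor(src=16, length=7))
total = gather(ram, chain, dst=32)
print(total, bytes(ram[32:32 + total]))  # prints: 11 b'HDR:payload'
```

The CPU builds the chain once; the loop inside gather represents work the engine does on its own.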

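Cache coherence complications. On systems without hardware-coherent I/O, DMA writes land in main memory, not the CPU's caches. If the CPU holds a stale cached copy of that address, it reads old data. Two common remedies exist: hardware coherence, where the interconnect snoops or invalidates CPU cache lines on device writes (the norm on x86), and software cache maintenance, where the driver invalidates or flushes the affected cache lines around each transfer (as the Linux DMA API's dma_sync_single_for_cpu() and dma_sync_single_for_device() do on non-coherent platforms).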
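
The stale-read hazard, and the software fix of invalidating cached lines before the CPU reads a DMA-written buffer, can be sketched with a toy model. ToyCache is a hypothetical, deliberately simplified read cache with one entry per byte address; real caches work on multi-byte lines and are managed by hardware instructions or kernel helpers.

```python
class ToyCache:
    """Read-only cache in front of `memory`, one entry per address."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> cached byte value

    def read(self, addr):
        # On a miss, fill from memory; on a hit, return the cached copy
        # even if memory has changed underneath it.
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def invalidate(self, addr, length):
        # Drop cached copies so the next read goes back to memory.
        for a in range(addr, addr + length):
            self.lines.pop(a, None)


ram = bytearray(b"old!")
cache = ToyCache(ram)
warm = bytes(cache.read(a) for a in range(4))   # CPU reads, cache fills

ram[0:4] = b"new!"  # device DMA write goes straight to memory
stale = bytes(cache.read(a) for a in range(4))  # CPU still sees cached data

cache.invalidate(0, 4)                          # software cache maintenance
fresh = bytes(cache.read(a) for a in range(4))
print(warm, stale, fresh)  # prints: b'old!' b'old!' b'new!'
```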

IOMMU: DMA's safety net. A rogue or compromised device doing bus-master DMA can write anywhere in physical memory, a real security hole. The IOMMU (Intel VT-d, ARM SMMU) sits between devices and memory, translating device-visible addresses through page tables, much as the MMU does for the CPU's virtual addresses. This enables both isolation and the ability to present contiguous address ranges to devices from physically scattered pages.
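
Both properties, isolation and contiguous device addresses over scattered pages, show up in a minimal sketch of the translation step. ToyIommu, IommuFault, and the single-level dictionary page table are assumptions made for illustration; real IOMMUs use multi-level tables and per-device contexts.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class IommuFault(Exception):
    """Raised when a device touches an address it has no mapping for."""


class ToyIommu:
    def __init__(self):
        self.page_table = {}  # device-visible page number -> physical page number

    def map(self, iova_page, phys_page):
        self.page_table[iova_page] = phys_page

    def translate(self, iova):
        """Translate a device-visible address to a physical address."""
        page, offset = divmod(iova, PAGE_SIZE)
        if page not in self.page_table:
            raise IommuFault(f"no mapping for device address {iova:#x}")
        return self.page_table[page] * PAGE_SIZE + offset


iommu = ToyIommu()
# The device sees pages 0, 1, 2 as contiguous; physically they are scattered.
for iova_page, phys_page in enumerate([7, 3, 12]):
    iommu.map(iova_page, phys_page)

print(iommu.translate(0))          # prints: 28672  (page 7, offset 0)
print(iommu.translate(4096 + 16))  # prints: 12304  (page 3, offset 16)
try:
    iommu.translate(3 * PAGE_SIZE)  # unmapped: blocked instead of reaching RAM
except IommuFault as fault:
    print("fault:", fault)
```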

Rule of thumb: for transfers under ~64 bytes, PIO is often faster because descriptor setup and interrupt latency dominate; the exact crossover depends on the platform. Above that, DMA wins, and the CPU cycles it saves grow roughly linearly with transfer size.
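
A simple cost model shows where such a crossover comes from. The two constants below are placeholders, not measurements: 25 cycles per byte for PIO and 1600 cycles of fixed DMA overhead were picked only so that the break-even point lands at the ~64 bytes quoted above. The function names are likewise hypothetical.

```python
def pio_cycles(nbytes, cycles_per_byte=25):
    """CPU cycles for programmed I/O: grows with transfer size."""
    return nbytes * cycles_per_byte


def dma_cpu_cycles(nbytes, fixed_overhead=1600):
    """CPU cycles for DMA: descriptor setup plus completion handling only."""
    return fixed_overhead


def cycles_saved_by_dma(nbytes):
    """Positive when DMA is cheaper for the CPU, negative when PIO is."""
    return pio_cycles(nbytes) - dma_cpu_cycles(nbytes)


def break_even_bytes(cycles_per_byte=25, fixed_overhead=1600):
    """Transfer size at which both approaches cost the CPU the same."""
    return fixed_overhead // cycles_per_byte


print(break_even_bytes())  # prints: 64
for size in (16, 64, 256, 1500):
    print(size, cycles_saved_by_dma(size))
```

Below the break-even size the fixed overhead dominates and the savings are negative; above it, savings rise by a constant amount per extra byte, which is the linear growth the rule of thumb describes.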

Key Takeaway: DMA offloads bulk data movement from the CPU to dedicated hardware engines, but demands careful attention to cache coherence and memory safety via IOMMUs.
