DMA and Memory-Mapped I/O

2026-04-26

When a device needs to transfer data, there are three strategies: programmed I/O (the CPU manually reads/writes each byte), memory-mapped I/O (device registers appear as memory addresses), and DMA (the device reads/writes main memory directly; the CPU only sets up the transfer and handles its completion). Understanding all three is essential for driver work and performance reasoning.

Programmed I/O (PIO) uses dedicated port instructions. On x86, the in and out instructions access a separate I/O address space of 64K (65,536) ports. This is how legacy devices like the 8259 PIC and PS/2 keyboard controller work. The CPU is blocked for every byte transferred, which is fine for a keyboard and catastrophic for a disk.
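Sketch: the per-byte cost of PIO is easiest to see in a toy model. The code below is a user-space simulation, not real port I/O: fake_pio_dev and fake_inb are hypothetical stand-ins for a device and the x86 in instruction, and the port_reads counter exists only to make the cost visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device: each read of its data port yields the next byte. */
struct fake_pio_dev {
    const uint8_t *data;   /* bytes the device has ready */
    size_t len;            /* how many bytes are available */
    size_t pos;            /* next byte to hand out */
    size_t port_reads;     /* port accesses issued by the CPU so far */
};

/* Stand-in for the x86 `in` instruction on the device's data port. */
static uint8_t fake_inb(struct fake_pio_dev *dev)
{
    assert(dev->pos < dev->len);
    dev->port_reads++;
    return dev->data[dev->pos++];
}

/* PIO transfer: the CPU itself issues one port read per byte moved. */
static size_t pio_read(struct fake_pio_dev *dev, uint8_t *dst, size_t n)
{
    size_t i;
    for (i = 0; i < n && dev->pos < dev->len; i++)
        dst[i] = fake_inb(dev);
    return i;   /* number of bytes actually transferred */
}
```

After pio_read() returns, port_reads equals the number of bytes moved: the CPU's work grows linearly with the transfer size.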

Memory-Mapped I/O (MMIO) maps device registers into the physical address space. You read and write them like normal memory, but the bus routes those accesses to the device instead of DRAM. A framebuffer is the classic example: writing to physical address 0xB8000 (segment 0xB800 in real mode) puts characters on the VGA text display. In modern systems, PCIe BARs (Base Address Registers) define MMIO windows. The kernel maps these into virtual address space with ioremap(), and the driver accesses them with readl()/writel() rather than raw pointer dereferences, because those helpers enforce the correct memory ordering and prevent compiler reordering.
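Sketch: for the VGA example, the only arithmetic involved is finding the right cell. The code below assumes the standard 80x25 colour text mode, where each cell is two bytes (character code in the low byte, attribute in the high byte). The function names are made up for illustration, and the code only computes addresses and values; it does not touch hardware.

```c
#include <assert.h>
#include <stdint.h>

#define VGA_TEXT_BASE 0xB8000u  /* physical address of the text buffer */
#define VGA_COLS 80u            /* assumption: 80x25 text mode */
#define VGA_ROWS 25u

/* Physical address of the 16-bit cell at (row, col). */
static uint32_t vga_cell_addr(unsigned row, unsigned col)
{
    assert(row < VGA_ROWS && col < VGA_COLS);
    return VGA_TEXT_BASE + (row * VGA_COLS + col) * 2u;
}

/* Cell value: character in the low byte, attribute in the high byte.
 * The attribute holds the foreground colour in bits 0-3 and the
 * background in bits 4-7. */
static uint16_t vga_cell(char ch, uint8_t fg, uint8_t bg)
{
    uint8_t attr = (uint8_t)((bg << 4) | (fg & 0x0F));
    return (uint16_t)(((uint16_t)attr << 8) | (uint8_t)ch);
}
```

For instance, a light-grey 'A' on black (fg 7, bg 0) in the top-left corner is the value 0x0741 stored at 0xB8000.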

Why not just use regular pointer access for MMIO? Two reasons. First, the compiler may reorder, coalesce, or eliminate stores it thinks are redundant — but every write to a device register has side effects. Second, the CPU cache must be bypassed; MMIO regions are mapped as uncacheable (UC) or write-combining (WC) via page table attributes.
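Sketch: the compiler half of that argument can be illustrated with volatile accessors. Everything here is hypothetical (fake_regs, reg_write32, reg_read32), and the "registers" are ordinary memory, so this shows only the compiler side. It does not provide the CPU ordering guarantees or the uncached mapping that readl()/writel() and ioremap() give a real driver.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register block for an imaginary device. */
struct fake_regs {
    volatile uint32_t ctrl;
    volatile uint32_t status;
};

/* volatile access: the compiler must emit this store, in program order
 * relative to other volatile accesses. */
static inline void reg_write32(volatile uint32_t *reg, uint32_t val)
{
    assert(reg != 0);
    *reg = val;
}

/* volatile access: the compiler must emit this load every time. */
static inline uint32_t reg_read32(const volatile uint32_t *reg)
{
    assert(reg != 0);
    return *reg;
}

/* With plain (non-volatile) memory the first store would look dead and
 * could be dropped; on a device it is the one that starts the operation. */
static void start_then_stop(struct fake_regs *regs)
{
    reg_write32(&regs->ctrl, 1u);
    reg_write32(&regs->ctrl, 0u);
}
```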

DMA solves the throughput problem. The CPU programs a DMA controller (or the device's built-in DMA engine) with a source address, destination address, and byte count. The device then transfers data directly to/from main memory over the bus. The CPU is free to do other work and gets an interrupt when the transfer completes. A modern NVMe SSD uses DMA for every block read/write — without it, a 7 GB/s SSD would saturate the CPU just moving bytes.
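Sketch: the division of labour looks like this in a toy model. fake_dma is a hypothetical engine with the three registers named above, and a done flag stands in for the completion interrupt. In this simulation the "device" is just memcpy running on the CPU; the point is the shape of the interface, not the performance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical DMA engine: source, destination, byte count. */
struct fake_dma {
    const void *src;
    void *dst;
    size_t count;
    bool done;      /* stands in for the completion interrupt */
};

/* CPU side: program the engine. Three register writes, no data copied. */
static void dma_program(struct fake_dma *d, const void *src, void *dst,
                        size_t count)
{
    assert(d && src && dst);
    d->src = src;
    d->dst = dst;
    d->count = count;
    d->done = false;
}

/* Device side: move the bytes, then signal completion. */
static void dma_run(struct fake_dma *d)
{
    memcpy(d->dst, d->src, d->count);
    d->done = true;
}
```

Between dma_program() and the completion signal, the CPU in a real system would be free to run other code.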

The IOMMU complication: DMA means a device can write to arbitrary physical memory, which is a security and stability risk. The IOMMU (Intel VT-d, AMD-Vi, ARM SMMU) translates device-visible addresses (IOVA) to physical addresses, giving each device a restricted view of memory. In Linux, dma_map_single() returns the DMA address the device should use: it sets up the IOMMU mapping when an IOMMU is in use and performs any cache maintenance that a non-coherent platform requires.
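Sketch: the restricted view can be modelled as a small per-device page table. This is a deliberately simplified, single-level table with 4 KB pages and hypothetical names (fake_iommu, iommu_map, iommu_translate); real IOMMUs use multi-level tables and permission bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT     12u                      /* 4 KB pages */
#define PAGE_MASK      ((1u << PAGE_SHIFT) - 1u)
#define NUM_IOVA_PAGES 16u                      /* toy-sized address space */

/* Hypothetical per-device table: one entry per IOVA page. */
struct fake_iommu {
    uint64_t phys_page[NUM_IOVA_PAGES];
    bool present[NUM_IOVA_PAGES];
};

/* Grant the device access to one physical page at a page-aligned IOVA. */
static void iommu_map(struct fake_iommu *mmu, uint64_t iova, uint64_t phys)
{
    uint64_t idx = iova >> PAGE_SHIFT;
    assert(idx < NUM_IOVA_PAGES);
    assert((iova & PAGE_MASK) == 0 && (phys & PAGE_MASK) == 0);
    mmu->phys_page[idx] = phys;
    mmu->present[idx] = true;
}

/* Returns true and fills *phys on success; false models an IOMMU fault
 * (the device touched an address it was never given). */
static bool iommu_translate(const struct fake_iommu *mmu, uint64_t iova,
                            uint64_t *phys)
{
    uint64_t idx = iova >> PAGE_SHIFT;
    if (idx >= NUM_IOVA_PAGES || !mmu->present[idx])
        return false;
    *phys = mmu->phys_page[idx] | (iova & PAGE_MASK);
    return true;
}
```

A device whose table holds a single mapping can reach exactly one page; every other IOVA faults instead of landing somewhere in physical memory.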

Rule of thumb: If your transfer size exceeds roughly 64 bytes (one cache line), DMA almost always beats PIO. For a 4 KB page, PIO requires about 1,000 CPU load/store instructions at 32 bits per access (4096 / 4 = 1024); DMA requires programming about 3 registers plus handling one interrupt. That is roughly a 100x reduction in CPU cycles consumed.
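Sketch: the arithmetic behind that comparison, under the same simple model. The helper names are hypothetical, and the functions count operations, not cycles. Handling an interrupt costs far more than a single instruction, which is one reason the cycle savings are quoted as a rough figure and not as the raw operation ratio.

```c
#include <assert.h>
#include <stddef.h>

/* CPU data accesses needed to move `bytes` by PIO, `width` bytes at a time. */
static size_t pio_accesses(size_t bytes, size_t width)
{
    assert(width > 0);
    return (bytes + width - 1) / width;   /* round up for a partial word */
}

/* CPU-side operations for DMA in this model: register writes + 1 interrupt. */
static size_t dma_cpu_ops(size_t setup_regs)
{
    return setup_regs + 1;
}
```

With a 4096-byte page and 4-byte accesses, pio_accesses() gives 1024, while dma_cpu_ops(3) gives 4, and the DMA figure stays constant as the transfer grows.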

Real-world example: When your NIC receives a packet, it DMA-writes the packet data into a ring buffer in main memory, updates a descriptor, and fires an MSI-X interrupt. The driver reads the descriptor (via MMIO or cached memory), processes the packet, and advances the ring pointer. The CPU never touches the bulk packet bytes during the transfer itself.
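Sketch: a descriptor ring in miniature. This is a simulation with hypothetical names and layout (rx_desc, rx_ring, nic_receive, drv_poll); real NIC descriptors carry DMA addresses and status bits defined by the hardware, and the device side here is ordinary C standing in for the NIC's DMA engine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 4u
#define BUF_SIZE  64u

/* Hypothetical RX descriptor: the device sets `done` after its DMA write. */
struct rx_desc {
    uint16_t len;
    bool done;
};

struct rx_ring {
    struct rx_desc desc[RING_SIZE];
    uint8_t buf[RING_SIZE][BUF_SIZE];  /* packet buffers in "main memory" */
    unsigned dev_head;                 /* next slot the device fills */
    unsigned drv_tail;                 /* next slot the driver consumes */
};

/* Device side: DMA-write a packet, then mark its descriptor complete. */
static bool nic_receive(struct rx_ring *r, const void *pkt, uint16_t len)
{
    struct rx_desc *d = &r->desc[r->dev_head];
    if (d->done || len > BUF_SIZE)
        return false;                  /* ring full or packet too big: drop */
    memcpy(r->buf[r->dev_head], pkt, len);
    d->len = len;
    d->done = true;                    /* a real NIC would now raise MSI-X */
    r->dev_head = (r->dev_head + 1) % RING_SIZE;
    return true;
}

/* Driver side: consume one completed descriptor and advance the ring.
 * Returns NULL when nothing is pending. A real driver would hand off or
 * copy the buffer before giving the slot back to the device. */
static const uint8_t *drv_poll(struct rx_ring *r, uint16_t *len)
{
    assert(r && len);
    struct rx_desc *d = &r->desc[r->drv_tail];
    if (!d->done)
        return NULL;
    *len = d->len;
    const uint8_t *data = r->buf[r->drv_tail];
    d->done = false;                   /* slot is free for the device again */
    r->drv_tail = (r->drv_tail + 1) % RING_SIZE;
    return data;
}
```

Note that the driver only inspects a small descriptor to learn that a packet arrived; the payload was already placed in memory by the device side.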

See it in action: Check out Lecture 5: Memory Mapped I/O by Embedded Systems and Deep Learning to see this theory applied.

Key Takeaway: MMIO lets the CPU talk to device registers through normal load/store instructions at special addresses, while DMA lets devices transfer bulk data directly to memory without CPU involvement, and the IOMMU ensures they can only touch memory they are permitted to access.
