GPU Architecture: The SIMT Execution Model

2026-04-29

You already understand SIMD — one instruction, multiple data lanes. GPUs take this idea and warp it into something stranger: Single Instruction, Multiple Threads (SIMT). Understanding SIMT is the key to understanding why GPU code performs brilliantly or terribly, with little in between.

In NVIDIA's architecture, the GPU groups 32 threads into a warp (AMD calls them wavefronts: 64 threads on GCN and CDNA hardware, 32 or 64 on RDNA). Every thread in a warp executes the same instruction at the same time, on its own data. This looks like SIMD, but there's a critical difference: each thread logically has its own program counter and register state, and can branch independently. The hardware then reconciles this illusion with reality.

Divergence is the killer. When threads in a warp hit a branch and some go left while others go right, the hardware doesn't split the warp into two independent groups. Instead, it masks off the threads that did not take the current path and runs both paths one after the other. A simple if/else where half the warp takes each side costs the time of both sides added together, with half the ALUs dark on each pass. Worst case, 32 threads taking 32 different paths serialize completely.
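To make the masking cost concrete, here is a minimal Python model of one warp executing an if/else. It is a sketch, not a description of any real scheduler: the function name run_if_else and the per-path cycle counts are made up for illustration, and it assumes a path costs the same number of cycles however many lanes are active on it.

```python
WARP_SIZE = 32

def run_if_else(predicates, then_cost, else_cost):
    """Model one warp executing `if p: A else: B` under SIMT masking.

    predicates: one bool per lane (True means the lane takes the `then` path).
    then_cost / else_cost: cycles per path (hypothetical units).
    Returns (total_cycles, lane_utilization).
    """
    assert len(predicates) == WARP_SIZE
    then_lanes = sum(predicates)
    else_lanes = WARP_SIZE - then_lanes

    cycles = 0   # cycles the whole warp spends in the branch
    useful = 0   # lane-cycles that do real work
    if then_lanes:                      # path is issued if any lane needs it
        cycles += then_cost
        useful += then_lanes * then_cost
    if else_lanes:
        cycles += else_cost
        useful += else_lanes * else_cost

    utilization = useful / (cycles * WARP_SIZE) if cycles else 1.0
    return cycles, utilization

uniform = run_if_else([True] * WARP_SIZE, 10, 10)
split = run_if_else([lane % 2 == 0 for lane in range(WARP_SIZE)], 10, 10)
print(uniform)  # (10, 1.0)
print(split)    # (20, 0.5)
```

The uniform warp pays for one path at full utilization. The evenly split warp pays for both paths and keeps only half its lanes busy.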

Rule of thumb: each divergent branch in a warp can double your effective execution time for that code section. If your kernel has 5 levels of nested divergent branches, there are 2^5 = 32 possible paths; in the worst case each path runs with a single active lane, and you could be utilizing as little as 1/32 of your hardware.
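The arithmetic behind that rule of thumb can be sketched in a few lines of Python. The function name worst_case_utilization is hypothetical, and the model assumes the worst case: every nested branch splits the active lanes evenly, and every path costs the same.

```python
WARP_SIZE = 32

def worst_case_utilization(nesting_levels, warp_size=WARP_SIZE):
    """Fraction of lanes doing useful work if every nested branch diverges.

    Each divergent level doubles the number of serialized paths, but a warp
    can never split into more paths than it has lanes.
    """
    paths = min(2 ** nesting_levels, warp_size)
    return 1 / paths

for levels in range(6):
    print(levels, worst_case_utilization(levels))
```

Five levels is where a 32-thread warp bottoms out: each of the 32 paths has one active lane, and deeper nesting cannot make utilization any worse in this model.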

The memory system is equally unforgiving. A warp issues a memory access as a group. If all 32 threads access consecutive, aligned 4-byte addresses, the hardware coalesces them into a single 128-byte transaction (32 × 4 bytes). If they access scattered addresses, you get up to 32 separate transactions, each moving far more bytes than the thread actually uses. On an NVIDIA A100 (80 GB model), peak global memory bandwidth is roughly 2 TB/s; scattered access can drop effective throughput to under 100 GB/s, a 20x penalty.
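You can estimate how many transactions a warp's access pattern needs by counting the distinct 128-byte segments its addresses fall into. The Python sketch below does exactly that. It is a simplified model: the function name count_transactions is made up, it assumes every access is a 4-byte load that does not straddle a segment boundary, and real hardware tracks memory at a finer granularity than whole 128-byte lines.

```python
WARP_SIZE = 32

def count_transactions(addresses, segment_bytes=128):
    """Number of distinct aligned segments touched by one warp-wide load."""
    return len({addr // segment_bytes for addr in addresses})

# Lane i reads the i-th consecutive 4-byte word: one segment.
coalesced = [4 * lane for lane in range(WARP_SIZE)]

# Same pattern, shifted by 64 bytes: now it straddles two segments.
misaligned = [64 + 4 * lane for lane in range(WARP_SIZE)]

# Lane i reads a word 128 bytes after lane i-1: one segment per lane.
strided = [128 * lane for lane in range(WARP_SIZE)]

print(count_transactions(coalesced))   # 1
print(count_transactions(misaligned))  # 2
print(count_transactions(strided))     # 32
```

In the strided case each lane uses 4 of the 128 bytes fetched on its behalf, which is where the wasted bandwidth goes.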

To hide memory latency, GPUs don't use deep out-of-order pipelines like CPUs. Instead, they rely on high occupancy: keeping many warps resident on each SM at once. An A100 SM (Streaming Multiprocessor) can hold up to 64 warps (2048 threads) simultaneously. When one warp stalls on memory, the scheduler switches to another at essentially zero cost, because every warp's register state stays resident rather than being saved and restored. This is why GPUs have enormous register files: an A100 SM has 256 KB of registers (65,536 32-bit registers), dwarfing any CPU core.
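Because every resident warp keeps its registers in the register file, a kernel's register usage caps how many warps can be resident. Here is a rough Python estimate using the A100 figures above (256 KB of 4-byte registers, 64 warps per SM). The function name resident_warps is hypothetical, and the model ignores register allocation granularity, shared memory, and block-size limits, all of which can reduce real occupancy further.

```python
REGISTER_FILE_BYTES = 256 * 1024             # per SM, from the figure above
REGISTERS_PER_SM = REGISTER_FILE_BYTES // 4  # 32-bit registers: 65536
MAX_WARPS_PER_SM = 64
WARP_SIZE = 32

def resident_warps(regs_per_thread):
    """Upper bound on resident warps per SM, limited by registers alone."""
    regs_per_warp = regs_per_thread * WARP_SIZE
    by_registers = REGISTERS_PER_SM // regs_per_warp
    return min(MAX_WARPS_PER_SM, by_registers)

for regs in (32, 64, 128):
    warps = resident_warps(regs)
    print(regs, warps, warps / MAX_WARPS_PER_SM)
# 32 regs/thread  -> 64 warps (full occupancy)
# 64 regs/thread  -> 32 warps (half)
# 128 regs/thread -> 16 warps (a quarter)
```

The trade-off this exposes: a kernel that uses more registers per thread leaves the scheduler fewer warps to switch to when one stalls.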

Real-world example: matrix multiplication maps perfectly to SIMT. In the simplest formulation, every thread computes one output element, all threads in a warp walk through the same loop iterations (no divergence), and memory access patterns tile naturally into coalesced loads. This is why tuned GEMM routines can reach 90% or more of theoretical peak FLOPS, while a naive graph traversal kernel might achieve 5%.
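To see why the loads coalesce, consider a naive C = A × B on row-major n × n float matrices, where a thread computing C[row][col] reads A[row][k] and B[k][col] at loop step k. The Python sketch below counts the 128-byte segments one warp touches at a single step. The layout, the function names, and the lane-to-element mappings are all assumptions chosen for illustration; tuned libraries use far more elaborate tiling.

```python
WARP_SIZE = 32
ELEM_BYTES = 4  # float32

def segments(addresses, segment_bytes=128):
    """Distinct aligned segments touched by one warp-wide load."""
    return len({a // segment_bytes for a in addresses})

def lanes_along_columns(n, row, first_col, k):
    """Lanes share a row and take consecutive columns of C."""
    cols = range(first_col, first_col + WARP_SIZE)
    a_addrs = [(row * n + k) * ELEM_BYTES for _ in cols]   # same address in every lane
    b_addrs = [(k * n + col) * ELEM_BYTES for col in cols]  # consecutive words
    return segments(a_addrs), segments(b_addrs)

def lanes_along_rows(n, first_row, col, k):
    """Lanes share a column and take consecutive rows of C."""
    rows = range(first_row, first_row + WARP_SIZE)
    a_addrs = [(r * n + k) * ELEM_BYTES for r in rows]      # stride of n words
    b_addrs = [(k * n + col) * ELEM_BYTES for _ in rows]    # same address in every lane
    return segments(a_addrs), segments(b_addrs)

print(lanes_along_columns(1024, row=0, first_col=0, k=3))  # (1, 1)
print(lanes_along_rows(1024, first_row=0, col=0, k=3))     # (32, 1)
```

Mapping consecutive lanes to consecutive columns keeps both loads to a single segment; mapping them to consecutive rows turns the read of A into 32 separate segments, even though the arithmetic is identical.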

See it in action: Check out Fundamentals of GPU Architecture: SIMT Core Part 1 by Nick to see this theory applied.

Key Takeaway: GPUs achieve massive throughput by running thousands of threads in lockstep warps, but divergent branching and scattered memory access can collapse performance by orders of magnitude. Writing for the SIMT model means keeping threads uniform in both control flow and data access.
