SIMD/Vector Processing: One Instruction, Many Results

2026-04-24

Your CPU has wide datapaths sitting idle most of the time. A 64-bit ALU processing an 8-bit pixel wastes 56 bits of silicon every cycle. SIMD (Single Instruction, Multiple Data) fixes this by packing multiple narrow operations into one wide register and executing them in parallel with a single instruction.

The idea is deceptively simple. Instead of adding two 32-bit integers, you load four pairs of 32-bit integers into 128-bit registers and add all four pairs simultaneously. Same instruction issue slot, same decode logic, four times the throughput. The hardware cost is modest: you need wider register files and ALUs with partitioned carry chains (so carries don't ripple across element boundaries), but you reuse the existing pipeline, scheduler, and retirement logic.
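
The partitioned carry chain can be imitated in software, which makes the idea easier to see. The sketch below is a plain-Python model, not real SIMD: it treats one 64-bit integer as eight 8-bit lanes and adds them without letting a carry cross a lane boundary. The names `pack`, `unpack`, and `packed_add_u8` are made up for illustration.

```python
LANES = 8                        # eight 8-bit lanes in one 64-bit word
MASK64 = 0xFFFF_FFFF_FFFF_FFFF   # Python ints are unbounded, so clamp to 64 bits
HIGH = 0x8080_8080_8080_8080     # top bit of every lane
LOW7 = 0x7F7F_7F7F_7F7F_7F7F     # low seven bits of every lane


def pack(values):
    """Pack eight 0..255 integers into one 64-bit word, lane 0 lowest."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack(word):
    """Split a 64-bit word back into eight 8-bit lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(LANES)]


def packed_add_u8(a, b):
    """Add eight 8-bit lanes at once; each lane wraps modulo 256."""
    # Add only the low 7 bits of every lane. The per-lane sum is at most
    # 0xFE, so it fits in 8 bits and cannot carry into the next lane.
    low = (a & LOW7) + (b & LOW7)
    # Fold the top bits back in with XOR, which adds without carrying out.
    return (low ^ ((a ^ b) & HIGH)) & MASK64
```

With lane 0 holding 255 in one word and 1 in the other, an ordinary 64-bit add leaks the carry into lane 1, so `unpack` shows a stray 1 there. `packed_add_u8` wraps lane 0 to 0 and leaves lane 1 untouched, which is the behavior a hardware packed-byte add gives you for free.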

The x86 SIMD evolution tells the real story. MMX (1997) gave you eight 64-bit registers aliased onto the x87 FPU registers, a hack that meant floating-point and SIMD code could not be mixed without an expensive state switch. SSE (1999) added dedicated 128-bit XMM registers for single-precision floats, fixing that disaster. SSE2 brought integer and double-precision ops into the XMM registers. Then AVX (2011) doubled the width to 256-bit YMM registers for floating point, AVX2 (2013) extended integer ops to the full 256 bits, and AVX-512 doubled again to 512-bit ZMM registers. Each widening roughly doubles peak throughput for vectorizable code, but only if your access patterns are contiguous and, ideally, your data is aligned.

Rule of thumb: SIMD theoretical speedup = register width ÷ element width. A 256-bit AVX register processing 32-bit floats yields 256÷32 = 8x throughput per instruction. In practice, expect 3-5x after accounting for shuffle overhead, alignment penalties, and scalar loop tails.
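
The rule of thumb and the cost of scalar loop tails can be put into a few lines of arithmetic. This is a rough sketch that counts loop iterations only; it ignores shuffles, alignment, and memory stalls. The function names are illustrative.

```python
def simd_lanes(register_bits, element_bits):
    """Elements per register, i.e. the theoretical speedup per instruction."""
    if register_bits % element_bits != 0:
        raise ValueError("element width must divide register width")
    return register_bits // element_bits


def split_loop(n, lanes):
    """Return (full vector iterations, leftover scalar iterations)."""
    return n // lanes, n % lanes


def iteration_ratio(n, lanes):
    """Scalar iteration count divided by vector-plus-tail iteration count."""
    if n <= 0:
        raise ValueError("n must be positive")
    vec, tail = split_loop(n, lanes)
    return n / (vec + tail)
```

`simd_lanes(256, 32)` gives the 8 from the rule above. For 1003 floats, `iteration_ratio(1003, 8)` is about 7.8, close to the ideal. For a 15-element array it drops to 1.875, because the 7-element scalar tail costs seven times as many iterations as the single vector step. Short arrays are one reason measured speedups fall well below the theoretical figure.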

ARM took a cleaner path. NEON (128-bit) shipped with consistent encoding from day one, avoiding x86's layered legacy mess. ARM's SVE (Scalable Vector Extension) went further: vector length is not baked into the instruction encoding. Code compiled for SVE runs unchanged on hardware with vectors anywhere from 128 to 2048 bits wide, in 128-bit steps. The vector length is fixed by the implementation and discovered at run time, and predicate masks handle loop tails. This is genuinely elegant: it decouples the ISA from the microarchitecture.
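
To make the vector-length-agnostic idea concrete, here is a small plain-Python model of a predicated loop. It is a sketch of the programming pattern, not of SVE itself: `vl` stands in for the lane count the hardware would report, `whilelt` is named after the SVE instruction that builds this kind of predicate, and `vla_add` is a hypothetical helper.

```python
def whilelt(start, n, vl):
    """Model of a loop predicate: lane i is active while start + i < n."""
    return [start + i < n for i in range(vl)]


def vla_add(a, b, vl):
    """Add two equal-length lists using vl lanes per step, for any vl >= 1."""
    n = len(a)
    out = [0] * n
    i = 0
    while i < n:
        pred = whilelt(i, n, vl)
        for lane, active in enumerate(pred):
            if active:                 # inactive lanes touch no memory
                out[i + lane] = a[i + lane] + b[i + lane]
        i += vl                        # step by whatever the lane count is
    return out
```

The same function body gives the same result for `vl` of 4, 16, or 64, and there is no separate scalar tail loop: the final step simply runs with some lanes switched off. That is the property that lets one binary run across different vector widths.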

The real bottleneck is rarely the ALU. SIMD shifts the problem to memory bandwidth. Processing 8 floats per cycle means you need to feed at least 8 floats per cycle. At 4 bytes each, that's 32 bytes/cycle for every input stream. A core running at 4 GHz needs 128 GB/s from L1 alone, which is why high-end cores give their L1 caches load ports up to 64 bytes wide. If your working set doesn't fit in L1, SIMD just makes you hit the memory wall faster.
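
The bandwidth figure is easy to reproduce and to extend. The sketch below assumes one vector operation per cycle and takes 1 GB as 10^9 bytes; the `streams` parameter is an addition for illustration, covering the case where an operation reads or writes more than one array.

```python
def required_bandwidth_gbs(lanes, element_bytes, clock_ghz, streams=1):
    """GB/s needed to keep one vector operation per cycle fed.

    streams is the number of arrays touched per operation, for example
    3 for c[i] = a[i] + b[i] (two loads and one store).
    """
    bytes_per_cycle = lanes * element_bytes * streams
    # bytes/cycle times 1e9 cycles/s per GHz gives bytes/s in units of 1e9.
    return bytes_per_cycle * clock_ghz
```

`required_bandwidth_gbs(8, 4, 4.0)` returns 128.0, matching the number above. Counting both inputs and the output of a simple vector add (`streams=3`) raises it to 384.0 GB/s, which shows how quickly even L1 bandwidth becomes the limit.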

Gather/scatter operations (loading from, or storing to, non-contiguous addresses with a single vector instruction) are notoriously slow, often 4-10x worse than contiguous accesses. AVX2's VPGATHERDD still issues one cache access per element internally, and AVX2 offers gathers only; scatters arrived with AVX-512. If your algorithm needs scattered access, SIMD may not help at all.
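
One way to see why scattered access hurts is to count how many cache lines a single vector's worth of elements touches. This is a simplified model that assumes 64-byte cache lines and says nothing about how any particular CPU implements gathers; `cache_lines_touched` is a hypothetical helper.

```python
CACHE_LINE_BYTES = 64   # assumed line size, typical but not universal


def cache_lines_touched(indices, element_bytes=4, base=0):
    """Count distinct cache lines covered by the elements at these indices."""
    lines = set()
    for i in indices:
        first = (base + i * element_bytes) // CACHE_LINE_BYTES
        last = (base + (i + 1) * element_bytes - 1) // CACHE_LINE_BYTES
        lines.update(range(first, last + 1))
    return len(lines)
```

Eight consecutive 4-byte floats starting at an aligned address sit in a single line, so `cache_lines_touched(range(8))` is 1. Eight floats at a stride of 16 elements land in eight different lines, so `cache_lines_touched(range(0, 128, 16))` is 8. The distinct-line count is only a lower bound on the work the memory system does, but it already accounts for most of the gap between a contiguous load and a gather.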

See it in action: Check out "Day 44: SIMD Vectorization in Go 1.26 – Unlocking Super-Charged Data Processing" by systemdrllp11 to see this theory applied.

Key Takeaway: SIMD multiplies ALU throughput cheaply by packing parallel operations into wide registers, but its real-world benefit is gated by memory bandwidth and data layout. The fastest arithmetic in the world can't help if you can't feed it data fast enough.
