2026-05-11
SIMD (Single Instruction, Multiple Data) lets one CPU instruction operate on multiple values packed into a wide register. On x86-64, SSE gives you 128-bit registers (xmm0-xmm15), AVX and AVX2 extend to 256 bits (ymm0-ymm15; AVX covers floating point, AVX2 adds the integer operations), and AVX-512 to 512 bits (zmm0-zmm31). ARM provides NEON (128-bit) and SVE (variable-length, up to 2048 bits).
A 256-bit AVX2 register holds 8 floats, 4 doubles, 32 bytes, or 16 shorts. One vaddps ymm0, ymm1, ymm2 instruction adds 8 pairs of floats in roughly the same time as a single scalar add. That's the theoretical 8x speedup — though memory bandwidth, dependency chains, and lane shuffles usually cut into it.
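To make the "8 pairs in one instruction" point concrete, here is a minimal sketch. The function name add8 is made up for illustration, and the target attribute assumes GCC or Clang on x86-64; other compilers need a different way to enable AVX.

```c
#include <immintrin.h>  /* x86 SIMD intrinsics */

/* Adds eight float pairs with a single 256-bit add.
   target("avx") lets GCC/Clang compile this function without -mavx;
   the caller must still make sure the CPU supports AVX.
   add8 is a hypothetical name used only for this example. */
__attribute__((target("avx")))
void add8(const float *a, const float *b, float *out)
{
    __m256 va = _mm256_loadu_ps(a);       /* load 8 floats, any alignment */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vsum = _mm256_add_ps(va, vb);  /* typically one vaddps on ymm registers */
    _mm256_storeu_ps(out, vsum);          /* store 8 results */
}
```

Calling add8(a, b, out) with three 8-element float arrays fills out with the element-wise sums.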
How you actually get SIMD code:
- Compiler auto-vectorization: -O3 -march=native is your friend. Use __restrict on pointers to help the compiler.
- Intrinsics: functions like _mm256_add_ps() map 1:1 to instructions but are written in C. Portable across compilers, painful across ISAs.

Real example: summing a float array. Scalar code processes one element per iteration. AVX2 code loads 8 floats into a ymm register, adds to an accumulator vector, and at the end does a horizontal reduction (sum-across-lanes) to get the final scalar. On Skylake, a strict scalar sum is limited by the 4-cycle add latency, so it manages about 0.25 elements/cycle; the AVX2 version with 4 accumulators to break the dependency chain can approach 8 elements/cycle while the data sits in L1. Compared with a scalar loop that also uses several accumulators, or once memory bandwidth becomes the limit, the gain shrinks to the real-world 4-8x win, and it's why BLAS, ffmpeg, and JSON parsers like simdjson live and die by SIMD.
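The float-array sum can be sketched roughly as follows. This is one possible way to write it, not a tuned implementation: the names sum_scalar and sum_avx are made up, the target attribute assumes GCC or Clang, and only AVX (not AVX2) instructions are needed for float adds.

```c
#include <immintrin.h>
#include <stddef.h>

/* Reference scalar loop: one add per element, a single dependency chain. */
float sum_scalar(const float *x, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Vector sum with four independent accumulators (hypothetical helper).
   The caller must check that the CPU supports AVX before calling. */
__attribute__((target("avx")))
float sum_avx(const float *x, size_t n)
{
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();

    /* Main loop: 32 floats per iteration, four separate add chains. */
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(x + i));
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(x + i + 8));
        acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(x + i + 16));
        acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(x + i + 24));
    }

    /* Combine the four accumulators into one vector. */
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                               _mm256_add_ps(acc2, acc3));

    /* Horizontal reduction: 8 lanes -> 4 -> 2 -> 1. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s4 = _mm_add_ps(lo, hi);
    __m128 s2 = _mm_add_ps(s4, _mm_movehl_ps(s4, s4));
    __m128 s1 = _mm_add_ss(s2, _mm_shuffle_ps(s2, s2, 0x1));
    float total = _mm_cvtss_f32(s1);

    /* Scalar tail for the last n % 32 elements. */
    for (; i < n; i++)
        total += x[i];
    return total;
}
```

Because the vector version adds elements in a different order, its result can differ from the scalar sum in the last bits for general floating-point data. That reordering is also why compilers will not auto-vectorize this loop unless you allow it (for example with -ffast-math).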
The alignment gotcha: Aligned loads (vmovaps) require 32-byte alignment for 256-bit AVX2 operands and fault on a misaligned address. Unaligned loads (vmovups) work on any address but historically cost more; on modern CPUs the penalty is small unless the load crosses a cache line (64 bytes). Allocate with aligned_alloc(32, size), where size should be a multiple of 32, or with posix_memalign.
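A small sketch of the allocation side, assuming a C11 library that provides aligned_alloc and GCC or Clang for the target attribute. The helper names alloc_floats_32, is_aligned_32 and scale_aligned are invented for this example.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocates room for n floats on a 32-byte boundary (hypothetical helper).
   C11 aligned_alloc expects the size to be a multiple of the alignment,
   so the byte count is rounded up. Release the buffer with free(). */
float *alloc_floats_32(size_t n)
{
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 31) / 32 * 32;
    if (rounded == 0)
        rounded = 32;
    return aligned_alloc(32, rounded);
}

/* Returns 1 if p is suitable for aligned 256-bit loads and stores. */
int is_aligned_32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}

/* Scales n floats in place using aligned loads and stores.
   Preconditions: p is 32-byte aligned, n is a multiple of 8,
   and the CPU supports AVX. */
__attribute__((target("avx")))
void scale_aligned(float *p, size_t n, float k)
{
    __m256 vk = _mm256_set1_ps(k);
    for (size_t i = 0; i < n; i += 8) {
        __m256 v = _mm256_load_ps(p + i);              /* aligned load */
        _mm256_store_ps(p + i, _mm256_mul_ps(v, vk));  /* aligned store */
    }
}
```

If the alignment of a pointer is not under your control, swapping in _mm256_loadu_ps and _mm256_storeu_ps is the safe default.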
Rule of thumb: Speedup ≈ (vector_width_bits / element_size_bits) × 0.5 to 0.8. So AVX2 floats: 8 × 0.6 ≈ 5x realistic. AVX-512 doubles: 8 × 0.5 ≈ 4x (and watch for downclocking — heavy AVX-512 drops core frequency 10-20% on older Intel chips, which can erase the win if surrounding code is scalar).
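The rule of thumb is simple enough to write down as a helper. This is only the arithmetic from the paragraph above, not a performance model; simd_speedup_estimate is a made-up name, and the efficiency argument is the 0.5 to 0.8 fudge factor.

```c
/* Estimated speedup = lanes * efficiency, where
   lanes = vector width in bits / element size in bits.
   Hypothetical helper that only restates the rule of thumb. */
double simd_speedup_estimate(int vector_bits, int element_bits, double efficiency)
{
    int lanes = vector_bits / element_bits;
    return lanes * efficiency;
}
```

For example, simd_speedup_estimate(256, 32, 0.6) gives about 4.8 for AVX2 floats, and simd_speedup_estimate(512, 64, 0.5) gives 4.0 for AVX-512 doubles.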
On Linux, check the flags line in /proc/cpuinfo (avx2, avx512f), or use __builtin_cpu_supports("avx2") (a GCC/Clang builtin) for runtime dispatch.
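Runtime dispatch might look something like this. It is a sketch assuming GCC or Clang on x86-64; the function names are illustrative, and the cached function pointer is not synchronized, so a multithreaded program would want to initialize it once up front.

```c
#include <immintrin.h>
#include <stddef.h>

/* Portable fallback that runs on any CPU. */
static void scale_scalar(float *x, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
}

/* AVX2-targeted version; AVX2 is enabled for this function only. */
__attribute__((target("avx2")))
static void scale_avx2(float *x, size_t n, float k)
{
    __m256 vk = _mm256_set1_ps(k);
    size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(x + i, _mm256_mul_ps(_mm256_loadu_ps(x + i), vk));
    for (; i < n; i++)  /* scalar tail */
        x[i] *= k;
}

/* Dispatcher: checks the CPU on first use and caches the choice.
   Not thread-safe as written (unsynchronized write to impl). */
void scale(float *x, size_t n, float k)
{
    static void (*impl)(float *, size_t, float) = NULL;
    if (!impl)
        impl = __builtin_cpu_supports("avx2") ? scale_avx2 : scale_scalar;
    impl(x, n, k);
}
```

Callers only ever see scale(), so the same binary runs on machines with and without AVX2.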
