Daily Low-Level Programming: SIMD and Vector Instructions: Doing Four Things at Once

2026-05-11

SIMD (Single Instruction, Multiple Data) lets one CPU instruction operate on multiple values packed into a wide register. On x86-64, SSE gives you 128-bit registers (xmm0-xmm15), AVX widens them to 256 bits (ymm0-ymm15) with AVX2 adding 256-bit integer operations, and AVX-512 goes to 512 bits (zmm0-zmm31). ARM provides NEON (128-bit) and SVE (variable-length, up to 2048 bits).

A 256-bit ymm register holds 8 floats, 4 doubles, 32 bytes, or 16 shorts (16-bit integers). One vaddps ymm0, ymm1, ymm2 instruction adds 8 pairs of floats in roughly the same time as a single scalar add. That's the theoretical 8x speedup, though memory bandwidth, dependency chains, and lane shuffles usually cut into it.
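To make the 8-lane add concrete, here is a minimal sketch in C using intrinsics. The function name add8 is made up for illustration, and the target attribute is a GCC/Clang extension that enables AVX for this one function; the caller is assumed to have checked that the CPU supports AVX.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Adds 8 pairs of floats with a single 256-bit vector add.
   add8 is a hypothetical helper name. The caller must make sure
   the CPU supports AVX before calling it. */
__attribute__((target("avx")))
void add8(const float *a, const float *b, float *out)
{
    assert(a != NULL && b != NULL && out != NULL);
    __m256 va = _mm256_loadu_ps(a);       /* unaligned load of 8 floats */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vsum = _mm256_add_ps(va, vb);  /* normally becomes one vaddps */
    _mm256_storeu_ps(out, vsum);          /* unaligned store of 8 floats */
}
```

Each lane is added independently, so out[i] ends up as a[i] + b[i] for i from 0 to 7.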

How you actually get SIMD code:

Auto-vectorization: The compiler converts loops to SIMD when it can prove no aliasing, no early exits, and contiguous memory access. On GCC and Clang, -O3 -march=native is your friend (at the cost of a binary tuned to the build machine). Use __restrict on pointers to help the compiler rule out aliasing.
Intrinsics: Functions like _mm256_add_ps() map almost 1:1 to instructions but are written in C. Portable across compilers, painful across ISAs.
Libraries: Highway, xsimd, or std::experimental::simd hide the per-ISA details.
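For the auto-vectorization route, a loop shaped like the one below is the kind compilers tend to handle well. This is a sketch only: the saxpy name and signature are illustrative, and whether the loop is actually vectorized depends on your compiler and flags.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] += a * x[i] for n elements.
   Contiguous access, no early exit, and __restrict promises the
   compiler that x and y do not overlap. */
void saxpy(float *__restrict y, const float *__restrict x, float a, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++) {
        y[i] += a * x[i];
    }
}
```

To see whether the loop was vectorized, GCC can report it with -fopt-info-vec and Clang with -Rpass=loop-vectorize.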

Real example: summing a float array. Scalar code processes one element per iteration. AVX2 code loads 8 floats into a ymm register, adds them to an accumulator vector, and at the end does a horizontal reduction (sum across lanes) to get the final scalar. On Skylake, a tight scalar sum with a single accumulator manages only ~0.25 elements/cycle, because each add must wait out the 4-cycle latency of the previous one. The AVX2 version with 4 accumulators to break the dependency chain can approach ~8 elements/cycle while the data sits in L1 cache. On larger arrays memory bandwidth takes over, which is where the real-world 4-8x win comes from, and it's why BLAS, ffmpeg, and JSON parsers like simdjson live and die by SIMD.
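The 4-accumulator sum described above could look roughly like this. Treat it as a sketch under assumptions: sum_avx is an illustrative name, the code uses unaligned loads so it accepts any pointer, and float adds only need AVX (which every AVX2 CPU has), so the function is compiled for AVX via the GCC/Clang target attribute.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>

/* Sums n floats using 4 independent 8-lane accumulators.
   The caller must make sure the CPU supports AVX. */
__attribute__((target("avx")))
float sum_avx(const float *data, size_t n)
{
    assert(data != NULL || n == 0);
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps();
    __m256 acc3 = _mm256_setzero_ps();

    size_t i = 0;
    /* Main loop: 32 floats per iteration, 4 independent add chains. */
    for (; i + 32 <= n; i += 32) {
        acc0 = _mm256_add_ps(acc0, _mm256_loadu_ps(data + i));
        acc1 = _mm256_add_ps(acc1, _mm256_loadu_ps(data + i + 8));
        acc2 = _mm256_add_ps(acc2, _mm256_loadu_ps(data + i + 16));
        acc3 = _mm256_add_ps(acc3, _mm256_loadu_ps(data + i + 24));
    }

    /* Fold the 4 accumulators into one vector. */
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1),
                               _mm256_add_ps(acc2, acc3));

    /* Horizontal reduction: 8 lanes -> 4 -> 2 -> 1. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s = _mm_add_ps(lo, hi);
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
    float total = _mm_cvtss_f32(s);

    /* Scalar tail for the remaining 0..31 elements. */
    for (; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

Because the additions happen in a different order than in a scalar loop, the result can differ from the scalar sum in the last few bits. That reordering is also why compilers will not vectorize a float sum on their own unless you relax floating-point rules.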

The alignment gotcha: Aligned loads (vmovaps) require 32-byte alignment for 256-bit ymm operands and fault on anything else. Unaligned loads (vmovups) work on any address but historically cost more; on modern CPUs the penalty is small unless the load crosses a cache line (64 bytes). Allocate with aligned_alloc(32, size), keeping size a multiple of 32, or with posix_memalign.
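A small sketch of the allocation side, assuming a C11 toolchain that provides aligned_alloc. The helper names alloc_floats_32 and is_aligned_32 are invented for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocates room for n floats on a 32-byte boundary.
   The byte count is rounded up to a multiple of 32, since C11
   aligned_alloc expects size to be a multiple of the alignment.
   Release the buffer with free(). */
float *alloc_floats_32(size_t n)
{
    assert(n <= (SIZE_MAX - 31) / sizeof(float));  /* no overflow */
    size_t bytes = n * sizeof(float);
    size_t rounded = (bytes + 31) & ~(size_t)31;
    if (rounded == 0) {
        rounded = 32;  /* avoid a zero-size request */
    }
    return aligned_alloc(32, rounded);
}

/* Returns 1 if p is suitable for aligned 256-bit loads and stores. */
int is_aligned_32(const void *p)
{
    return ((uintptr_t)p % 32) == 0;
}
```

Checking is_aligned_32 before choosing between aligned and unaligned loads is one option; using unaligned loads everywhere on aligned data is another, and on recent CPUs it usually costs little.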

Rule of thumb: Speedup ≈ (vector_width_bits / element_size_bits) × 0.5 to 0.8. So AVX2 floats: 8 × 0.6 ≈ 5x realistic. AVX-512 doubles: 8 × 0.5 ≈ 4x (and watch for downclocking — heavy AVX-512 drops core frequency 10-20% on older Intel chips, which can erase the win if surrounding code is scalar).
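The rule of thumb is simple enough to write down as a function. This is only the arithmetic from the paragraph above; estimated_speedup is a made-up name, and the efficiency factor is a guess you supply, not something you can measure this way.

```c
#include <assert.h>

/* Rule-of-thumb estimate: lanes per register times an efficiency
   factor, typically somewhere between 0.5 and 0.8. */
double estimated_speedup(int vector_bits, int element_bits, double efficiency)
{
    assert(element_bits > 0 && vector_bits % element_bits == 0);
    int lanes = vector_bits / element_bits;
    return lanes * efficiency;
}
```

For example, estimated_speedup(256, 32, 0.6) covers the AVX2 float case (8 lanes) and estimated_speedup(512, 64, 0.5) covers AVX-512 doubles (also 8 lanes).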

On Linux, check the flags in /proc/cpuinfo (avx2, avx512f); for runtime dispatch, use __builtin_cpu_supports("avx2") on GCC or Clang.
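Runtime dispatch with __builtin_cpu_supports might be sketched like this. All function names here are hypothetical, and the AVX2 variant is a stand-in: it is the same loop compiled with AVX2 enabled for that one function, where a real kernel would use intrinsics.

```c
#include <assert.h>
#include <stddef.h>

typedef float (*sum_fn)(const float *, size_t);

/* Portable fallback that runs on any x86-64 CPU. */
static float sum_scalar(const float *data, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

/* Stand-in for an AVX2 kernel: same loop, but this one function is
   compiled with AVX2 enabled (GCC/Clang target attribute). */
__attribute__((target("avx2")))
static float sum_avx2(const float *data, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

/* Picks an implementation based on what the running CPU reports. */
sum_fn select_sum(void)
{
    return __builtin_cpu_supports("avx2") ? sum_avx2 : sum_scalar;
}

/* Caches the choice on first use. Not thread-safe as written. */
float sum_dispatch(const float *data, size_t n)
{
    static sum_fn impl = NULL;
    if (impl == NULL) {
        impl = select_sum();
    }
    assert(impl != NULL);
    return impl(data, n);
}
```

The point of the function pointer is that the CPU check happens once, not on every call, and the AVX2 code path is never executed on a machine that lacks it.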

See it in action: Check out Sasha Goldshtein's talk "The Vector in Your CPU: Exploiting SIMD for Superscalar Performance" from DotNext (a conference for .NET developers) to see this theory applied.

Key Takeaway: SIMD trades scalar simplicity for 4-8x throughput by packing multiple values into wide registers and operating on them with a single instruction, but only if your data is contiguous, ideally aligned, and free of loop-carried dependencies you cannot break up.
