Stack Overflow Unanswered: AVX2 intrinsics code portability issue with Intel ICX & gcc - resulting code is **much** slower than MSVC


2026-05-08


Tags: c, gcc, benchmarking, intrinsics, icx

Score: 5 | Views: 121

The asker has hand-tuned AVX2 intrinsics that perform beautifully under MSVC, but when compiled with gcc the code runs ~2× slower, and with Intel's ICX (the LLVM-based successor to ICC) it's a brutal 3–5× slower. They suspect the non-MSVC compilers are recognizing the intrinsic sequence as a known idiom and rewriting it into a "canonical" form that defeats their micro-optimizations.

Why this is hard. Intrinsics specify what a computation must produce, not which instructions must be emitted. MSVC tends to emit instructions almost 1:1 with the intrinsic stream, treating them much like inline assembly with a register allocator on top. gcc and clang/ICX, by contrast, lower intrinsics into IR and let the optimizer do its thing: combining adjacent shuffles, hoisting loads, recognizing reduction patterns, even swapping instructions for "equivalent" ones with different latency/throughput trade-offs. When you've already chosen a specific shuffle/blend sequence because of measured port pressure on a particular µarch, the optimizer's "improvement" can be a regression.
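To make the "lowered into IR" point concrete, here is a minimal sketch assuming gcc or clang. In their headers, simple intrinsics such as _mm256_add_epi32 are typically defined as ordinary vector arithmetic, so the optimizer sees the same operation whether you write the intrinsic or a GNU vector-extension `+`. The function names and the v8i32_sketch typedef are hypothetical, and the target attribute is only there so the sketch builds without -mavx2.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical typedef for this sketch: eight 32-bit ints as a GNU vector. */
typedef int32_t v8i32_sketch __attribute__((vector_size(32)));

/* Version 1: the AVX2 intrinsic. */
__attribute__((target("avx2")))
void add_with_intrinsic(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m256i x = _mm256_loadu_si256((const __m256i *)a);
    __m256i y = _mm256_loadu_si256((const __m256i *)b);
    _mm256_storeu_si256((__m256i *)out, _mm256_add_epi32(x, y));
}

/* Version 2: plain vector arithmetic, no intrinsic in sight. */
__attribute__((target("avx2")))
void add_with_vector_ext(const int32_t *a, const int32_t *b, int32_t *out)
{
    v8i32_sketch x, y, z;
    memcpy(&x, a, sizeof x);   /* unaligned-safe load */
    memcpy(&y, b, sizeof y);
    z = x + y;                 /* element-wise add on the vector type */
    memcpy(out, &z, sizeof z);
}
```

Comparing the -S output of the two functions is a quick way to check how much of the intrinsic survives as a distinct instruction choice on a given compiler.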

Sketch of an approach.

Diagnose first. Emit assembly from each compiler and diff the inner loop: -S -masm=intel on gcc and ICX, /FA on MSVC. Then profile with perf stat -d (Linux) or VTune to see whether the regression is front-end (decoder/uop cache), back-end (port contention), or memory (extra loads/spills).
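Pin the µarch. Use the same target on every compiler, for example -march=haswell on gcc and ICX and /arch:AVX2 on MSVC. With only -mavx2, gcc enables the instructions but keeps generic scheduling (-mtune=generic). -march=native ties the result to the build machine, so record which CPU produced each number.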
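Block the rewriter. If a specific sequence is being fused, an empty asm statement that takes the vector as a read-write operand, such as asm volatile("" : "+x"(v)), makes the value opaque at that point, so the operations on either side cannot be merged. asm volatile("" ::: "memory") only constrains memory accesses and does little for values held in registers, and the __builtin_ia32_* builtins are what many of gcc's intrinsics already expand to, so they offer no extra protection. As a last resort, drop the hottest 5–10 instructions to inline asm with explicit register constraints.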
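Check ICX flags. ICX inherits clang's aggressive vectorizer; -fno-slp-vectorize, -fno-vectorize, and -mprefer-vector-width=256 are worth toggling. These mostly affect the scalar code around the intrinsics rather than the intrinsics themselves. If AVX-512 is enabled (for example via -march=native or -xHost on Skylake-X), ICX may also vectorize the surrounding loops with 512-bit ops, with possible downclocking penalties.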
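Look for FMA contraction. When FMA is enabled (-mfma or -march=haswell), gcc and clang-based compilers may fuse a separate multiply and add into a single FMA, depending on their -ffp-contract default, which changes both the rounding and the latency of the dependency chain. -ffp-contract=off can rule this out.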
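The "block the rewriter" step can be sketched as follows, assuming gcc or a clang-based compiler such as ICX (MSVC has no equivalent inline asm on x64). OPAQUE_YMM and shuffle_with_barrier are hypothetical names for illustration. The empty asm emits no instructions but forces the vector into a register and hides its value from the optimizer, so the in-lane shuffle before it and the lane-crossing permute after it cannot be folded into one combined shuffle. The target attribute is only there so the sketch builds without -mavx2.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: zero-instruction barrier that makes v opaque here. */
#define OPAQUE_YMM(v) __asm__ volatile("" : "+x"(v))

__attribute__((target("avx2")))
void shuffle_with_barrier(const int32_t *in, int32_t *out)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);

    /* In-lane: swap adjacent 32-bit elements (vpshufd). */
    v = _mm256_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1));

    /* Without this, the compiler is free to merge both shuffles. */
    OPAQUE_YMM(v);

    /* Lane-crossing: swap the two 128-bit halves (vpermq). */
    v = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1, 0, 3, 2));

    _mm256_storeu_si256((__m256i *)out, v);
}
```

The barrier also blocks legitimate optimizations such as constant folding and scheduling across that point, so it is worth confirming in the -S output that it changed the instruction sequence and in a benchmark that the change helped.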

Gotchas. "Equivalent" is in the eye of the beholder: a vpshufb + vpor sequence the asker chose so that the shuffles land on port 5 while the OR can issue on another port may be "simplified" into a single lane-crossing permute such as vpermd, which on Haswell/Skylake has 3-cycle latency against 1 cycle for vpshufb. Also, MSVC's apparent edge sometimes vanishes once gcc gets -O3 -funroll-loops -fno-tree-vectorize applied to the surrounding scalar glue. The regression may not be in the intrinsics themselves but in how the wrapper code is scheduled around them.
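For readers who have not met the vpshufb + vpor idiom, here is a minimal sketch of the kind of sequence at stake; it is an illustration, not the asker's code, and interleave_low_bytes and ZERO_BYTE are hypothetical names. Each vpshufb places bytes from one source and zeroes the slots belonging to the other (a control byte with the high bit set yields zero), and the OR merges the two. A pattern-matching optimizer may well recognize this particular case as a byte interleave and emit a different instruction, which is exactly the kind of substitution described above.

```c
#include <immintrin.h>
#include <stdint.h>

#define ZERO_BYTE ((char)0x80)  /* high bit set: vpshufb writes 0 here */

/* Per 128-bit lane: out = a0,b0,a1,b1,...,a7,b7 (low 8 bytes of each). */
__attribute__((target("avx2")))
void interleave_low_bytes(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    const char Z = ZERO_BYTE;
    const __m256i pick_a = _mm256_setr_epi8(
        0, Z, 1, Z, 2, Z, 3, Z, 4, Z, 5, Z, 6, Z, 7, Z,
        0, Z, 1, Z, 2, Z, 3, Z, 4, Z, 5, Z, 6, Z, 7, Z);
    const __m256i pick_b = _mm256_setr_epi8(
        Z, 0, Z, 1, Z, 2, Z, 3, Z, 4, Z, 5, Z, 6, Z, 7,
        Z, 0, Z, 1, Z, 2, Z, 3, Z, 4, Z, 5, Z, 6, Z, 7);

    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);

    /* Two in-lane byte shuffles (vpshufb) merged with one OR (vpor). */
    __m256i r = _mm256_or_si256(_mm256_shuffle_epi8(va, pick_a),
                                _mm256_shuffle_epi8(vb, pick_b));

    _mm256_storeu_si256((__m256i *)out, r);
}
```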

The challenge: Intrinsics promise hardware-level control, but in practice MSVC comes closest to honoring that promise. gcc and ICX treat intrinsics as ordinary operations that they are free to rewrite, which can turn "portable" SIMD code into a research project in compiler-defeating micro-pragmas.
