2026-05-08
The asker has hand-tuned AVX2 intrinsics that perform beautifully under MSVC, but when compiled with gcc the code runs ~2× slower, and with Intel's ICX (the LLVM-based successor to ICC) it's a brutal 3–5× slower. They suspect the non-MSVC compilers are recognizing the intrinsic sequence as a known idiom and rewriting it into a "canonical" form that defeats their micro-optimizations.
Why this is hard. Intrinsics aren't really opaque — they're hints. MSVC tends to emit instructions almost 1:1 with the intrinsic stream, treating them as inline assembly with a register allocator on top. gcc and clang/ICX, by contrast, lower intrinsics into IR and let the optimizer do its thing: combining adjacent shuffles, hoisting loads, recognizing reduction patterns, even swapping instructions for "equivalent" ones with different latency/throughput trade-offs. When you've already chosen a specific shuffle/blend sequence because of measured port pressure on a particular µarch, the optimizer's "improvement" can be a regression.
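A minimal sketch of the kind of rewrite described above, assuming gcc or clang/ICX on x86-64. The function name double_reverse is hypothetical, and the target attribute is used only so the snippet builds without -mavx2 on the command line. The two shuffles cancel each other, so a compiler that lowers intrinsics to IR is free to delete both. Whether a given compiler version actually does so should be confirmed in its -S output, not assumed.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical example: reverse the four dwords of each 128-bit lane,
   then reverse them again. The net effect is the identity, so an
   IR-level optimizer may emit no shuffle at all, whereas a compiler
   that maps intrinsics 1:1 would emit two vpshufd instructions. */
__attribute__((target("avx2")))
void double_reverse(const int32_t *in, int32_t *out)
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    v = _mm256_shuffle_epi32(v, 0x1B); /* dwords 3,2,1,0 within each lane */
    v = _mm256_shuffle_epi32(v, 0x1B); /* reversed again: original order */
    _mm256_storeu_si256((__m256i *)out, v);
}
```

Here the fold is harmless, even helpful. The same mechanism becomes a problem when the two steps were deliberately kept separate for scheduling reasons.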
Sketch of an approach.
1. Compile with -S -masm=intel on gcc and ICX (/FA on MSVC) and diff the inner loop. Run under perf stat -d or VTune to see whether the regression is front-end (decoder/uop cache), back-end (port contention), or memory (extra loads/spills).
2. Pass -march=haswell or -march=native consistently. Without an explicit -march or -mtune, gcc tunes for a generic x86-64 target with conservative scheduling, even when AVX2 is enabled.
3. Insert asm volatile("" ::: "memory") between the steps you want kept apart, or use the __builtin_ia32_* equivalents, to discourage reordering. As a last resort, drop the hottest 5–10 instructions to inline asm with explicit register constraints.
4. -fno-slp-vectorize, -fno-vectorize (clang/ICX spellings; gcc uses -fno-tree-slp-vectorize and -fno-tree-vectorize), and -mprefer-vector-width=256 are worth toggling. ICX may also be re-vectorizing already-vectorized code into 512-bit ops on Skylake-X with downclocking penalties.
5. The compiler may be contracting mul+add into FMA, changing latency. -ffp-contract=off can rule this out.
Gotchas. "Equivalent" is in the eye of the beholder: a vpshufb + vpor pair the asker chose for port 5 + port 0/1 parallelism can be "simplified" to a single lane-crossing vpermd with higher latency. Also, MSVC's apparent edge sometimes vanishes once gcc gets -O3 -funroll-loops -fno-tree-vectorize applied to the surrounding scalar glue; the regression may not be in the intrinsics themselves but in how the wrapper code is scheduled around them.
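The barrier idea from the approach sketch can be illustrated as follows, under stated assumptions. The memory-clobber form quoted above only constrains memory accesses. The variant below is a swapped-in technique, not something the original post confirms: an empty asm statement that claims to read and write the vector register itself (the "+x" constraint in gcc/clang extended asm), which makes the value opaque to the optimizer. OPAQUE_VEC and reverse_in_two_steps are hypothetical names, and the snippet assumes gcc or clang on x86-64.

```c
#include <assert.h>
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: emits no instructions, but tells the compiler
   that v is read and modified in a vector register, so shuffles on
   either side of it cannot be combined. */
#define OPAQUE_VEC(v) __asm__ __volatile__("" : "+x"(v))

/* Reverses the dwords of each 128-bit lane in two separate steps.
   Without the barrier an optimizer may merge them into one shuffle. */
__attribute__((target("avx2")))
void reverse_in_two_steps(const int32_t *in, int32_t *out, size_t n)
{
    assert(n % 8 == 0); /* sketch only: no scalar tail handling */
    for (size_t i = 0; i < n; i += 8) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(in + i));
        v = _mm256_shuffle_epi32(v, 0xB1); /* swap dwords within each pair */
        OPAQUE_VEC(v);                     /* keep the two steps separate */
        v = _mm256_shuffle_epi32(v, 0x4E); /* swap 64-bit halves per lane */
        _mm256_storeu_si256((__m256i *)(out + i), v);
    }
}
```

For the input 0,1,2,3,4,5,6,7 the output is 3,2,1,0,7,6,5,4. In this toy case the merged single shuffle would be the faster choice. The point is only the mechanism: the barrier also blocks beneficial optimizations across it, so it is worth applying narrowly and re-measuring after each change.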
