AVX2 intrinsics code portability issue with Intel ICX & gcc - resulting code is **much** slower than MSVC

2026-05-08

Stack Overflow: View Question

Tags: c, gcc, benchmarking, intrinsics, icx

Score: 5 | Views: 121

The asker has hand-tuned AVX2 intrinsics that perform beautifully under MSVC, but when compiled with gcc the code runs ~2× slower, and with Intel's ICX (the LLVM-based successor to ICC) it's a brutal 3–5× slower. They suspect the non-MSVC compilers are recognizing the intrinsic sequence as a known idiom and rewriting it into a "canonical" form that defeats their micro-optimizations.

Why this is hard. Intrinsics aren't really opaque; to most compilers they are hints. MSVC tends to emit instructions almost 1:1 with the intrinsic stream, treating it much like inline assembly with a register allocator on top. gcc and clang/ICX, by contrast, lower intrinsics into IR and let the optimizer do its thing: combining adjacent shuffles, hoisting loads, recognizing reduction patterns, even swapping instructions for "equivalent" ones with different latency/throughput trade-offs. When you've already chosen a specific shuffle/blend sequence because of measured port pressure on a particular µarch, the optimizer's "improvement" can be a regression.
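To make "combining adjacent shuffles" concrete, here is a minimal sketch (not the asker's code) of two in-lane dword shuffles and the single shuffle they compose to. The function names are hypothetical, and the `target` attribute is gcc/clang syntax used so the snippet builds without `-mavx2`. Whether a given compiler version actually performs the fold is an assumption to check in the generated assembly. The point is only that the two forms are indistinguishable by result, so an optimizer working on IR is entitled to pick either.

```c
#include <immintrin.h>
#include <stdint.h>

/* Two back-to-back shuffles, as a hand-tuner might write them.
   Each immediate permutes the four dwords inside every 128-bit lane. */
__attribute__((target("avx2")))
void shuffle_two_step(const int32_t in[8], int32_t out[8])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    v = _mm256_shuffle_epi32(v, 0xB1);  /* picks [1,0,3,2]: swap neighbours   */
    v = _mm256_shuffle_epi32(v, 0x4E);  /* picks [2,3,0,1]: swap 64-bit halves */
    _mm256_storeu_si256((__m256i *)out, v);
}

/* The composition of the two permutations above: [3,2,1,0] per lane.
   An optimizer that sees both shuffles in IR may emit this instead. */
__attribute__((target("avx2")))
void shuffle_one_step(const int32_t in[8], int32_t out[8])
{
    __m256i v = _mm256_loadu_si256((const __m256i *)in);
    v = _mm256_shuffle_epi32(v, 0x1B);
    _mm256_storeu_si256((__m256i *)out, v);
}
```

Both functions reverse the dwords within each 128-bit lane. Comparing their disassembly under MSVC, gcc, and ICX is a quick way to see which compiler kept the two-instruction form that was written.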

Sketch of an approach.

Gotchas. "Equivalent" is in the eye of the beholder: a vpshufb + vpor pair the asker chose so that the two instructions can issue on different ports (port 5 and port 0/1) can be "simplified" to a single vpermd with higher latency. Also, MSVC's apparent edge sometimes vanishes once gcc is run with -O3 -funroll-loops -fno-tree-vectorize, which changes how the surrounding scalar glue is compiled. In that case the regression was not in the intrinsics themselves but in how the wrapper code was scheduled around them.
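One commonly used way to stop gcc or clang from rewriting a chosen sequence is an empty inline-asm statement that takes the vector as a read-write operand. The sketch below is a hypothetical stand-in, not the asker's kernel: it merges the even bytes of one vector with the odd bytes of another using vpshufb + vpor, which is a pattern an optimizer could otherwise replace with a blend. The macro `OPAQUE_YMM` and the function name are made up for illustration, and the `__asm__` and `__attribute__` syntax is gcc/clang only.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: emits no instruction, but tells the compiler the
   register was read and modified, so it cannot reason across this point. */
#define OPAQUE_YMM(v) __asm__("" : "+x"(v))

__attribute__((target("avx2")))
void merge_even_odd_pinned(const uint8_t a[32], const uint8_t b[32],
                           uint8_t out[32])
{
    /* vpshufb masks: index (i & 15) keeps a byte in place within its
       128-bit lane, 0x80 zeroes the destination byte. */
    uint8_t ma[32], mb[32];
    for (int i = 0; i < 32; i++) {
        ma[i] = (i & 1) ? 0x80 : (uint8_t)(i & 15);  /* keep even bytes of a */
        mb[i] = (i & 1) ? (uint8_t)(i & 15) : 0x80;  /* keep odd bytes of b  */
    }
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i lo = _mm256_shuffle_epi8(va, _mm256_loadu_si256((const __m256i *)ma));
    __m256i hi = _mm256_shuffle_epi8(vb, _mm256_loadu_si256((const __m256i *)mb));

    /* Hide both shuffle results from the optimizer before the OR, so the
       shuffle + or sequence cannot be pattern-matched as a unit. */
    OPAQUE_YMM(lo);
    OPAQUE_YMM(hi);

    _mm256_storeu_si256((__m256i *)out, _mm256_or_si256(lo, hi));
}
```

The barrier has a cost: it also blocks useful optimizations such as constant folding across that point, so it is best placed only around the few instructions whose selection was actually measured. MSVC does not accept this syntax, so real code would need to hide the macro behind a compiler check.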

The challenge: Intrinsics promise hardware-level control, but MSVC is the compiler that comes closest to honoring that promise. gcc and ICX treat them as suggestions and freely rewrite them, making "portable" SIMD code a research project in compiler-defeating micro-pragmas.
