Inline Assembly and Compiler Intrinsics

2026-04-22

Sometimes the compiler isn't enough. You need a specific CPU instruction — a hardware CRC, an atomic compare-and-swap, a cache flush — and C doesn't expose it. You have two tools: inline assembly and compiler intrinsics. Knowing when to use each is what separates controlled precision from self-inflicted debugging nightmares.

Inline assembly embeds raw machine instructions inside C/C++ code. GCC and Clang use extended asm syntax:

uint64_t rdtsc_value;
asm volatile ("rdtsc; shlq $32, %%rdx; orq %%rdx, %%rax"
    : "=a" (rdtsc_value)  /* output: result in RAX */
    :                      /* no inputs */
    : "rdx"                /* clobbers RDX */
);

This reads the CPU timestamp counter — something with no C equivalent. The critical parts of the syntax are:

Get the clobber list wrong and the compiler will silently overwrite a live register. This is the single most common inline asm bug. Rule of thumb: if your inline asm block touches more than 3 registers, you're probably better off writing a standalone .S file.

Compiler intrinsics are the pragmatic alternative. They look like function calls but compile to specific instructions, and the compiler still understands the data flow — it can allocate registers, reorder safely, and optimize around them:

// x86 popcount via intrinsic
#include <immintrin.h>
int count = _mm_popcnt_u64(bitmask);

// ARM CRC32
#include <arm_acle.h>
uint32_t crc = __crc32cb(init, byte);

Real-world example: Linux's arch/x86/include/asm/bitops.h implements ffs() (find-first-set-bit) using the bsf instruction via inline asm. Doing this in pure C requires a loop or a lookup table — the single instruction is 10-20x faster on hot paths like scheduler bitmask scanning. At 1 GHz, bsf takes ~3 cycles (3 ns) versus a naive loop averaging 16 iterations at 1 cycle each (16 ns).

When to choose which:

Always compile with -S and inspect the output. Verify the compiler emitted exactly what you intended. Trust but verify.

See it in action: Check out A walkthrough guide to implementing a compiler intrinsic by FOSDEM to see this theory applied.
Key Takeaway: Prefer compiler intrinsics for specific CPU instructions whenever available; reserve inline assembly for cases where you need exact instruction control the compiler cannot provide, and always verify the generated output with -S.

All newsletters