2026-05-07
Link: https://easylang.online/blog/qsort
HN Discussion: 2 points, 0 comments
Quicksort is one of the most studied algorithms in computer science — taught in every undergraduate curriculum, implemented in every standard library, and the subject of countless optimization papers. So when someone claims they can make it meaningfully faster by removing a single language construct, that's worth a closer look.
The premise here is delightfully concrete: branches are expensive on modern CPUs. When the branch predictor guesses wrong, the pipeline stalls and dozens of cycles evaporate. In a partitioning loop where the comparison outcome is essentially random (especially on already-disordered data), the predictor has no signal to latch onto, and the mispredict rate creeps toward 50%. That's catastrophic for a hot loop.
The fix is branchless programming: replace the conditional swap with arithmetic that produces the same result regardless of which value is larger. Something like:
This pattern shows up in cryptographic code (where it's mandatory for constant-time operations to avoid timing side-channels), in SIMD-heavy numerical kernels, and increasingly in mainstream sorts — Andrei Alexandrescu's talks on this topic and the BlockQuicksort paper from 2016 are the canonical references. Modern pdqsort implementations in Rust and C++ use similar techniques.
What makes this particular post interesting is the venue — easylang is a teaching language, which means the author is demonstrating these techniques in an accessible context rather than buried in C++ template metaprogramming or hand-tuned assembly. For developers who've heard "branches are slow" but never seen the transformation applied to an algorithm they already understand, this is a great gateway.
The deeper lesson worth internalizing: algorithmic complexity is not the only thing that matters. Two O(n log n) sorts can differ by 2-3x in real-world performance based purely on how friendly they are to the CPU's branch predictor and cache hierarchy. The textbook view of algorithms — counting comparisons and swaps — was calibrated for hardware that no longer exists.
For anyone writing performance-sensitive code, this reframes optimization away from "find a better algorithm" toward "make the existing algorithm friendlier to the silicon."
