SRT Division: How Hardware Divides Numbers (and Why the Pentium Got It Wrong)

2026-06-06

Multiplication is easy in hardware — you can compress partial products in parallel with a Wallace tree. Division is the opposite: it's fundamentally serial. Each quotient digit depends on the previous remainder. SRT division (Sweeney, Robertson, Tocher) is how every modern FPU squeezes more bits out of each cycle.

The naive approach — restoring division — is just paper-and-pencil long division in binary. At each step, subtract the divisor from the partial remainder. If the result is negative, "restore" by adding the divisor back, and the quotient bit is 0. Otherwise the quotient bit is 1. One bit per cycle. Slow, and the restore step is wasteful.

Non-restoring division skips the restore: if the remainder goes negative, the next step adds the divisor instead of subtracting, and the quotient digit is -1 instead of +1. Quotient digits are now in {-1, +1}, which gets converted to standard binary at the end.

SRT division generalizes this to a redundant digit set like {-2, -1, 0, +1, +2} for radix-4. The redundancy is the trick: because multiple digit choices can produce a valid remainder, you don't need an exact comparison with the divisor — you only need to inspect the top few bits of the partial remainder and divisor and look up the next quotient digit in a small table. Radix-4 SRT generates 2 quotient bits per cycle; radix-8 generates 3, at the cost of a bigger table and harder multiples (×3 and ×5 aren't shift-and-add).

The Pentium FDIV bug (1994) is the canonical SRT cautionary tale. Intel's Pentium used a radix-4 SRT lookup table with 2048 entries. Five entries were missing — accidentally programmed to 0 instead of +2. For most divisions the lookup never hit those cells, but for specific operand patterns (the famous example: 4195835 / 3145727) the wrong digit got selected, producing an answer wrong starting around the 5th significant decimal digit. Intel ate a $475M charge replacing chips.

Rule of thumb: A radix-r SRT divider produces log₂(r) quotient bits per cycle. For a 53-bit double-precision mantissa, radix-4 needs ~27 cycles, radix-8 needs ~18, radix-16 needs ~14. Doubling the radix roughly halves the latency but quadruples the lookup table and forces uglier divisor multiples.

Modern FPUs (Intel Skylake, AMD Zen) use radix-16 SRT or hybrid Newton-Raphson schemes. Division is still 10-25× slower than multiplication, which is why compilers reciprocal-multiply by constants whenever they can.

Key Takeaway: SRT division uses a redundant digit set so a small lookup table — not an exact compare — picks each quotient digit, letting hardware produce multiple quotient bits per cycle; getting even five table entries wrong cost Intel half a billion dollars.

All newsletters