2026-05-05
You already know shift registers — they move data one bit per clock edge. But when your CPU executes x << 7, it doesn't burn seven cycles. It does the entire shift combinationally in one cycle. The circuit that pulls this off is the barrel shifter.
The naive approach — a giant N-to-1 mux per output bit — works but explodes in area. For a 32-bit shifter, each of 32 output bits would need a 32-input mux. That's roughly 1024 mux inputs. Painful.
The clever approach is the logarithmic barrel shifter. You decompose the shift amount into powers of two and build a stage per bit of the shift amount. For a 32-bit shifter, the shift amount is 5 bits: shamt[4:0]. You build five stages:
shamt[0]shamt[1]shamt[2]shamt[3]shamt[4]Each stage is just a row of 2:1 muxes — one per bit. A shift of 13 (01101) cascades through stages: shift 1, no shift, shift 4, shift 8 — total 13. Five mux delays end-to-end.
Real-world example: ARM's classic three-operand instruction ADD R0, R1, R2, LSL #3 adds R1 to (R2 shifted left by 3) in a single cycle. The barrel shifter sits in series with the ALU input. This is why ARM assembly feels so expressive — the shifter is free in the datapath. The same block also handles rotates by feeding the shifted-out bits back to the empty positions.
Rule of thumb for cost: An N-bit logarithmic barrel shifter uses N × log₂(N) 2:1 muxes. For 32 bits: 32 × 5 = 160 muxes — versus ~1024 for the brute-force version. The delay is log₂(N) mux delays, so a 64-bit shifter only adds one more stage (six total) over a 32-bit one.
Watch out for two design subtleties: arithmetic right shifts must replicate the sign bit into vacated positions (extra mux input per stage), and funnel shifters — used in unaligned memory access and rotates — concatenate two registers before shifting, doubling the input width but keeping the same stage count.
