2026-06-05
Stack Overflow: View Question
Tags: conv-neural-network, fpga, hardware-acceleration, fixed-point
Score: 1 | Views: 33
The asker is designing a CNN convolution accelerator on FPGA using Q1.15 fixed-point arithmetic. Inputs are normalized to [0, 1) as Q1.15, weights to [-0.5, 0.5) also as Q1.15. The math is clean for a single MAC: Q1.15 × Q1.15 = Q2.30, and accumulating N such products grows the integer portion by roughly log2(N) bits. The problem is what to do with that fat accumulator when feeding the next layer, which also expects Q1.15.
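A minimal sketch of the single-MAC arithmetic described above, using plain Python integers to stand in for the hardware registers. The helper names `to_q15` and `from_q30` are illustrative, not from the question.

```python
def to_q15(x):
    """Quantize a real value to a raw Q1.15 integer, clamping to the int16 range."""
    return max(-32768, min(32767, round(x * (1 << 15))))

def from_q30(raw):
    """Interpret a raw integer as a value with 30 fractional bits."""
    return raw / (1 << 30)

x = to_q15(0.75)    # input activation
w = to_q15(-0.25)   # weight
product = x * w     # raw Q2.30: the fractional bit counts add, 15 + 15 = 30

print(x, w, product, from_q30(product))
```

The product carries 30 fractional bits, so converting it back to a real value divides by 2^30 rather than 2^15.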
Why it's hard: A 3×3×C conv with C=64 means 576 MACs per output pixel. The accumulator naturally wants to be Q11.30 (about 41 bits) to avoid overflow. You can't just drop the top bits (the value wraps around on overflow), and you can't just shift right by 15 and clamp to Q1.15 (any sum whose magnitude exceeds 1.0 saturates to ±1.0). And unlike floating point, the answer to "where does the binary point live after the next layer" is not automatic: it depends on the actual dynamic range of activations, which differs per layer.
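The width estimate and the two naive narrowing strategies can be checked with a short sketch. It assumes full-range Q1.15 operands (worst case, every product at +1.0) rather than the asker's restricted ranges, and the sample accumulator value of 3.25 is made up for illustration.

```python
FRAC_BITS = 30  # fractional bits of a Q1.15 x Q1.15 product

def signed_width(max_magnitude):
    """Bits needed to hold +max_magnitude in two's complement."""
    return max_magnitude.bit_length() + 1

def wrap_to_int16(value):
    """Keep only the low 16 bits, which is what dropping the top bits does."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def saturate_to_int16(value):
    """Clamp to the int16 range instead of wrapping."""
    return max(-32768, min(32767, value))

num_macs = 3 * 3 * 64
worst_case = num_macs * (1 << 15) * (1 << 15)  # assumption: every product is +1.0
acc_bits = signed_width(worst_case)
print(num_macs, acc_bits, f"Q{acc_bits - FRAC_BITS}.{FRAC_BITS}")

acc = int(3.25 * (1 << 30))   # hypothetical accumulator holding 3.25
shifted = acc >> 15           # now has 15 fractional bits, but too many integer bits
print(wrap_to_int16(shifted) / (1 << 15))      # wraps to a wrong, negative value
print(saturate_to_int16(shifted) / (1 << 15))  # clamps just below 1.0
```

Neither result resembles 3.25, which is why the output needs a per-layer scale rather than a fixed shift.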
The standard approach in fixed-point DNN inference is per-layer requantization:
- For each layer, choose an integer multiplier M and shift s that map the accumulator's actual observed range back into Q1.15. These are calibrated offline from a representative dataset (this is what TFLite, QNNPACK, and Google's integer-only quantization paper by Jacob et al. do).
- At inference time, compute (acc * M) >> s using a fixed multiplier, which is cheap in hardware (on the order of one DSP slice).

Gotchas:
- Q1.15's range [-1, 1) is signed. ReLU activations are non-negative, so you waste a bit of dynamic range unless you switch to UQ0.16 or an asymmetric zero-point after ReLU.
- Round rather than truncate: add (1 << (s-1)) before the shift.
- Restricting weights to [-0.5, 0.5) throws away a precision bit. Most quantization schemes let each layer pick its own scale, so weights use the full Q1.15 range.
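The requantization step can be sketched as follows. This is a simplified illustration, not the exact TFLite or QNNPACK scheme: the function names, the 15-bit multiplier width, and the calibration rule (map the largest observed accumulator magnitude to 32767) are all assumptions, and the code assumes the resulting shift is at least 1.

```python
import math

def choose_multiplier_and_shift(max_abs_acc, multiplier_bits=15):
    """Pick integers (M, s) so that (acc * M) >> s maps max_abs_acc to about 32767."""
    target = 32767 / max_abs_acc               # real-valued scale factor
    mantissa, exponent = math.frexp(target)    # target = mantissa * 2**exponent
    multiplier = round(mantissa * (1 << multiplier_bits))
    shift = multiplier_bits - exponent
    if multiplier == (1 << multiplier_bits):   # rounding pushed the mantissa to 1.0
        multiplier >>= 1
        shift -= 1
    return multiplier, shift

def requantize(acc, multiplier, shift):
    """Scale a wide accumulator to Q1.15 with rounding and saturation."""
    product = acc * multiplier
    rounded = (product + (1 << (shift - 1))) >> shift  # add half an LSB, then shift
    return max(-32768, min(32767, rounded))

# Hypothetical calibration result: largest |acc| seen on the dataset was 3.25 (30 frac bits).
max_abs_acc = int(3.25 * (1 << 30))
M, s = choose_multiplier_and_shift(max_abs_acc)
print(M, s)
print(requantize(max_abs_acc, M, s))        # largest calibrated value lands at full scale
print(requantize(max_abs_acc // 2, M, s))   # half of it lands near half scale
print(requantize(2 * max_abs_acc, M, s))    # out-of-range values saturate
```

In hardware, M and s would be per-layer constants, so the runtime cost is one multiply, one add for rounding, one shift, and a clamp.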