Stack Overflow Unanswered: How to use FixedPoint16 for DNNs on FPGAs?


2026-06-05


Tags: conv-neural-network, fpga, hardware-acceleration, fixed-point

Score: 1 | Views: 33

The asker is designing a CNN convolution accelerator on FPGA using Q1.15 fixed-point arithmetic. Inputs are normalized to [0, 1) as Q1.15, weights to [-0.5, 0.5) also as Q1.15. The math is clean for a single MAC: Q1.15 × Q1.15 = Q2.30, and accumulating N such products grows the integer portion by roughly log2(N) bits. The problem is what to do with that fat accumulator when feeding the next layer, which also expects Q1.15.
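As a minimal Python sketch of that single-MAC arithmetic, plain integers can stand in for the hardware registers. The helper names to_q15 and q30_to_float are illustrative and do not come from the question.

```python
Q15_FRAC_BITS = 15
Q15_MIN = -(1 << 15)      # integer code for -1.0
Q15_MAX = (1 << 15) - 1   # integer code for the largest value just under +1.0


def to_q15(x: float) -> int:
    """Quantize a real number to a Q1.15 integer code, saturating at the range ends."""
    code = round(x * (1 << Q15_FRAC_BITS))
    return max(Q15_MIN, min(Q15_MAX, code))


def q30_to_float(code: int) -> float:
    """Interpret an integer code with 30 fractional bits as a real number."""
    return code / (1 << 30)


activation = to_q15(0.75)      # 24576
weight = to_q15(-0.25)         # -8192
product = activation * weight  # Q2.30 code: fractional bits add, 15 + 15 = 30

print(product, q30_to_float(product))   # -201326592 -0.1875

# The only product that needs the second integer bit is (-1.0) * (-1.0) = +1.0.
print(q30_to_float(Q15_MIN * Q15_MIN))  # 1.0
```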

Why it's hard: A 3×3×C conv with C=64 means 576 MACs per output pixel. To rule out overflow in the worst case, the accumulator needs to be Q11.30 (41 bits): 30 fractional bits from the Q2.30 products, plus 10 magnitude bits and a sign bit to hold sums up to ±576. You can't just drop the top bits (large sums would wrap), and you can't just shift right by 15 and clamp (every sum outside [-1, 1) would saturate to ±1.0). And unlike floating point, the answer to "where does the binary point live after the next layer" is not automatic: it depends on the actual dynamic range of activations, which differs per layer.
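The accumulator width can be worked out mechanically. The sketch below assumes signed two's-complement operands and sizes for the absolute worst case, where every product is (-1.0) × (-1.0); accumulator_format is a hypothetical helper written for this example.

```python
def accumulator_format(num_macs: int, int_bits: int = 1, frac_bits: int = 15):
    """Return (integer_bits, fractional_bits, total_bits) for a worst-case accumulator.

    Assumes both operands are signed Q<int_bits>.<frac_bits> values.
    """
    # Largest product magnitude is the most negative operand squared:
    # (-2^(int_bits-1))^2 = 2^(2*(int_bits-1)), which is 1.0 for Q1.15.
    max_sum = num_macs << (2 * (int_bits - 1))
    # Bits needed to hold +max_sum, plus one sign bit.
    integer_bits = max_sum.bit_length() + 1
    fractional_bits = 2 * frac_bits
    return integer_bits, fractional_bits, integer_bits + fractional_bits


print(accumulator_format(1))           # (2, 30, 32)  -> a single product is Q2.30
print(accumulator_format(3 * 3 * 64))  # (11, 30, 41) -> Q11.30 for 576 MACs
```

Real activations and weights rarely hit the worst case, which is why the calibrated range, not this bound, decides how the result is scaled back down.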

The standard approach in fixed-point DNN inference is per-layer requantization:

Keep the wide accumulator (e.g., 32–48 bits) through the entire dot product. Don't round mid-sum; rounding noise compounds.
After the bias add and activation, apply a per-layer scale factor M and shift s that map the accumulator's actual observed range back into Q1.15. These are calibrated offline from a representative dataset (this is the approach used by TFLite and QNNPACK, and described in Google's integer-arithmetic-only inference paper by Jacob et al.).
Implement the rescale as (acc * M) >> s with a constant integer multiplier. This is cheap in hardware: roughly one DSP slice when the operands fit its multiplier width, or a few cascaded slices for a wide accumulator.
Saturate (not wrap) on the way down to Q1.15 to handle the rare outliers gracefully.
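The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of TFLite or any other library: the names quantize_scale and requantize, the 15-bit multiplier width, and the calibrated range of ±8.0 are all made up for the example.

```python
import math

Q15_MIN = -(1 << 15)
Q15_MAX = (1 << 15) - 1


def quantize_scale(real_scale: float, multiplier_bits: int = 15):
    """Approximate real_scale as M / 2**s, with M in [2^(bits-1), 2^bits)."""
    # frexp returns real_scale = mantissa * 2**exponent, mantissa in [0.5, 1).
    mantissa, exponent = math.frexp(real_scale)
    multiplier = round(mantissa * (1 << multiplier_bits))
    shift = multiplier_bits - exponent
    if multiplier == (1 << multiplier_bits):  # rounding pushed the mantissa to 1.0
        multiplier >>= 1
        shift -= 1
    return multiplier, shift


def requantize(acc: int, multiplier: int, shift: int) -> int:
    """Map a wide accumulator code to a Q1.15 code: (acc * M) >> s, rounded, saturated."""
    assert shift >= 1, "this sketch only handles right shifts"
    scaled = acc * multiplier
    # Add half an LSB before the arithmetic shift (round half up).
    rounded = (scaled + (1 << (shift - 1))) >> shift
    # Saturate instead of wrapping.
    return max(Q15_MIN, min(Q15_MAX, rounded))


# Assumption: offline calibration found this layer's sums stay within +/-8.0,
# so the layer scale is 1/8. The accumulator has 30 fractional bits and the
# output has 15, which contributes another factor of 2^-15.
multiplier, shift = quantize_scale((1.0 / 8.0) / (1 << 15))
print(multiplier, shift)                           # 16384 32

print(requantize(4 << 30, multiplier, shift))      # 16384  (4.0 / 8 = 0.5)
print(requantize(10 << 30, multiplier, shift))     # 32767  (outlier, saturated)
print(requantize(-(10 << 30), multiplier, shift))  # -32768 (outlier, saturated)
```

The next layer then has to know that its Q1.15 inputs carry an implied scale of 8, which is the bookkeeping that per-layer calibration produces.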

Gotchas:

Symmetric vs. asymmetric: Q1.15 with range [-1, 1) is symmetric and signed. ReLU activations are non-negative, so you waste a bit of dynamic range unless you switch to UQ0.16 or an asymmetric zero-point after ReLU.
Rounding mode: Truncation (arithmetic shift right) biases toward negative infinity and visibly hurts accuracy on deep networks. Use round-to-nearest-even, or at minimum add (1 << (s-1)) before the shift.
Batch norm folding: BN scales must be folded into the conv weights before quantization, or you'll bleed precision.
Weight range: Constraining weights to [-0.5, 0.5) throws away a precision bit. Most quantization schemes let each layer pick its own scale, so weights use the full Q1.15 range.
The first and last layers are usually the most sensitive to quantization and often need higher precision than the rest of the network (in INT8 pipelines, INT16 or even FP). Accuracy regressions very often trace back here.
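To make the rounding-mode gotcha concrete, here is a small Python comparison of the three right-shift variants. The function names are hypothetical, and the measurement is a toy sweep over a symmetric integer range rather than real network data.

```python
def shift_truncate(value: int, shift: int) -> int:
    """Plain arithmetic shift right: rounds toward negative infinity."""
    return value >> shift


def shift_round_half_up(value: int, shift: int) -> int:
    """Add half an LSB, then shift: ties round toward positive infinity."""
    return (value + (1 << (shift - 1))) >> shift


def shift_round_half_even(value: int, shift: int) -> int:
    """Round to nearest, with ties going to the even result."""
    floor = value >> shift
    remainder = value - (floor << shift)  # always in [0, 2**shift)
    half = 1 << (shift - 1)
    if remainder > half or (remainder == half and (floor & 1)):
        floor += 1
    return floor


def mean_error(round_fn, shift: int = 4, lo: int = -1024, hi: int = 1024) -> float:
    """Average of (rounded result - exact quotient) over the integers in [lo, hi)."""
    errors = [round_fn(v, shift) - v / (1 << shift) for v in range(lo, hi)]
    return sum(errors) / len(errors)


print(mean_error(shift_truncate))         # -0.46875
print(mean_error(shift_round_half_up))    # 0.03125
print(mean_error(shift_round_half_even))  # 0.0
```

Truncation is off by almost half an output LSB on average, always in the same direction. Adding half an LSB removes most of that bias, and round-to-nearest-even removes the remaining tie bias.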

The challenge: Fixed-point CNN inference isn't really about Q-format arithmetic — it's about per-layer requantization with offline-calibrated scale factors, which is the part textbooks skip but determines whether the accelerator actually matches the reference model.
