How to use FixedPoint16 for DNNs on FPGAs?

2026-06-05

Stack Overflow: View Question

Tags: conv-neural-network, fpga, hardware-acceleration, fixed-point

Score: 1 | Views: 33

The asker is designing a CNN convolution accelerator on FPGA using Q1.15 fixed-point arithmetic. Inputs are normalized to [0, 1) as Q1.15, weights to [-0.5, 0.5) also as Q1.15. The math is clean for a single MAC: Q1.15 × Q1.15 = Q2.30, and accumulating N such products grows the integer portion by roughly log2(N) bits. The problem is what to do with that fat accumulator when feeding the next layer, which also expects Q1.15.
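The single-MAC arithmetic can be sketched in a few lines of Python. This is only an illustration of the Q-format bookkeeping, not the asker's RTL; the helper names (`to_q15`, `mac`, `q30_to_float`) are made up for this example, and Python's unbounded integers stand in for a wide hardware accumulator.

```python
FRAC_BITS = 15                                  # Q1.15: 1 sign bit + 15 fractional bits
Q15_MIN, Q15_MAX = -(1 << 15), (1 << 15) - 1

def to_q15(x: float) -> int:
    """Quantize a real value to Q1.15: round to nearest, then clamp."""
    return max(Q15_MIN, min(Q15_MAX, round(x * (1 << FRAC_BITS))))

def mac(acc: int, a_q15: int, w_q15: int) -> int:
    """Multiply-accumulate. Each product carries 15 + 15 = 30 fractional bits."""
    return acc + a_q15 * w_q15

def q30_to_float(acc: int) -> float:
    """Interpret an accumulator with 30 fractional bits as a real number."""
    return acc / (1 << 30)

a = to_q15(0.75)      # input in [0, 1)
w = to_q15(-0.25)     # weight in [-0.5, 0.5)
acc = mac(0, a, w)                               # 0.75 * -0.25
acc2 = mac(acc, to_q15(0.5), to_q15(0.25))       # ... + 0.5 * 0.25
print(a, w, q30_to_float(acc), q30_to_float(acc2))
```

The raw integers are what the hardware sees; the binary point exists only in how the designer interprets them, which is why the format has to be tracked by hand from layer to layer.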

Why it's hard: A 3×3×C conv with C=64 means 576 MACs per output pixel. Each product has magnitude at most 1.0, so the sum can reach ±576 and the accumulator wants to be about Q11.30 (41 bits) to rule out overflow. You can't just drop the top bits (large sums wrap around), and you can't just shift right by 15 and clamp to Q1.15 (every sum with magnitude 1.0 or more saturates to ±1.0). And unlike floating point, the answer to "where does the binary point live after the next layer" is not automatic: it depends on the actual dynamic range of activations, which differs per layer.
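To make the width estimate and the two failure modes concrete, here is a rough Python sketch. It assumes the worst-case bound that each product has magnitude at most 1.0, and it uses a hand-picked accumulator value of 3.25 as a stand-in for a real convolution sum; all names are illustrative.

```python
N_MACS = 3 * 3 * 64                      # 576 products per output pixel
ACC_FRAC_BITS = 30                       # fractional bits of a Q1.15 x Q1.15 product

# Worst case |sum| <= 576, so the magnitude needs N_MACS.bit_length() bits,
# plus one sign bit.
int_bits = N_MACS.bit_length() + 1       # 11
acc_width = int_bits + ACC_FRAC_BITS     # 41

def wrap_to_int16(x: int) -> int:
    """Keep only the low 16 bits, as dropping the top bits would do in hardware."""
    return ((x + (1 << 15)) & 0xFFFF) - (1 << 15)

def clamp_to_int16(x: int) -> int:
    """Saturate to the Q1.15 range instead of wrapping."""
    return max(-(1 << 15), min((1 << 15) - 1, x))

acc = 13 << 28                           # real value 13 / 4 = 3.25 with 30 fractional bits
shifted = acc >> 15                      # now 15 fractional bits, but too many integer bits

wrapped = wrap_to_int16(shifted)         # wraps around to a negative value
clamped = clamp_to_int16(shifted)        # pinned just below +1.0

print(acc_width, wrapped / (1 << 15), clamped / (1 << 15))
```

Neither result resembles 3.25: wrapping gives -0.75 and clamping gives roughly 0.99997. The information is only preserved if the output is allowed to carry a different scale than the input.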

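The standard approach in fixed-point DNN inference is per-layer requantization: keep the full-width accumulator through the sum, then multiply by a scale factor calibrated offline for that layer, round, and saturate back to Q1.15.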
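One common way to realize requantization with integer-only hardware is a fixed-point multiplier followed by a rounding right shift and a saturating clamp. The sketch below assumes that scheme; it is not taken from the question, and `out_scale`, `make_requant_params`, and `requantize` are hypothetical names. Here `out_scale` stands for the largest activation magnitude seen for the layer during offline calibration, so the Q1.15 output represents the real value divided by `out_scale`.

```python
Q15_MIN, Q15_MAX = -(1 << 15), (1 << 15) - 1

def make_requant_params(out_scale: float, mult_bits: int = 15):
    """Convert a calibrated output range into (integer multiplier, right shift).

    Assumption: out_scale >= 1.0, so the multiplier fits in mult_bits + 1 bits.
    """
    multiplier = round((1.0 / out_scale) * (1 << mult_bits))
    shift = 15 + mult_bits       # 30 -> 15 fractional bits, plus the multiplier's own bits
    return multiplier, shift

def requantize(acc_q30: int, multiplier: int, shift: int) -> int:
    """Scale a wide accumulator (30 fractional bits) down to a saturated Q1.15 value."""
    scaled = acc_q30 * multiplier
    rounded = (scaled + (1 << (shift - 1))) >> shift   # round half up
    return max(Q15_MIN, min(Q15_MAX, rounded))

multiplier, shift = make_requant_params(out_scale=4.0)   # layer outputs span about [-4, 4)
out = requantize(13 << 28, multiplier, shift)            # accumulator holds 3.25
print(out, out / (1 << 15))                              # Q1.15 code and its value, 3.25 / 4

saturated = requantize(5 << 30, multiplier, shift)       # 5.0 exceeds the calibrated range
print(saturated)
```

The next layer then has to account for the factor of `out_scale`, typically by folding it into its own requantization multiplier, so the per-layer scales are part of the model just as much as the weights are. A power-of-two `out_scale` reduces the multiply to a pure shift, at the cost of a coarser fit to the calibrated range.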

Gotchas:

The challenge: Fixed-point CNN inference isn't really about Q-format arithmetic — it's about per-layer requantization with offline-calibrated scale factors, which is the part textbooks skip but determines whether the accelerator actually matches the reference model.
