Gradient-based Automatic Per-Weight Mixed Precision Quantization for Neural Networks On-Chip

Chang Sun*¹,², Thea K. Årrestad¹, Vladimir Loncar³,⁴, Jennifer Ngadiuba⁵, Maria Spiropulu²

¹ ETH Zürich, Zürich, Switzerland
² California Institute of Technology, Pasadena, CA, USA
³ Massachusetts Institute of Technology, Cambridge, MA, USA
⁴ Institute of Physics Belgrade, Serbia
⁵ Fermi National Accelerator Laboratory, Batavia, IL, USA

Email: {chang.sun, thea.aarrestad, vladimir.loncar, jennifer.ngadiuba, maria.spiropulu}@cern.ch
*: Corresponding author
Abstract—Model size and inference speed at deployment time are major challenges in many deep learning applications. A promising strategy to overcome these challenges is quantization. However, straightforward uniform quantization to very low precision can result in significant accuracy loss. Mixed-precision quantization, based on the idea that certain parts of the network can accommodate lower precision than others without compromising performance, offers a potential solution. In this work, we present High Granularity Quantization (HGQ), a quantization-aware training method that automatically fine-tunes the per-weight and per-activation precision of ultra-low-latency, low-power neural networks to be deployed on FPGAs. We demonstrate that HGQ can outperform existing methods by a substantial margin, achieving resource reduction by up to a factor of 20 and latency improvement by a factor of 5 while preserving accuracy.
I. INTRODUCTION
Edge computing has significantly increased the importance
of real-time deep neural network (DNN) inference on spe-
cialized hardware [1]. The typical latency threshold for real-
time inference is
O
(1)
ms [2], [3], [4]. Nevertheless, certain
domains require sub-microsecond inference times. At the
CERN Large Hadron Collider (LHC), detectors generate tens
of terabytes of data every second from collisions occurring
every 25 nanoseconds. This data throughput is managed by a
real-time selection system, the trigger. This system determines
the fate of each collision event - whether it should be preserved
for analysis or discarded - with a decision-making latency
ceiling of a few microseconds [5], [6]. The trigger’s precision
is vital to retain only interesting events, thereby managing the
bandwidth effectively and reducing the event rate significantly.
The system consists of O(1000) field-programmable gate arrays (FPGAs) mounted on custom boards. Several algorithms run in parallel on each FPGA. As a result, resources are scarce and the memory footprint of each algorithm should be minimal. In anticipation of the LHC's upgrade to the High-Luminosity LHC (HL-LHC) [7], which will increase the collision rate by a factor of 2 to 3 compared to the current one [5], [6], machine learning techniques are being explored to enhance the speed and accuracy of the computational tasks in the hardware trigger.
However, integrating demanding models - when resource
consumption and latency are strictly limited - without compro-
mising performance is a hurdle. Efforts in recent years have
focused on algorithmic efficiency, with strategies ranging from
the design of compact networks to weight pruning and quan-
tization [8], [9]. Quantization converts model parameters into
lower-precision formats, causing some loss in performance.
Although post-training quantization is computationally cheaper to perform, it generally incurs a significant loss in performance compared to the full-precision baseline. To mitigate this degradation, quantization-aware training has been proposed, in which the network is trained while adhering to a fixed numerical precision.
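To illustrate the general idea of quantization-aware training (not the specific HGQ algorithm introduced later), the sketch below applies a fake-quantization operator in the forward pass while letting gradients pass through unchanged, the common straight-through-estimator construction. It assumes TensorFlow; the helper name fake_quantize and the chosen fixed-point convention (sign bit counted in the integer part, as in Sec. III) are ours.

    import tensorflow as tf

    def fake_quantize(x, bits=8, int_bits=1):
        # Straight-through fake quantization onto a signed fixed<bits, int_bits> grid.
        # Forward: round to the nearest multiple of 2^-f and clip to the representable
        # range. Backward: gradients flow through unchanged (straight-through estimator).
        f = bits - int_bits                   # number of fractional bits
        step = 2.0 ** -f                      # quantization step size
        lo = -(2.0 ** (int_bits - 1))         # most negative representable value
        hi = 2.0 ** (int_bits - 1) - step     # most positive representable value
        xq = tf.clip_by_value(tf.round(x / step) * step, lo, hi)
        return x + tf.stop_gradient(xq - x)   # value of xq, gradient of x

    # Example: weights are quantized on the fly during the forward pass of training.
    w = tf.Variable(tf.random.normal([4, 4]))
    with tf.GradientTape() as tape:
        loss = tf.reduce_sum(tf.square(fake_quantize(w, bits=6, int_bits=2)))
    grads = tape.gradient(loss, [w])          # non-zero thanks to the straight-through trick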
To satisfy the latency requirements, neural networks on FPGAs for LHC physics experiments are usually fully unrolled and pipelined: every arithmetic operation is performed by a dedicated component in the circuit without overlapping, maximizing throughput and minimizing latency. To exploit this property, recent studies [10], [11] have suggested that applying varying levels of quantization to different layers could further optimize accuracy against computational cost.
In this paper, we introduce the high-granularity quantization (HGQ) method, which allows models to be trained quantization-aware at arbitrary granularity: in contrast to the QAT library QKeras, where weights and activations are quantized in layer-wise blocks, HGQ enables weights and activations within one layer to have different bitwidths. For a fully unrolled implementation, we can allow every weight and activation to have its own unique bitwidth. We illustrate the key difference between the HGQ method and conventional block-wise quantization methods in Fig. I. Optimizing the quantization parameters at higher granularity allows HGQ to find a better trade-off between model accuracy and resource consumption. Furthermore, by optimizing these
individual bitwidths alongside the network weights using gradient descent, the need to train the network multiple times in search of a favorable quantization bitwidth for each block of the network is also eliminated.
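As a purely conceptual sketch of what "every weight carries its own bitwidth" can look like in code (this is not HGQ's algorithm; the paper's surrogate-gradient construction, which also feeds task-loss gradients into the bitwidths, is described in Sec. III), one can keep a float bitwidth variable per weight and round it with a straight-through estimator. In this toy version the bitwidths are driven only by a differentiable resource penalty, while the weights train through the quantizer as usual; the class and parameter names are our own.

    import tensorflow as tf

    class PerWeightQuantizedDense(tf.keras.layers.Layer):
        # Toy layer: one trainable fractional bitwidth per weight (not the HGQ algorithm).
        def __init__(self, units, resource_weight=1e-5):
            super().__init__()
            self.units = units
            self.resource_weight = resource_weight

        def build(self, input_shape):
            self.w = self.add_weight(name="w", shape=(input_shape[-1], self.units),
                                     initializer="glorot_uniform")
            # Per-weight fractional bitwidths, stored as floats and rounded when used.
            self.fbits = self.add_weight(name="fbits", shape=(input_shape[-1], self.units),
                                         initializer=tf.keras.initializers.Constant(8.0))

        def call(self, x):
            # Round the bitwidths with a straight-through estimator so they stay trainable.
            f = self.fbits + tf.stop_gradient(tf.round(self.fbits) - self.fbits)
            step = 2.0 ** -f
            # Quantize each weight onto its own 2^-f grid (straight-through for the weights).
            wq = self.w + tf.stop_gradient(tf.round(self.w / step) * step - self.w)
            # Differentiable resource penalty: pushes bitwidths down; a weight whose
            # bitwidth reaches zero is effectively pruned.
            self.add_loss(self.resource_weight * tf.reduce_sum(tf.nn.relu(self.fbits)))
            return tf.matmul(x, wq)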
When the multiplication operations in a neural network primarily involve low-bitwidth operands implemented with look-up tables (LUTs), HGQ can substantially reduce on-chip resource consumption by eliminating unnecessary computations without compromising performance. Depending on the specific task, we demonstrate that HGQ can outperform AutoQKeras, achieving resource reduction by up to a factor of 20 and latency improvement by a factor of 5 while preserving accuracy.
A functional HGQ framework has been developed using TensorFlow and Keras, and we have open-sourced it as free software. The Vivado/Vitis FPGA back-end is supported through integration with hls4ml. The library guarantees an exact correspondence between the software and firmware models, provided that no numeric overflow occurs and intermediate values are representable by float32. The work presented in this paper makes the following contributions:
• We present a new algorithm for obtaining surrogate gradients of parameter bitwidths, from both the loss function and the estimated model resource consumption, enabling full gradient-based optimization of bitwidths;
• We enable heterogeneous quantization of a model at arbitrary granularity, up to the per-parameter level, aiming to minimize hardware resource usage while preserving high accuracy. This approach naturally includes sparse pruning of network parameters by setting their bitwidths to zero, further reducing resource cost;
• We have made this work available online as an easy-to-use library called HGQ (https://github.com/calad0i/HGQ), in which simple drop-in replacements of TensorFlow Keras layers make it straightforward for users to transform Keras models into their equivalent deep heterogeneously quantized versions, trained quantization-aware (a usage sketch is given after this list);
• We have added support for quantized HGQ models in the hls4ml library, which converts these pre-trained quantized models into highly parallel FPGA firmware for ultra-low-latency inference;
• Using HGQ in combination with hls4ml ensures exact bit-level accuracy between the HGQ software model and the corresponding firmware model, making the library safe and easy to use for non-experts;
• We propose a new metric called Effective Bit Operations (EBOPs) for a more accurate estimation of on-chip resource consumption;
• We demonstrate a resource reduction of up to 95% and a 5-fold improvement in latency, all while maintaining accuracy compared to other state-of-the-art methods.
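To make the drop-in workflow referenced above concrete, here is a hedged usage sketch. The layer names HQuantize and HDense, the beta argument controlling the resource regularization, and the exact hls4ml conversion entry point are assumptions modeled on the HGQ and hls4ml documentation rather than a verbatim reproduction of their APIs.

    # Hypothetical usage sketch; layer names and arguments are assumptions.
    from tensorflow import keras
    from HGQ.layers import HQuantize, HDense   # assumed drop-in replacements for Keras layers

    model = keras.Sequential([
        HQuantize(beta=3e-6, input_shape=(16,)),       # quantizes the input stream
        HDense(64, activation='relu', beta=3e-6),      # Dense with trainable per-weight bitwidths
        HDense(5, activation='softmax', beta=3e-6),
    ])

    model.compile(optimizer='adam', loss='categorical_crossentropy')
    # model.fit(x_train, y_train, ...)   # trained quantization-aware, as a normal Keras model

    # After training, the model would be handed to hls4ml for firmware generation,
    # e.g. via hls4ml.converters.convert_from_keras_model(...) (details omitted here).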
II. RELATED WORK
Network compression has been shown to be an effective
way to reduce the computational cost of neural networks
on FPGAs. Quantization is a widely adopted method for
compressing deep neural networks (DNNs) to implement them on hardware devices such as FPGAs or ASICs. Previous studies have utilized low-precision quantization, such as binary or ternary, across networks to enhance throughput and reduce latency. Binary quantization restricts weights to α × {−1, 1}, and ternary to α × {−1, 0, 1}, with α a scaling factor. Key examples include DoReFa Net [12], ABC-net [13], Binaryconnect [14], XNOR-net [15], TWN [16], TTQ [17], and [18]. These methods achieve high compression but at the cost of reduced performance compared to standard floating-point networks.
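As a hedged illustration of these two schemes, the snippet below implements binary and ternary weight quantization with the commonly used choice of α as a mean absolute value; the threshold factor in the ternary case is an illustrative heuristic in the spirit of TWN, not a prescription from this paper.

    import numpy as np

    def binarize(w):
        # Binary quantization: w is approximated by alpha * sign(w), alpha = mean(|w|).
        alpha = np.mean(np.abs(w))
        return alpha * np.where(w >= 0, 1.0, -1.0)

    def ternarize(w, t=0.7):
        # Ternary quantization: small weights map to 0, the rest to +/- alpha.
        # The threshold factor t and the alpha estimate are illustrative choices.
        delta = t * np.mean(np.abs(w))                      # magnitude threshold
        mask = np.abs(w) > delta                            # entries kept non-zero
        alpha = np.mean(np.abs(w[mask])) if mask.any() else 0.0
        return alpha * np.sign(w) * mask

    w = np.random.randn(8)
    print(binarize(w))   # values in {-alpha, +alpha}
    print(ternarize(w))  # values in {-alpha, 0, +alpha}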
Building on binary network principles, several studies have moved to multi-bit network designs that represent numbers through binary bases and values, as highlighted in works like [19], [20], [13], [21], [22]. Mix&Match [23], in particular, uses power-of-two bases for better hardware compatibility.
Many studies have investigated heterogeneous quantization
with layer-specific precision to lessen the performance loss due
to quantization. In particular, HAQ [24] utilizes reinforcement learning to find the best bitwidth configuration. HAWQ, HAWQ-V2, PyHessian, and Q-BERT [25], [26], [27], [28] focus on optimizing bitwidths through Hessian-aware techniques. DNAS [29] and AutoQKeras [10] optimize bitwidths and network architecture simultaneously, with DNAS using stochastic sampling from a super network and AutoQKeras employing gradient-free methods such as Gaussian process, Hyperband, and stochastic search for hyperparameter optimization. Similarly,
Meta-ML [30] applies iterative optimization to various hyper-
parameters, including bitwidths, weight pruning, and model
architectures.
Some works, like RVQuant [31], BitsandBytes [32], and
SpQR [33], have investigated heterogeneous quantization down to the sub-layer level, offloading outlier weights to higher-precision formats, primarily to compress large models rather than to obtain significant performance gains on FPGAs. AutoQ [34] utilizes reinforcement learning to
optimize bitwidths for kernel weights and activations. A study
more aligned with ours is the recent FILM-QNN [35], which
optimizes weight and activation precision in a manner con-
ducive to hardware efficiency. It categorizes convolution layer
filters into groups of low and high precision, assigning them
based on anticipated quantization loss for each filter.
Pruning is another technique used to compress neural net-
works, enhancing their speed during hardware inference. This
method involves removing weights that have minimal impact
on the overall accuracy of the network. This concept was
first introduced in [36], and was applied to neural networks
in [37]. Pruning can be categorized as structured, involving
the removal of weights in specific blocks (as in [38], [39],
[40]), or unstructured, targeting individual weights (as in
[41], [42], [43], [44], [45], [40]). In this work, we consider
pruning as a form of quantization where pruned weights are
effectively quantized to zero bits. The QKeras [10] framework, like ours, aims to train and optimize neural networks for deployment on FPGAs. QKeras is developed on top of TensorFlow Keras [46] and leverages hls4ml [47] for FPGA deployment.
Fig. I. Overview of the HGQ method, showing activations (circles) and weights (lines) with thickness indicating bitwidth. Connections are dropped when weight or activation values are constantly zero. Top left: baseline network with high precision throughout. Top right: network quantized layer-wise, e.g., using QKeras. Bottom right: network both quantized layer-wise and pruned. Bottom left: network quantized using HGQ, applying more detailed quantization and assigning high bitwidths only where needed, on a per-weight and activation basis. This approach reduces resource use by maximally utilizing the FPGA's heterogeneous computation.
It specializes in training and optimizing neural networks, allowing for the use of arbitrary-precision fixed-point numbers for both weights and activations. AutoQKeras, a feature within QKeras, enables automatic tuning of the quantization settings for each layer using a gradient-free approach. This can lead to significant compression, including the use of binary or ternary networks [11]. Typically, hls4ml is employed as the backend for deployment on FPGAs. Brevitas [48] serves as the PyTorch [49]
equivalent of QKeras, commonly paired with the FINN and FINN-R frameworks from AMD Research [50], [51] for deploying on AMD FPGAs.
III. HIGH GRANULARITY QUANTIZATION
In this paper, we introduce High Granularity Quantization (HGQ), a novel quantization approach that allows precision to vary down to the level of individual parameters within a single layer, offering the unique capability for each parameter in a network to have its own bitwidth. We begin this section by outlining the fundamentals of quantization and quantization-aware training (QAT). Subsequently, we introduce an innovative gradient-based technique for auto-tuning the quantization bitwidths during training. A comprehensive explanation of the HGQ method and its algorithm follows. This approach is designed to improve the accuracy versus resource/latency balance compared to previously studied block-wise heterogeneous quantization methods for neural networks.
A. Quantization
Quantization is a map, henceforth referred to as f_q, that transforms a real number into a finite set of discrete values, mapping from the set of real numbers R to a discrete subset Q ≡ {q_i | q_{i+1} > q_i} ⊂ R. For hardware efficiency, we ensure
that quantized weights and activations are represented as fixed-
point numbers, a common practice in hardware for numerical
representation. A fixed-point number is essentially an integer
scaled by a predefined factor, typically powers of two. It is
characterized by its bitwidth (total number of bits) and the
number of bits allocated for the integer portion. The inclusion
of the sign bit in the integer part, for signed numbers, varies
by convention. In this context, we adhere to the convention
used in Xilinx
®
Vivado
®
/Vitis
®
HLS, which includes the sign
bit in the integer part if present. Adhering to the standard for
a fixed-point number with
b
∈
N
+
bits, where
i
∈
Z
bits are
dedicated to the integer part. We define
f
as the number of
fractional bits, calculated by
f
≡
b
−
i
. For signed numbers,
the representable range is
[
−
2
i
−
1
,
2
i
−
1
−
2
−
f
]
, with a step
size of
2
−
f
. For unsigned numbers, the range is [0,
2
i
−
2
−
f
],
sharing the same step size.
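To make the format concrete, the short sketch below (plain Python, not part of the HGQ library) evaluates the step size and representable range of a fixed<b, i> value under the convention above.

    def fixed_point_range(b: int, i: int, signed: bool = True):
        # Step size and representable range of fixed<b, i>, with the sign bit
        # counted in the integer part (Vivado/Vitis HLS convention).
        f = b - i                          # number of fractional bits
        step = 2.0 ** -f                   # quantization step, 2^-f
        if signed:
            lo, hi = -(2.0 ** (i - 1)), 2.0 ** (i - 1) - step
        else:
            lo, hi = 0.0, 2.0 ** i - step
        return lo, hi, step

    # fixed<6, 2>, signed: range [-2.0, 1.9375] with a step of 0.0625
    print(fixed_point_range(6, 2))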
One way of quantizing a real number into a fixed-point format, fixed<b,i>, can be expressed by a rounding function as follows:

f_q(x) =