FERMILAB-PUB-24-0213-CMS
CaltechAUTHORS:10.7907/hq8jd-rhg30
Gradient-based Automatic Mixed Precision Quantization
for Neural Networks On-Chip
Chang Sun,1,2,∗ Thea K. Årrestad,1 Vladimir Loncar,3,4 Jennifer Ngadiuba,5 and Maria Spiropulu2
1ETH Zurich (Zurich, Switzerland)
2California Institute of Technology (CA, USA)
3Massachusetts Institute of Technology (MA, USA)
4Institute of Physics Belgrade (Belgrade, Serbia)
5Fermi National Accelerator Laboratory (IL, USA)
∗E-mail: chsun@cern.ch
Model size and inference speed at deployment time are major challenges in many deep learning applications. A promising strategy to overcome these challenges is quantization. However, a straightforward uniform quantization to very low precision can result in significant accuracy loss. Mixed-precision quantization, based on the idea that certain parts of the network can accommodate lower precision than others without compromising performance, offers a potential solution. In this work, we present High Granularity Quantization (HGQ), an innovative quantization-aware training method that fine-tunes the per-weight and per-activation precision by making them optimizable through gradient descent. This approach enables ultra-low latency and low power neural networks on hardware capable of performing arithmetic operations with an arbitrary number of bits, such as FPGAs and ASICs. We demonstrate that HGQ can outperform existing methods by a substantial margin, achieving resource reduction by up to a factor of 20 and latency improvement by a factor of 5 while preserving accuracy.
I. INTRODUCTION
Edge computing has significantly increased the impor-
tance of real-time deep neural network inference on spe-
cialized hardware [1]. While the typical latency thresh-
old for real-time inference applications is $\mathcal{O}(1)$ ms [2–4],
certain domains require sub-microsecond inference times.
At the CERN Large Hadron Collider (LHC) [5], detec-
tors generate hundreds of terabytes of data every second
from proton-proton collisions occurring every 25 nanosec-
onds. This enormous data throughput is reduced by
the trigger, a hardware system filtering data in real-
time at the same rate. This detector subsystem de-
termines the fate of each collision event – whether it
should be preserved for offline processing or discarded –
with a decision-making latency ceiling at a few microsec-
onds [6, 7]. The trigger’s accuracy is vital to retain only
the interesting events for physics studies, thereby man-
aging the downstream bandwidth effectively by reducing
the data rate by two orders of magnitude. The system
consists of $\mathcal{O}(1000)$ field-programmable gate arrays (FPGAs), where several algorithms are running in parallel
on each FPGA. As a result, resources are scarce and the
spatial complexity of each algorithm needs to be mini-
mal. In anticipation of the LHC’s upgrade to the High
Luminosity-LHC (HL-LHC) [8], which will increase data
rates and complexity by a factor of 1-2, machine learn-
ing techniques are being actively explored to enhance the
speed and accuracy of the algorithms in the future trigger
system [6, 7]. However, integrating demanding models
under such strict resource and latency constraints with-
out compromising performance is a hurdle. To satisfy
the latency requirements, neural networks on FPGAs
for LHC physics experiments are usually fully unrolled
– all arithmetic operations are done by different compo-
nents in the circuit without overlapping – and pipelined
to minimize the latency and maximize the throughput
at the cost of higher resource consumption. Efforts in
recent years have focused on algorithmic efficiency, with
strategies ranging from the design of compact networks
to weight pruning and quantization [9, 10].
Quantization is a model compression technique that
converts model parameters into lower-precision formats,
resulting in some performance degradation in exchange
for a smaller model size and/or faster inference. To quan-
tize a neural network, one can either reduce the pre-
cision of its parameters after training or train the net-
work directly with low precision parameters. These two
approaches are referred to as post-training quantization
(PTQ) and quantization-aware training (QAT), respec-
tively. While PTQ is computationally cheaper to perform
in general, it usually induces a more significant loss in
performance compared to QAT under the same compres-
sion ratio. To achieve the best possible trade-off between
model performance and resource consumption, we follow
the QAT approach.
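To make the QAT forward pass concrete, below is a minimal sketch (our illustration, not the HGQ implementation; the helper name and the fixed bitwidths are assumptions) of the widely used "fake quantization" trick with a straight-through estimator (STE): values are rounded to a fixed-point grid in the forward pass, while gradients bypass the non-differentiable rounding.

import tensorflow as tf

def fake_quantize(x, bits=8, int_bits=1):
    """Illustrative fixed-point 'fake' quantization with a straight-through
    estimator: the forward pass snaps x to the quantization grid, while the
    backward pass treats the operation as the identity."""
    frac_bits = bits - int_bits
    scale = 2.0 ** frac_bits
    q = tf.round(x * scale) / scale                       # snap to the grid
    q = tf.clip_by_value(q, -2.0 ** (int_bits - 1),       # signed representable range
                         2.0 ** (int_bits - 1) - 1.0 / scale)
    return x + tf.stop_gradient(q - x)                    # STE: identity gradient

During QAT, weights and activations pass through such a quantizer in the forward pass so that training sees the rounding error, while gradient descent still updates the underlying full-precision values. HGQ builds on this idea but, in addition, makes the bitwidths themselves subject to gradient-based optimization, as introduced next.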
In this work, we introduce high-granularity quanti-
zation (HGQ), a novel QAT method that optimizes
the quantization bitwidths during training using gra-
dients, which enables models to be quantized at arbi-
trary granularity. In contrast to existing methods, where
bitwidths for network parameters are optimized in pre-
defined, structured blocks, HGQ provides more granular
control over which parameters share the same bitwidth.
For models deployed with fully unrolled implementations
like the ones used in the trigger systems, every param-
eter in the network may have its unique bitwidth. We
illustrate the key difference between the HGQ method
and the conventional block-wise quantization methods in
Figure I. Optimizing the bitwidths at higher granularity
allows HGQ to find better trade-offs between the model
performance and resource consumption. Furthermore, by
optimizing these individual bitwidths alongside the net-
work using gradient descent, the need for including the
bitwidths as hyperparameters to be optimized with iter-
ative training is eliminated. Depending on the specific
task, we demonstrate that HGQ has the potential to out-
perform other model compression methods and achieve
resource reduction by up to a factor of 20, and latency
improvement by a factor of 5 while preserving the model
performance.
A functional HGQ library has been developed with
TensorFlow [11] and Keras [12], and we have released it as free and open-source software. The Vivado/Vitis® FPGA back-ends are supported through integration with hls4ml [13] – a software tool designed to facilitate the conversion of machine learning algorithms into hardware designs, which is specifically optimized for ultra-low latency deployment on FPGAs and application-specific integrated circuits (ASICs) [14] through High-Level Synthesis (HLS). The HGQ library guarantees an exact corre-
spondence between the software and firmware models,
provided that no numeric overflow occurs and intermedi-
ate values are exactly representable by the floating-point
datatype used in emulation.
The work presented here makes the following contri-
butions:
• We present a new algorithm for obtaining surrogate gradients for the quantization bitwidths, derived from both the loss function and the estimated model resource consumption, enabling full gradient-based optimization of bitwidths;
• We propose a new metric named Effective Bit Operations (EBOPs) for accurate estimation of a model's on-chip resource consumption;
• We enable heterogeneous quantization of a specific model at arbitrary granularity up to per-parameter level, aiming to minimize hardware resource usage while preserving the performance. This approach automatically includes sparse pruning of the network parameters as their bitwidths reach zero;
• We have made the HGQ library easily accessible online [15] and user-friendly: a simple drop-in replacement of the Keras layers makes it straightforward for users to transform Keras models into their corresponding heterogeneously quantized versions (see the sketch following this list);
• We have added support for HGQ-trained models in the hls4ml tool, which converts these pre-trained quantized models into highly parallel FPGA firmware with HLS. We ensure bit-level consistency between the software model and the corresponding firmware, making the library safe and easy to use for non-experts;
• Compared to other state-of-the-art model compression methods targeting ultra-low latency applications, we demonstrate a resource reduction of up to 95% and an improvement in latency of up to 5-fold by using HGQ, all while maintaining accuracy.
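As a minimal sketch of the drop-in workflow described above (the module path, the layer names HQuantize and HDense, and the beta resource-penalty argument are assumptions for illustration rather than a definitive API reference; consult the library documentation [15] for the exact interface):

import keras
from HGQ.layers import HQuantize, HDense  # assumed import path and layer names

# A small fully-connected model in which every weight and activation carries a
# trainable bitwidth; beta (assumed name) weighs the estimated resource
# consumption against the task loss.
model = keras.models.Sequential([
    HQuantize(beta=3e-6, input_shape=(16,)),    # quantize the model input
    HDense(64, activation='relu', beta=3e-6),
    HDense(32, activation='relu', beta=3e-6),
    HDense(5, activation='softmax', beta=3e-6),
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(...) then proceeds as for any Keras model; the trained, quantized
# model can subsequently be converted to FPGA firmware through the hls4ml
# integration described above.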
II. RELATED WORK
Quantization is a widely adopted method for com-
pressing deep neural networks (DNNs) for implement-
ing them on specialized hardware devices such as FP-
GAs or ASICs.
Previous studies have utilized ex-
tremely low precision quantization, such as binary or
ternary, across networks to enhance throughput and re-
duce latency. Binary quantization restricts parameters
to $\alpha \times \{-1, 1\}$ (or $\alpha \times \{0, 1\}$ in some conventions), and ternary to $\alpha \times \{-1, 0, 1\}$, with $\alpha$ being a relatively high-precision scaling factor. Key examples include DoReFa-Net [16], ABC-net [17], BinaryConnect [18], XNOR-Net [19], TWN [20], TTQ [21], and [22]. While these
methods could achieve high model compression ratios,
they come at the cost of substantially reduced model
performance compared to the corresponding floating-point
baselines. Using the same principles as binary networks,
several studies have moved to multi-bit network designs
that represent weights through binary bases and values,
highlighted in works like [17, 23–27].
Many studies have investigated heterogeneous quan-
tization with layer/channel-specific precision to lessen
the performance loss due to quantization. In particular,
HAQ [28] and AutoQ [29] utilize reinforcement learning
to find optimal bitwidth configurations. HAWQ, HAWQ-
V2, PyHessian, Q-BERT, and OBQ [30–34] focus on
optimizing bitwidths with the second-order approxima-
tions of the loss function around the unquantized optimal
weights. DNAS [35] and AutoQKeras [36] optimize the
bitwidths and network architecture simultaneously. In
particular, DNAS uses stochastic sampling to obtain a
subnetwork from a super network, and AutoQKeras em-
ploys gradient-free methods like Gaussian Process, Hy-
perband, and stochastic search for hyperparameter op-
timization. Similarly, Meta-ML [37] applies iterative
optimizations to various hyperparameters, including the
bitwidths, weight pruning strategy, and model architec-
ture. FILM-QNN [38] optimizes weight and activation
bitwidths in a manner conducive to hardware efficiency
for convolutional neural networks. For each convolutional
layer, it categorizes filters into two groups of lower and
higher precision based on the anticipated loss of perfor-
mance due to quantizing each filter, and arranges them
to utilize the on-board multipliers on FPGAs efficiently.
[Figure – panel labels: Quantization | HGQ | Pruning | Quantization + Pruning]
FIG. I. An illustration of the HGQ method on a dense network. Activations and weights of the network are shown as circles and lines, with the thickness indicating the corresponding bitwidth. A line/circle is dropped when the corresponding value is pruned. Top left: baseline network with high precision throughout. Top right: a layer-wise heterogeneously quantized network, e.g., trained with QKeras. Bottom right: a network that is both layer-wise heterogeneously quantized and pruned in an unstructured manner. Bottom left: a network trained with HGQ at maximum granularity: each weight and activation has its unique bitwidth. When a bitwidth reaches zero, the corresponding value is effectively pruned.
Heterogeneous quantization at sub-layer/channel granularity is also studied by other works. RVQuant [39],
BitsandBytes [40], SpQR [41], and SqueezeLLM [42] offload a small fraction of outlier weights to higher precision formats to mitigate the performance degradation due to quantization. These works primarily target weight size reduction for larger models, rather than efficient inference on specialized hardware like FPGAs or ASICs.
Pruning is another technique used to compress neural
networks. It involves the removal of weights in a network
that have minimal impact on the overall performance.
This concept was first introduced in [43], and first applied
to neural networks in [44]. The removal of weights is
sometimes formulated via pruning masks – binary-valued
tensors that are multiplied with the weights to zero out
the pruned ones.
Depending on how the pruned weights are arranged,
pruning can be categorized as structured or unstruc-
tured pruning. Structured pruning removes weights in
specific blocks or following certain patterns, usually in
a hardware-friendly manner to speed up the inference,
as in [45–50]. On the other hand, unstructured pruning
targets individual weights for the best compression ratio,
as in [51–55]. Semi-structured pruning targeting specific
hardware accelerators also exists, as in [50, 56].
On the methodology side, [34, 44, 48, 50] use the Hessian of the
loss function for determining the optimal pruning mask
and weight compensations post-training. [47] formulates
the pruning mask creation as an optimal transport prob-
lem and then relaxes it to be differentiable for training.
[49, 55] directly use trainable pruning masks that are op-
timized along with the weights. [52, 53] iteratively remove weights with small magnitudes during training. [54] optimizes the pruning mask by solving a constrained optimization problem during training with the stochastic
Frank-Wolfe algorithm. With a similar objective to this
work, [45] solves the post-training pruning problem with
constraint programming to reduce the network’s on-chip
resource usage.
In this work, we consider pruning as a special form
of quantization, where the pruned weights are quantized
with zero bits. In this way, pruning is automatically done
by optimizing the quantization bitwidths during training.
Closely related to this work, the QKeras [36] framework aims to train and optimize neural networks for deployment on FPGAs and ASICs. QKeras is developed on top of Keras and leverages hls4ml [57] for hardware deployment. It enables training and optimization of neural networks with hardware-friendly fixed-point numbers for both weights and activations. AutoQKeras, a feature within QKeras, enables automatic adjustment of the quantization settings for each layer using gradient-free approaches.
Brevitas [58] serves as the PyTorch [59] equivalent of QKeras and is commonly used together with the FINN and FINN-R frameworks from Xilinx Research [60, 61] for deployment on AMD® FPGAs.
III. HIGH GRANULARITY QUANTIZATION
In this work, we introduce High Granularity Quanti-
zation (HGQ), a novel quantization approach with the
unique capability of optimizing the bitwidths in a quan-
tized neural network at arbitrary fine granularity – up
to per-parameter level. At the same time, it provides an
accurate on-chip resource usage estimation, and simulta-
neously optimizes the accuracy and resource usage of the
network in a hardware/software co-design fashion. We
begin this section by outlining the fundamentals of quan-
tization and quantization-aware training. Then, we in-
troduce a way to accurately estimate the on-chip resource
consumption of a model. Subsequently, we introduce an
innovative gradient-based technique for auto-tuning the
bitwidths during training. A comprehensive explanation
of the HGQ method and its algorithm follows.
A. Quantization
Quantization is a map, henceforth referred to as $f_q$, from the set of real numbers $\mathbb{R}$ to a discrete subset $\mathbb{Q} \equiv \{q_i \mid q_{i+1} > q_i\} \subset \mathbb{R}$. For hardware efficiency, we ensure
that quantized weights and activations are represented
as fixed-point numbers, a common practice in hardware
for numerical representation. A fixed-point number can
be understood as an integer scaled by a power of two. It is characterized by its bitwidth (total number
of bits) and the number of bits allocated for the integer
part. The inclusion of the sign bit in the integer part for
signed numbers varies by convention. In this context, we
adhere to the convention used in AMD® Vivado/Vitis® HLS, which includes the sign bit in the integer part if present. We denote the bitwidth as $b \in \mathbb{N}^+$, with $i \in \mathbb{Z}$ bits dedicated to the integer part, and define $f \equiv b - i$ as the number of fractional bits. For a signed fixed-point number, the representable range is $[-2^{i-1},\ 2^{i-1} - 2^{-f}]$ with a step size of $2^{-f}$. For an unsigned fixed-point number, the range is $[0,\ 2^{i} - 2^{-f}]$ with the same step size.
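As a numerical cross-check of these ranges (a short helper of our own for illustration; it is not part of the HGQ or hls4ml APIs), the representable bounds and step size of a fixed<b,i> number under this convention can be computed as follows:

import numpy as np

def fixed_point_range(b: int, i: int, signed: bool = True):
    """Range and step of a fixed<b,i> number, with the sign bit counted in the
    integer part as in Vivado/Vitis HLS. Illustrative helper only."""
    f = b - i                          # number of fractional bits
    step = 2.0 ** -f
    if signed:
        lo, hi = -2.0 ** (i - 1), 2.0 ** (i - 1) - step
    else:
        lo, hi = 0.0, 2.0 ** i - step
    return lo, hi, step

# Example: signed fixed<4,2> covers [-2.0, 1.75] in steps of 0.25.
lo, hi, step = fixed_point_range(4, 2)
grid = np.arange(lo, hi + step, step)   # all 2**4 = 16 representable values
print(lo, hi, step, len(grid))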
One way of quantizing a real number into a signed fixed-point number, fixed<b,i>, can be expressed as $f_q(x) =$