Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip

Chang Sun*¹,², Thea K. Årrestad¹, Vladimir Loncar³,⁴, Jennifer Ngadiuba⁵, Maria Spiropulu²
¹ ETH Zürich, Zürich, Switzerland
² California Institute of Technology, Pasadena, CA, USA
³ Massachusetts Institute of Technology, Cambridge, MA, USA
⁴ Institute of Physics Belgrade, Serbia
⁵ Fermi National Accelerator Laboratory, Batavia, IL, USA
Email: (chang.sun, thea.aarrestad, vladimir.loncar, jennifer.ngadiuba, maria.spiropulu)@cern.ch
*: Corresponding author
Abstract—Model size and inference speed at deployment time are major challenges in many deep learning applications. A promising strategy to overcome these challenges is quantization. However, a straightforward uniform quantization to very low precision can result in significant accuracy loss. Mixed-precision quantization, based on the idea that certain parts of the network can accommodate lower precision without compromising performance compared to other parts, offers a potential solution. In this work, we present High Granularity Quantization (HGQ), an innovative quantization-aware training method that fine-tunes the per-weight and per-activation precision by making them optimizable through gradient descent. This approach enables ultra-low latency and low power neural networks on hardware capable of performing arithmetic operations with an arbitrary number of bits, such as FPGAs and ASICs. We demonstrate that HGQ can outperform existing methods by a substantial margin, achieving resource reduction by up to a factor of 20 and latency improvement by a factor of 5 while preserving accuracy.
I. INTRODUCTION

Edge computing has significantly increased the importance of real-time deep neural network inference on specialized hardware [1]. While the typical latency threshold for real-time inference applications is O(1) ms [2], [3], [4], certain domains require sub-microsecond inference times. At the CERN Large Hadron Collider (LHC) [5], detectors generate hundreds of terabytes of data every second from proton-proton collisions occurring every 25 nanoseconds. This enormous data throughput is reduced by the trigger, a hardware system filtering data in real-time at the same rate. This detector subsystem determines the fate of each collision event – whether it should be preserved for offline processing or discarded – with a decision-making latency ceiling at a few microseconds [6], [7]. The trigger's accuracy is vital to retain only the interesting events for physics studies, thereby managing the downstream bandwidth effectively by reducing the data rate by two orders of magnitude. The system consists of O(1000) field-programmable gate arrays (FPGAs), where several algorithms run in parallel on each FPGA. As a result, resources are scarce and the spatial complexity of each algorithm needs to be minimal. In anticipation of the LHC's upgrade to the High Luminosity-LHC (HL-LHC) [8], which will increase data rates and complexity by a factor of 1-2, machine learning techniques are being actively explored to enhance the speed and accuracy of the algorithms in the future trigger system [6], [7]. However, integrating demanding models under such strict resource and latency constraints without compromising performance is a hurdle. To satisfy the latency requirements, neural networks on FPGAs for LHC physics experiments are usually fully unrolled – all arithmetic operations are done by different components in the circuit without overlapping – and pipelined to minimize the latency and maximize the throughput at the cost of higher resource consumption. Efforts in recent years have focused on algorithmic efficiency, with strategies ranging from the design of compact networks to weight pruning and quantization [9], [10].
Quantization is a model compression technique that converts model parameters into lower-precision formats, resulting in some performance degradation in exchange for a smaller model size and/or faster inference. To quantize a neural network, one can either reduce the precision of its parameters after training or train the network directly with low precision parameters. These two approaches are referred to as post-training quantization (PTQ) and quantization-aware training (QAT), respectively. While PTQ is computationally cheaper to perform in general, it usually induces a more significant loss in performance compared to QAT under the same compression ratio. To aim for the best possible trade-off between model performance and resource consumption, we follow the QAT approach.
In this work, we introduce high-granularity quantization (HGQ), a novel QAT method that optimizes the quantization bitwidths during training using gradients, which enables models to be quantized at arbitrary granularity. In contrast to existing methods, where bitwidths for network parameters are optimized in predefined, structured blocks, HGQ provides more granular control over which parameters share the same bitwidth. For models deployed with fully unrolled implementations like the ones used in the trigger systems, every parameter in the network may have its unique bitwidth. We illustrate the key difference between the HGQ method and conventional block-wise quantization methods in Figure I. Optimizing the bitwidths at higher granularity allows HGQ to find better trade-offs between the model performance and resource consumption. Furthermore, by optimizing these individual bitwidths alongside the network using gradient descent, the need to include the bitwidths as hyperparameters to be optimized with iterative training runs is eliminated. Depending on the specific task, we demonstrate that HGQ has the potential to outperform other model compression methods and achieve resource reduction by up to a factor of 20, and latency improvement by a factor of 5, while preserving the model performance.
A functional HGQ library has been developed with TensorFlow [11] and Keras [12], and we have released it as free and open-source software. The Vivado/Vitis® FPGA back-ends are supported through integration with hls4ml [13] – a software tool designed to facilitate the conversion of machine learning algorithms into hardware designs, which is specifically optimized for ultra-low latency deployment on FPGAs and application-specific integrated circuits (ASICs)¹ through High-Level Synthesis (HLS). The HGQ library guarantees an exact correspondence between the software and firmware models, provided that no numeric overflow occurs and intermediate values are exactly representable by the floating-point datatype used in emulation.
The work presented here makes the following contributions:
• We present a new algorithm for obtaining surrogate gradients for the quantization bitwidths, derived from both the loss function and the estimated model resource consumption, enabling full gradient-based optimization of bitwidths;
• We propose a new metric named Effective Bit Operations (EBOPs) for accurate estimation of a model's on-chip resource consumption;
• We enable heterogeneous quantization of a specific model at arbitrary granularity up to the per-parameter level, aiming to minimize hardware resource usage while preserving the performance. This approach automatically includes sparse pruning of the network parameters as their bitwidths reach zero;
• We have made the HGQ library easily accessible online² and user-friendly: a simple drop-in replacement of the Keras layers makes it straightforward for users to transform Keras models into their corresponding heterogeneously quantized versions;
• We have added support for HGQ-trained models in the hls4ml tool, which converts these pre-trained quantized models into highly-parallel FPGA firmware with HLS. We ensure bit-level consistency between the software model and the corresponding firmware, making the library safe and easy to use for non-experts;
• Compared to other state-of-the-art model compression methods targeting ultra-low latency applications, we demonstrate a resource reduction of up to 95% and an improvement in latency of up to 5-fold by using HGQ, all while maintaining accuracy.

¹ https://github.com/fastmachinelearning/hls4ml
² https://github.com/calad0i/HGQ
II. RELATED WORK

Quantization is a widely adopted method for compressing deep neural networks (DNNs) for implementing them on specialized hardware devices such as FPGAs or ASICs. Previous studies have utilized extremely low precision quantization, such as binary or ternary, across networks to enhance throughput and reduce latency. Binary quantization restricts parameters to α × {−1, 1} (or α × {0, 1} in some conventions), and ternary to α × {−1, 0, 1}, with α being a relatively high-precision scaling factor. Key examples include DoReFa Net [14], ABC-net [15], Binaryconnect [16], XNOR-net [17], TWN [18], TTQ [19], and [20]. While these methods can achieve high model compression ratios, they come at the cost of substantially reduced model performance compared to the corresponding floating-point baselines. Using the same principles as binary networks, several studies have moved to multi-bit network designs that represent weights through binary bases and values, highlighted in works like [21], [22], [15], [23], [24], [25].
Many studies have investigated heterogeneous quantization with layer/channel-specific precision to lessen the performance loss due to quantization. In particular, HAQ [26] and AutoQ [27] utilize reinforcement learning to find optimal bitwidth configurations. HAWQ, HAWQ-V2, PyHessian, Q-BERT, and OBQ [28], [29], [30], [31], [32] focus on optimizing bitwidths with second-order approximations of the loss function around the unquantized optimal weights. DNAS [33] and AutoQKeras [34] optimize the bitwidths and network architecture simultaneously. In particular, DNAS uses stochastic sampling to obtain a subnetwork from a super network, and AutoQKeras employs gradient-free methods like Gaussian Process, Hyperband, and stochastic search for hyperparameter optimization. Similarly, Meta-ML [35] applies iterative optimizations to various hyperparameters, including the bitwidths, weight pruning strategy, and model architecture. FILM-QNN [36] optimizes weight and activation bitwidths in a manner conducive to hardware efficiency for convolutional neural networks. For each convolutional layer, it categorizes filters into two groups of lower and higher precision based on the anticipated loss of performance due to quantizing each filter, and arranges them to utilize the on-board multipliers on FPGAs efficiently.
Heterogeneous quantization at sub-layer/channel granularity has also been studied in other works. RVQuant [37], BitsandBytes [38], SpQR [39], and SqueezeLLM [40] offload a small fraction of outlier weights to higher precision formats to mitigate the performance degradation due to quantization. These works primarily target weight size reduction of larger models, rather than efficient inference on specialized hardware like FPGAs or ASICs.
Fig. I. An illustration of the HGQ method on a dense network. Activations and weights of the network are shown as circles and lines, with the thickness indicating the corresponding bitwidth. A line/circle is dropped when the corresponding value is pruned. Top left: baseline network with high precision throughout. Top right: a layer-wise heterogeneously quantized network, e.g., trained with QKeras. Bottom right: a network that is both layer-wise heterogeneously quantized and unstructured pruned. Bottom left: a network trained with HGQ with maximum granularity: each weight and activation has its unique bitwidth. When a bitwidth reaches zero, the corresponding value is effectively pruned.

Pruning is another technique used to compress neural networks. It involves the removal of weights in a network that have minimal impact on the overall performance. This concept was first introduced in [41], and first applied to neural networks in [42]. The removal of weights is sometimes
formulated by pruning masks – binary-valued tensors that are multiplied with the weights to zero out the pruned ones. Depending on how the pruned weights are arranged, pruning can be categorized as structured or unstructured pruning. Structured pruning removes weights in specific blocks or following certain patterns, usually in a hardware-friendly manner to speed up the inference, as in [43], [44], [45], [46], [47], [48]. On the other hand, unstructured pruning targets individual weights for the best compression ratio, as in [49], [50], [51], [52], [53]. Semi-structured pruning targeting specific hardware accelerators also exists, as in [54], [48].

On the methodology side, [42], [32], [48], [46] use the Hessian of the loss function to determine the optimal pruning mask and weight compensations post-training. [45] formulates the pruning mask creation as an optimal transport problem and then relaxes it to be differentiable for training. [47], [53] directly use trainable pruning masks that are optimized along with the weights. [51], [50] iteratively remove weights with small magnitudes during training. [52] optimizes the pruning mask by solving it as a constrained optimization problem during training with the stochastic Frank-Wolfe algorithm. With a similar objective to this work, [43] solves the post-training pruning problem with constraint programming to reduce the network's on-chip resource usage.

In this work, we consider pruning as a special form of quantization, where the pruned weights are quantized with zero bits. In this way, pruning is automatically done by optimizing the quantization bitwidths during training.
Closely related to this work, the QKeras [34] framework aims to train and optimize neural networks for deployment on FPGAs and ASICs. QKeras is developed on top of Keras and leverages hls4ml [55] for hardware deployment. It enables training and optimization of neural networks with hardware-friendly fixed-point numbers for both weights and activations. AutoQKeras, a feature within QKeras, enables automatic adjustment of quantization settings for each layer using gradient-free approaches. Brevitas [56] serves as the PyTorch [57] equivalent of QKeras, and is commonly used in pair with the FINN and FINN-R frameworks from Xilinx Research [58], [59] for deployment on AMD® FPGAs.
III. HIGH GRANULARITY QUANTIZATION

In this work, we introduce High Granularity Quantization (HGQ), a novel quantization approach with the unique capability of optimizing the bitwidths in a quantized neural network at arbitrarily fine granularity – up to the per-parameter level. At the same time, it provides an accurate on-chip resource usage estimation, and simultaneously optimizes the accuracy and resource usage of the network in a hardware/software co-design fashion. We begin this section by outlining the fundamentals of quantization and quantization-aware training. Then, we introduce a way to accurately estimate the on-chip resource consumption of a model. Subsequently, we introduce an innovative gradient-based technique for auto-tuning the bitwidths during training. A comprehensive explanation of the HGQ method and its algorithm follows.
A. Quantization

Quantization is a map, henceforth referred to as f_q, from the set of real numbers ℝ to a discrete subset ℚ ≡ {q_i | q_{i+1} > q_i} ⊂ ℝ. For hardware efficiency, we ensure that quantized weights and activations are represented as fixed-point numbers, a common practice in hardware for numerical representation. A fixed-point number can be understood as an integer scaled by a power of two. It is characterized by its bitwidth (total number of bits) and the number of bits allocated for the integer part. The inclusion of the sign bit in the integer part for signed numbers varies by convention. In this context, we adhere to the convention used in AMD® Vivado/Vitis® HLS, which includes the sign bit in the integer part if present. We denote the bitwidth by b ∈ ℕ⁺, with i ∈ ℤ bits dedicated to the integer part, and define f ≡ b − i as the number of fractional bits. For a signed fixed-point number, its representable range is [−2^{i−1}, 2^{i−1} − 2^{−f}] with a step size of 2^{−f}. For an unsigned fixed-point number, the range is [0, 2^{i} − 2^{−f}] with the same step size.

One way of quantizing a real number into a signed fixed-point number, fixed<b,i>, can be expressed as
$$
f_q(x) = \Big(\big([x \cdot 2^f] + 2^{b-1} \bmod 2^b\big) - 2^{b-1}\Big)\cdot 2^{-f}
= \begin{cases}
[x \cdot 2^f]\cdot 2^{-f}, & \text{if } x \in [-2^{i-1},\, 2^{i-1} - 2^{-f}]\\
\text{overflow}, & \text{otherwise,}
\end{cases}
\tag{1}
$$

where [x] ≡ ⌊x + ε⌋ with some ε ∈ [0, 1) and f ≡ b − i. Note that setting ε = 1/2 recovers conventional round-to-nearest rounding with midpoint round-up. Similarly to the signed case, for an unsigned fixed-point number denoted as ufixed<b,i>, a quantization procedure can be expressed as

$$
f_q(x) = \big([x \cdot 2^f] \bmod 2^b\big)\cdot 2^{-f}
= \begin{cases}
[x \cdot 2^f]\cdot 2^{-f}, & \text{if } x \in [0,\, 2^{i} - 2^{-f}]\\
\text{overflow}, & \text{otherwise.}
\end{cases}
\tag{2}
$$
In Eq. (1) and (2), "overflow" refers to the case where the value to be quantized exceeds the representable range of the fixed-point number, which causes a cyclical wrap of the number to the opposite end of the range. Although a quantization function could be designed to adjust values outside the permissible range to the closest valid value (i.e., clipping them into the range), this approach is avoided in our work to reduce resource and latency overhead. Instead, by selecting an optimal set of quantization parameters, we ensure that all numbers produced during inference fall into the representable range to avoid overflow.
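To make Eqs. (1) and (2) concrete, the following NumPy sketch implements the two quantizers, including the wrap-around behavior on overflow. It is an illustrative reference implementation written for this text, not code taken from the HGQ library.

import numpy as np

def quantize_signed(x, b, i, eps=0.5):
    """Quantize x to fixed<b,i> per Eq. (1): round with offset eps, wrap on overflow."""
    f = b - i                                              # number of fractional bits
    n = np.floor(x * 2.0**f + eps)                         # [x * 2^f] with rounding offset eps
    n = np.mod(n + 2.0**(b - 1), 2.0**b) - 2.0**(b - 1)    # wrap into [-2^(b-1), 2^(b-1))
    return n * 2.0**-f

def quantize_unsigned(x, b, i, eps=0.5):
    """Quantize x to ufixed<b,i> per Eq. (2)."""
    f = b - i
    n = np.mod(np.floor(x * 2.0**f + eps), 2.0**b)         # wrap into [0, 2^b)
    return n * 2.0**-f

# Example: fixed<4,2> covers [-2, 1.75] in steps of 0.25.
print(quantize_signed(np.array([0.3, 1.9, -2.2]), b=4, i=2))  # [0.25, -2.0 (wrapped), 1.75 (wrapped)]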
In our approach, we track only the number of fractional bits f of the fixed-point numbers during training for quantization. Before deploying to hardware, we estimate the required number of integer bits i to avoid overflow. This task is trivial for weights, as their values are fixed after training. For intermediate accumulator and activation values, we employ a calibration dataset to gauge the extremes (both maximum and minimum) the values might assume. This process involves running the dataset through the network and logging the extreme quantized values (v^q_min, v^q_max), from which we can determine the necessary integer bitwidth without the sign bit, i′, using

$$
i' = \max\!\big(\lfloor \log_2 |v^q_\mathrm{max}| \rfloor + 1,\; \lceil \log_2 |v^q_\mathrm{min}| \rceil\big),
\tag{3}
$$

and obtain the integer bitwidth i by adding back the sign bit when necessary: i = i′ + 1 for signed fixed-point numbers, and i = i′ for unsigned fixed-point numbers.
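As an illustration of Eq. (3), the helper below derives the integer bitwidth from the quantized calibration extremes; the function and its guards against non-positive arguments of the logarithm are our own additions for this sketch.

import numpy as np

def integer_bits(v_q_min, v_q_max):
    """Integer bitwidth per Eq. (3): i' from the quantized calibration extremes,
    plus one sign bit when negative values occur."""
    terms = []
    if v_q_max > 0:
        terms.append(int(np.floor(np.log2(abs(v_q_max)))) + 1)
    if v_q_min < 0:
        terms.append(int(np.ceil(np.log2(abs(v_q_min)))))
    i_prime = max(terms) if terms else 0   # all-zero values need no integer bits
    return i_prime + 1 if v_q_min < 0 else i_prime

# Example: extremes (-3.5, 5.0) -> i' = max(floor(log2 5) + 1, ceil(log2 3.5)) = 3; signed -> i = 4.
print(integer_bits(-3.5, 5.0))   # 4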
By ensuring that the calibration dataset accurately reflects the input data distribution the network will encounter after deployment, we can avoid overflows at inference time. For extra safety, one may add extra margins to the computed ranges to account for potential outliers in the input data. This method thus eliminates the need to consider the representable ranges of the given quantizer during the training phase, and the quantization function during training can now be expressed as:
$$
f_q(x) = [x\cdot 2^f]\cdot 2^{-f} = \lfloor (x + \varepsilon)\cdot 2^f \rfloor\cdot 2^{-f}.
\tag{4}
$$
Without loss of generality, we assume ε = 1/2 for the rest of this section and recover the conventional midpoint round-up rounding. This assumption will not affect any of the conclusions drawn in this work.
B. Quantization-Aware Training

Quantization-aware training (QAT) trains neural networks by applying quantization directly during the training phase. Previous works, e.g. [34], demonstrate that QAT significantly mitigates the performance degradation caused by post-training quantization. In this work, we adopt the same QAT scheme utilized in [34] for our HGQ method. Specifically, we employ the straight-through estimator (STE) [60] for the quantization of weights and activations, which quantizes the values during the forward pass while acting as an identity for computing the gradients in the backward pass.
C. FPGA resource consumption estimation

A common metric for estimating on-chip resource usage in FPGAs is Bit Operations (BOPs), proposed in [61]. BOPs quantify the resource consumption by counting the number of bits involved in all operations performed during the network's forward pass. For two numbers declared with bitwidths b_i and b_j, the number of BOPs is b_i · b_j for a multiplication operation, and the resultant number's bitwidth for an addition operation. While BOPs can be a good resource indicator in many cases, it falls short in accurately reflecting resource consumption for unrolled neural networks on specialized hardware. The major discrepancies arise from the following two points:

1) Declaring a constant as a fixed-point number of b bits does not necessarily mean that all b bits are used. For instance, a weight of 0.5 in an 8-bit fixed-point format only uses 1 bit instead of 8 bits, and counting it as 8 bits in the BOPs computation leads to an inaccurate resource usage estimation.
2) BOPs tend to overestimate the resource consumption of accumulation operations compared to multiplications. Generally, most of the multiplication operations in neural networks are between a fixed constant and a variable as part of vector dot products. Consider a single multiplication involving two numbers of b_i and b_j bits, where the first number is a constant: when unrolled, this operation is often decomposed on hardware into an accumulation of ∼(b_i − 1) shifted numbers, each of b_j bits. By the BOPs definition, this would be counted as approximately b_j · (b_i − 1) + b_i² operations in accumulation, which is in general much greater than b_i · b_j.
To address this discrepancy and offer a more precise estimation of on-chip resource usage, we propose a novel metric, Effective Bit Operations (EBOPs). For computing EBOPs, the bitwidth used for constants is not the declared bitwidth, but the number of bits enclosed by non-zero bits in binary form. For instance, a weight represented as 001xx1000 will be counted as 4 bits instead of 8 bits. This approach ensures that the resource consumption is not overestimated by the declared bitwidth. If multiple weights share the same multiplier (e.g., with partial unrolling), the bitwidth of that weight group is defined by the number of bits enclosed by the most and least significant non-zero bits in that weight group. For simplicity, we consider only the absolute values of parameters when computing the bitwidths.

To address the second issue, we let the accumulation of N shifted numbers, each of b bits, be counted as N · b EBOPs. As a result, the EBOPs contributed by a multiplication inside an accumulation chain (e.g., inside a vector dot product) is still the product of the operands' bitwidths, as the accumulation of the resultant number is already implicitly counted. Hence, EBOPs effectively count only the BOPs conducted during multiplicative processes in a network, with the modified bitwidth definition. Let M = {{i, j}_n} be the set of all multiplication operations between operands with bitwidths b_i and b_j. The total number of EBOPs can then be expressed as

$$
\mathrm{EBOPs} = \sum_{\{i,j\}\in\mathcal{M}} b_i\cdot b_j.
\tag{5}
$$
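The sketch below tallies EBOPs for a fully unrolled dense layer following Eq. (5): each weight contributes the number of bits enclosed by its most and least significant non-zero bits, multiplied by the bitwidth of the activation it multiplies. It is a simplified re-implementation for illustration only, and it ignores the shared-multiplier (partially unrolled) case discussed above.

import numpy as np

def effective_bits(w, f_max=16):
    """Bits enclosed by the most/least significant non-zero bits of |w|,
    assuming w is representable with at most f_max fractional bits."""
    n = int(round(abs(w) * 2**f_max))     # integer representation on a 2^-f_max grid
    if n == 0:
        return 0                          # zero weight: contributes no EBOPs (pruned)
    lsb = (n & -n).bit_length() - 1       # position of the least significant set bit
    msb = n.bit_length() - 1              # position of the most significant set bit
    return msb - lsb + 1

def dense_ebops(weights, act_bits):
    """EBOPs of a fully unrolled dense layer: sum of b_w * b_x over all multiplications (Eq. 5)."""
    ebops = 0
    for column in weights.T:              # one output neuron per column
        for w, b_x in zip(column, act_bits):
            ebops += effective_bits(w) * b_x
    return ebops

w = np.array([[0.5, 0.75], [0.0, -1.25]])   # 2 inputs -> 2 outputs
print(dense_ebops(w, act_bits=[4, 6]))      # 0.5 -> 1 bit, 0.75 -> 2, 0.0 -> 0, 1.25 -> 3; prints 30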
Experimental findings validate EBOPs as a reliable estimator of on-chip resource consumption, which closely mirrors a linear combination of look-up table (LUT) and digital signal processor (DSP) usage. Detailed results are discussed in Section V. To get an accurate resource estimation from EBOPs, one should only include operations that will be executed in parallel. For instance, different inputs fed to the same multiplier through a buffer should be counted only once. Additionally, this estimation does not include overhead from non-multiplication-accumulation processes (e.g., buffers, logic switches, array indexing). For a complete resource usage estimation, one needs to estimate these contributions separately by other means and add them to the EBOPs estimate.
D. Gradient-based optimization of bitwidths

To obtain a fully-unrolled quantized neural network with minimum on-chip resource usage, we want the ability to optimize the bitwidth of each individual weight and activation. However, as the number of bitwidths to be optimized would then exceed the number of trainable parameters in the original network, we propose the use of a gradient-based method to handle this vast parameter space. Nonetheless, direct optimization of these bitwidths via gradients is not possible due to their discreteness and the lack of gradients on them. Therefore, we address two main issues: a) making the discrete bitwidths optimizable with a gradient; and b) estimating surrogate gradients for these bitwidths.
1) Optimizing discrete bitwidths with gradients: The first issue can be addressed by treating the discrete bitwidths similarly to the discrete weights in a quantized network. In particular, we store the number of fractional bits in floating-point, and apply the STE to them as is done for the weights during training. We follow the STE implementation used in QKeras:

$$
\mathrm{ste}(x) = x + \mathrm{sg}([x] - x),
\tag{6}
$$
where the stop-gradient operation sg : ℝ → ℝ acts as an identity function in the forward pass and as a zero function in the backward pass. In this way, the bitwidths can be optimized if they have gradients attached to them.
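Eq. (6) translates directly into a few lines of TensorFlow; note that tf.round uses round-half-to-even rather than the midpoint round-up of [·], a difference that is immaterial for this illustration.

import tensorflow as tf

def ste_round(x):
    """Straight-through estimator of Eq. (6): round in the forward pass,
    pass gradients through unchanged in the backward pass."""
    return x + tf.stop_gradient(tf.round(x) - x)

x = tf.Variable([1.2, 3.7])
with tf.GradientTape() as tape:
    y = tf.reduce_sum(ste_round(x))
print(y.numpy(), tape.gradient(y, x).numpy())   # 5.0 and [1. 1.]: rounding is transparent to the gradient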
2) Surrogate gradient for bitwidths: To address the second issue, we first consider some parameter x (e.g., a weight or activation) in the network and its corresponding quantizer f_q(·). If the number is quantized with f fractional bits, its associated quantization error δ_f can be expressed as follows:

$$
\delta_f \equiv x - f_q(x) = x - [x \cdot 2^f]\cdot 2^{-f}.
\tag{7}
$$
During training, we assume x to be a random variable following some smooth distribution D_x. We further assume that the variance of D_x is significantly larger than the quantization error δ_f, in such a way that one can view the quantization error's distribution as a uniform distribution:

$$
\delta_f \sim \mathrm{Uniform}\big(-2^{-f-1},\, 2^{-f-1}\big).
\tag{8}
$$
Let the loss of the network be L, and express the gradient of L with respect to f as

$$
\frac{\partial \mathcal{L}}{\partial f} = \frac{\partial \mathcal{L}}{\partial \delta_f}\cdot\frac{\partial \delta_f}{\partial f}.
\tag{9}
$$
In this expression, the first term ∂L/∂δ_f can be obtained trivially with backpropagation. The second term ∂δ_f/∂f is not well-defined, as f can only take integer values for a properly defined quantizer and thus has no gradient. As a solution to this, we propose a surrogate gradient method that assigns a gradient to f only on integer values.
We now express the loss as a function of the weights θ and all the quantization errors δ, L(θ, δ). We further assume that the loss function is sensitive to the magnitude of the quantization errors, but not to their signs, i.e. L(θ, |δ|), with |δ| being the element-wise absolute value of δ.
For a parameter x ∼ D_x to be quantized with f ∈ ℤ fractional bits, the corresponding absolute quantization error is |δ_f| ≡ |x − f_q^f(x)| ∼ Uniform(0, 2^{−f−1}). By increasing f by one, we obtain the absolute quantization error |δ_{f+1}| as a function of f and |δ_f|:

$$
|\delta_{f+1}| =
\begin{cases}
|\delta_f|, & |\delta_f| \le 2^{-f-2}\\
2^{-f-1} - |\delta_f|, & |\delta_f| > 2^{-f-2}.
\end{cases}
\tag{10}
$$
A straightforward way to obtain the gradient of |δ_f| with respect to f is to use the finite difference approximation

$$
\frac{\partial |\delta_f|}{\partial f} \leftarrow |\delta_{f+1}| - |\delta_f|.
\tag{11}
$$
However, as the absolute quantization error is bounded by a geometric sequence of 2^{−f−1}, using a linear difference for the approximation may be suboptimal. Instead, we use the following heuristic expression to approximate the gradient, which recovers Eq. (11) in the limit |δ_{f+1}| → |δ_f|:

$$
\frac{\partial |\delta_f|}{\partial f} \leftarrow \log\frac{|\delta_{f+1}|}{|\delta_f|}\cdot|\delta_f|.
\tag{12}
$$
Expressing the ratio of |δ_{f+1}| and |δ_f| as a function of |δ_f|, we have

$$
\frac{|\delta_{f+1}|}{|\delta_f|} =
\begin{cases}
1, & |\delta_f| \le 2^{-f-2}\\
\dfrac{2^{-f-1}}{|\delta_f|} - 1, & |\delta_f| > 2^{-f-2}.
\end{cases}
\tag{13}
$$
Though one may obtain a surrogate gradient by combining Eq. (12) and Eq. (13), using the local relation expressed in Eq. (13) between |δ_{f+1}| and |δ_f| would lead to a loss (gradient) landscape for f with extensive high-frequency components that is hard to optimize. To mitigate this issue, we smooth out the loss (gradient) landscape by taking the expectation of the first term of Eq. (12) over |δ_f| ∼ Uniform(0, 2^{−f−1}):

$$
\mathbb{E}_{|\delta_f|}\!\left[\log\frac{|\delta_{f+1}|}{|\delta_f|}\right] = -\log 2.
\tag{14}
$$
By substituting Eq. (14) into Eq. (12), and multiplying both sides by sign(δ_f), we have

$$
\frac{\partial \delta_f}{\partial f} \leftarrow -\log 2 \cdot \delta_f.
\tag{15}
$$
Hence, the forward pass of the quantizer, for one input value x and its fractional bitwidth f, can be expressed as in Algorithm 1. The backward pass is the auto-differentiation of the forward pass with the stop-gradient operations.
Algorithm 1: Quantizer forward pass
Data: x: the input value; f_fp: the fractional bitwidth (stored as a floating-point number)
Result: x_q: the differentiable, quantized value of x with fractional bitwidth f
  f ← ste(f_fp)
  x_q ← sg([x · 2^f] · 2^{−f})
  δ ← sg(x − x_q)
  δ ← sg(δ + ln 2 · f · δ) − ln 2 · f · δ
  x_q ← x − δ
  return x_q
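For concreteness, Algorithm 1 can be transcribed into TensorFlow as below. This is a minimal sketch of the forward pass written for this text; the released HGQ library wraps the same logic inside its layer classes and may organize it differently.

import tensorflow as tf

sg = tf.stop_gradient   # identity in the forward pass, zero gradient in the backward pass

def hgq_quantize(x, f_fp, eps=0.5):
    """Minimal transcription of Algorithm 1.

    Forward value: f_q(x) with rounding offset eps and round(f_fp) fractional bits.
    Gradients: identity (STE) with respect to x, and the -ln(2)*delta surrogate of
    Eq. (15) with respect to the stored floating-point bitwidth f_fp.
    """
    f = f_fp + sg(tf.round(f_fp) - f_fp)            # f <- ste(f_fp)
    scale = 2.0 ** f
    x_q = sg(tf.floor(x * scale + eps) / scale)     # x_q <- sg([x * 2^f] * 2^-f)
    delta = sg(x - x_q)                             # delta <- sg(x - x_q)
    delta = sg(delta + tf.math.log(2.0) * f * delta) - tf.math.log(2.0) * f * delta
    return x - delta                                # x_q <- x - delta

x = tf.Variable([0.3, -1.2])
f_fp = tf.Variable(2.0)
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(tf.square(hgq_quantize(x, f_fp)))
print(tape.gradient(loss, [x, f_fp]))               # STE gradient on x, surrogate gradient on f_fp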
As quantization generally results in higher loss values, the gradients propagated from the loss function to the bitwidths tend to increase them. To optimize for on-chip resource usage and latency, we introduce regularization terms that encourage smaller bitwidths.
The EBOPs metric introduced in Section III-C provides a good resource estimation. However, as it involves non-differentiable bit-counting for the weights and requires the min/max of the intermediate values in the network to be known, it cannot be used directly during training. Instead, we use an approximated form of EBOPs computed with estimated bitwidths, denoted $\widehat{\mathrm{EBOPs}}$, as the regularization term during training. In particular, we use max(i′ + f, 0) as the bitwidth for both weights and biases to evaluate $\widehat{\mathrm{EBOPs}}$ during training.

To evaluate the integer bitwidth without the sign bit, i′, during training for the activations' bitwidths, we utilize the min/max values realized by the corresponding activations within the same epoch, and evaluate i′ with Eq. (3). For the weights, i′ is also evaluated with Eq. (3), but with the min/max values being the minimum and maximum weights corresponding to it. With f being directly available during training, we can evaluate the approximated bitwidth and compute $\widehat{\mathrm{EBOPs}}$ at each training step. Indeed, $\widehat{\mathrm{EBOPs}}$ is an upper bound of EBOPs if the min/max values used are accurate, as f serves as an upper bound on the actual number of fractional bits enclosed by non-zero bits.
$\widehat{\mathrm{EBOPs}}$ is incorporated into the loss function as a regularization term with a coefficient β ∈ ℝ⁺ to balance the trade-off between the model performance and on-chip resource usage. Moreover, as there are values in networks that are not involved in any multiplicative operations, such as the last layer's outputs or inputs to non-linear activations, we apply an additional L1 regularization with a coefficient γ ∈ ℝ⁺ to the bitwidths to keep them from growing indefinitely and consuming excessive resources. Hence, the final loss function is given by

$$
\mathcal{L} = \mathcal{L}_\mathrm{base} + \beta\cdot\widehat{\mathrm{EBOPs}} + \gamma\cdot\mathrm{L1}_\mathrm{norm},
\tag{16}
$$

with the surrogate gradients from the loss function directly attached to the bitwidths as described in Algorithm 1.
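A sketch of how the objective of Eq. (16) could be assembled follows; base_loss, approx_ebops, and fractional_bits are generic placeholders for quantities that the HGQ layers track internally, and the exact bookkeeping in the library differs.

import tensorflow as tf

def hgq_total_loss(base_loss, approx_ebops, fractional_bits, beta=1e-6, gamma=2e-6):
    """Eq. (16): task loss + beta * approximated EBOPs + gamma * L1 norm of the bitwidths."""
    l1_bits = tf.add_n([tf.reduce_sum(tf.abs(f)) for f in fractional_bits])
    return base_loss + beta * approx_ebops + gamma * l1_bits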
As all additional gradients introduced in this section only apply to the bitwidths, the loss landscape of the network's weights remains unperturbed compared to that of networks with static quantization parameters.
3) Gradient for bitwidths with multiple parameters: Denote by g the collection of parameters sharing the same bitwidth, a parameter group. In experiments, we noticed that if we increase the size of a parameter group while keeping the same β, the corresponding bitwidth is more likely to collapse to zero. To mitigate this, we normalize the gradient from the regularization terms on f by 1/√‖g‖, based on empirical observations. Here, ‖g‖ denotes the number of parameters in g. This normalization makes the optimization more stable with respect to the size of the parameter groups.
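The normalization can be realized by scaling a parameter group's contribution to the regularization term before it enters the loss, which scales the gradient on the shared bitwidth by the same factor; whether HGQ applies the factor to the term or directly to the gradient is an implementation detail not shown here.

import tensorflow as tf

def normalized_group_penalty(group_penalty, group_size):
    """Scale a parameter group's regularization contribution by 1/sqrt(||g||) so the
    gradient reaching the shared bitwidth does not grow linearly with the group size."""
    return group_penalty / tf.sqrt(tf.cast(group_size, tf.float32))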
4) Connection to Pruning: From Eq. (4), it is observable that the quantized value is constantly zero if −ε · 2^{−f} ≤ x < (1 − ε) · 2^{−f}, or equivalently, |x| < 2^{−f−1} when ε = 1/2. As f ∈ ℤ, a sufficiently small f will cause the corresponding parameters in the network to be constantly zero, which is equivalent to having those parameters pruned. Assigning a distinct bitwidth to each parameter group in the network through HGQ thus automatically prunes the network during training in a way that takes both model performance and resource consumption into account. When the granularity for quantization is set to per-parameter, fully unstructured pruning is automatically performed.
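A quick numerical check of this pruning condition with the training-time quantizer of Eq. (4): any value with |x| < 2^{−f−1} is mapped to exactly zero, so driving f low enough zeroes the parameter out.

import numpy as np

def f_q(x, f, eps=0.5):
    """Training-time quantizer of Eq. (4), ignoring range/overflow handling."""
    return np.floor(x * 2.0**f + eps) * 2.0**-f

w = np.array([0.06, -0.10, 0.40])
print(f_q(w, f=2))    # step 0.25, threshold 0.125 -> [0.0, 0.0, 0.5]; the first two weights are pruned
print(f_q(w, f=-2))   # a negative f gives threshold 2.0 -> every weight collapses to 0.0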
Listing 1. Keras model example

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inp = Input((16,))
out = Dense(64, activation='relu')(inp)
out = Dense(32, activation='relu')(out)
out = Dense(32, activation='relu')(out)
out = Dense(5, activation='linear')(out)

keras_model = Model(inp, out)

Listing 2. HGQ model example

from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from HGQ import HQuantize, HDense

beta = 3e-6  # example EBOPs regularization strength

inp = Input((16,))
out = HQuantize(name='inp_q', beta=beta)(inp)
out = HDense(64, activation='relu', beta=beta)(out)
out = HDense(32, activation='relu', beta=beta)(out)
out = HDense(32, activation='relu', beta=beta)(out)
out = HDense(5, activation='linear', beta=beta)(out)

hgq_model = Model(inp, out)
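For context, the following is a sketch of how the model from Listing 2 might be trained and handed off to hls4ml via the proxy-model path described in the next section. The helper names trace_minmax and to_proxy_model follow the public HGQ repository but should be treated as assumptions here, and x_train, y_train, x_val, and y_val are placeholder arrays.

import tensorflow as tf
from HGQ import trace_minmax, to_proxy_model   # helper names per the HGQ repository (assumption)

# Standard Keras training; the beta regularization gradually drives the bitwidths down.
hgq_model.compile(optimizer='adam',
                  loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                  metrics=['accuracy'])
hgq_model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=100, batch_size=1024)

# Calibrate integer bits on representative data, then build the bit-accurate proxy model,
# which can be passed to hls4ml's Keras converter to generate the FPGA firmware project.
trace_minmax(hgq_model, x_train)
proxy_model = to_proxy_model(hgq_model)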
IV. THE HIGH GRANULARITY QUANTIZATION FRAMEWORK

The HGQ algorithm is available as a user-friendly Python library similar to QKeras. It functions as an advanced quantization API built on top of Keras, while leveraging hls4ml for the downstream model deployment on chips. This framework facilitates the automatic conversion of Keras models into hls4ml models, ensuring bit-accuracy as per the specifications of a dataset defined by the user, without requiring any further manual configuration.
HGQ is engineered to carry out automatic quantization on all compatible layers according to the approximated-EBOPs regularization factor, β, and the L1 regularization factor, γ. This approach eliminates the necessity for users to fine-tune quantization parameters for individual modules or undergo multiple training cycles to identify the best quantization scheme.

The HGQ framework provides drop-in replacements for the most commonly used Keras layers, making it straightforward to rewrite a standard Keras model as an HGQ model with minimal adjustments. For instance, as demonstrated in Listings 1 and 2, converting a Keras model to its HGQ counterpart primarily involves substituting existing layers with their HGQ alternatives, along with the inclusion of an additional layer to quantize the input values. The HGQ framework provides two categories of layers: Heterogeneous (H-) layers, which accept an additional parameter, beta, to manage the layer's resource usage regularization strength based on the approximated EBOPs, and Passive (P-) layers, which serve to relay metadata without performing quantization. The H- layers also allow for layer-specific kernel and activation quantizer configurations for more fine-grained control. Though manual bitwidth configuration should not be required in most cases, the user may still opt to specify the bitwidths for specific layers if necessary.
Beyond quantization-aware training, the framework introduces a convenient intermediate model representation, "proxy models", for converting a trained Keras model into an hls4ml project. This feature accommodates both HGQ and QKeras models, automating the creation and enforcement of hls4ml's quantization configurations for precise conversions. Furthermore, the proxy model facilitates bit-accurate emulation of the compiled hls4ml model, aiding in debugging and validating the hls4ml model's performance before deployment. As this emulation correctly models the overflow behavior of the fixed-point numbers, it remains accurate in case of overflows due to limited bitwidths. However, when the intermediate values are quantized with a high bitwidth, the emulation may have errors at the machine-epsilon level due to the use of floating-point numbers in the emulation.
V. RESULTS

To evaluate the performance of the HGQ method, we train and evaluate models on two classification tasks – one for physics experiments and one for computer vision – and one regression task for physics experiments: jet tagging at the LHC [34], SVHN digit classification [62], and muon tracking at the LHC [63], respectively.

To demonstrate the trade-off between the accuracy (or resolution for regression tasks) and resource usage of the models, we methodically adjusted the β factor for each task during training to map out the Pareto fronts. For each training run, we initialize all layers with a notably small β, which is then gradually increased throughout the training. Meanwhile, we kept the γ value fixed at 2×10⁻⁶ for all experiments to avert the risk of diverging bitwidths for some parameters. After each epoch, we record the validation accuracy (or resolution) and the approximated EBOPs, and keep all model checkpoints that are on the Pareto front defined by these two metrics. Post-training, we use the entire training and validation sets as the calibration dataset to determine the required bitwidths and evaluate the exact EBOPs for all checkpointed models. Subsequently, we compute the test accuracy (resolution) for all the models, and then obtain their on-chip resource consumption after performing the place-and-route phase with Vivado/Vitis®.
A. Resource Estimation via EBOPs

We first demonstrate that EBOPs is a good estimator of on-chip resource consumption. We consider the following major resource types on an AMD® FPGA chip: flip-flops (FFs, sometimes referred to as registers), LUTs, DSPs, and on-board memories (BRAMs and URAMs). When designing an unrolled neural network for ultra-low latency applications like the hardware triggers for LHC experiments, the limiting resources are usually either LUTs or DSPs. Empirically, for models synthesized with Vivado/Vitis® HLS, operations with larger bitwidths are more likely to consume DSPs, while operations with smaller bitwidths are more likely to consume LUTs. In our experiments, we observed that EBOPs roughly predicts a linear combination of the LUT and DSP consumption, namely, EBOPs ≈ LUT + 55 × DSP for models synthesized with parallel IO, i.e., intermediate values in the model are directly wired between layers/modules with no extra buffer in between.
In Figure II, we demonstrate this relationship between EBOPs and the actual on-chip resource consumption. The data points shown in this figure are from the models presented later in this section for the aforementioned three tasks. Although the relationship is not exact, we can still make a reasonable estimation of the resource usage based on EBOPs. Also, this linear relation suggests that treating one DSP as approximately 55 LUTs could be a practical approximation when comparing resource usage across different models. It is important to note that EBOPs primarily account for vector dot product-like operations between constants and variables. Therefore, if other kinds of operations contribute significantly to the on-chip resource consumption, EBOPs will underestimate the overall resource consumption. For instance, the SVHN classifier models shown in Figure II, synthesized with stream IO, which requires additional buffers for intermediate values, have higher actual resource consumption than what EBOPs predicts.
Fig. II. The relationship between EBOPs and the post place-and-route resource consumption (LUT + 55×DSP), for the jet classifier, SVHN classifier, and muon tracker models. Data points shown in this figure are from models presented later in this section for the three tasks. EBOPs roughly predicts a linear combination of the LUT and DSP consumption for models synthesized with parallel IO.
B. Jet Classification at the LHC

We conducted a comparison of the classification accuracy, latency, and on-chip resource utilization of models trained with HGQ against various quantized models from earlier research on this task.
Fig. III. Accuracy versus resource consumption (LUT + 55×DSP) of the jet tagging models. Note that models with different DSP and LUT usage could land on the same point on this plot due to the linear combination of DSPs and LUTs.
We use the dataset from [66] to classify jets – collimated showers of particles from quark and gluon decays at collider physics experiments – into five classes based on their originating particle: single quark (q), single gluon (g), W and Z bosons decaying to two quarks, and top (t) quark decaying to two quarks and a heavier bottom quark. The inputs for each jet are 16 scalar values representing physics-motivated high-level features. The model architecture employed is based on the full precision baseline model described in the original work [34], which is a 4-layer fully connected neural network. The exact architecture is shown in Figure VI in the extended data.
We summarize the performance and resource usage of all models we compared in Table I and visualize them in Figure III. The following models are cited from [34]: Baseline Full (BF), Baseline Pruned (BP), Baseline Heterogeneous (BH), Quantized 6-bit (Q6), AutoQKeras Energy Optimized (QE), and AutoQKeras Bits Optimized (QB). All of these models, except for BF and BP, are trained quantization-aware. Hyperparameter optimization with Gaussian Process is applied to the AutoQKeras models to achieve low resource consumption. LogicNets JSC-M and JSC-L are cited from [64], where the networks are co-designed to use on-chip LUTs efficiently. BP-DSP-RF=2 [43] is a neural network implemented in QKeras with a reuse factor (i.e., how many times a multiplier logic unit may be used for inferencing one sample) of two, which is pruned to reduce DSP usage while preserving accuracy by formulating the trade-off as a knapsack problem. For MetaML-α_q=1% and MetaML-α_q=4% [35], iterative searches through model architecture and quantization/pruning configurations are performed to achieve better accuracy-resource trade-offs. SymbolNet [65] leverages a
gradient-based method for neural symbolic regression. It also uses an adaptive dynamic pruning scheme to reduce on-chip resource consumption while maintaining the accuracy.

TABLE I
Accuracy, resource consumption, latency, and initiation intervals (IIs) of the jet tagging models. Resources reported for HGQ models are after place-and-route with an AMD® Virtex® UltraScale+™ XCVU9P FPGA. HGQ models outperform the baseline models by a large margin in all of accuracy, resource consumption, and latency.

Model                | Accuracy (%) | Latency (cc) | DSP (%)      | LUT (%)       | FF (%)       | II (cc)
BF [34]              | 74.4         | 9 (45 ns)    | 56.0 (1,826) | 4.09 (48,321) | 0.8 (20,132) | 1
BP [34]              | 74.8         | 14 (70 ns)   | 7.7 (526)    | 1.49 (17,577) | 0.4 (10,548) | 1
BH [34]              | 73.2         | 14 (70 ns)   | 1.3 (88)     | 1.34 (15,802) | 0.3 (8,108)  | 1
Q6 [34]              | 74.8         | 11 (55 ns)   | 1.8 (124)    | 3.36 (39,782) | 0.3 (8,128)  | 1
QE [34]              | 72.3         | 11 (55 ns)   | 1.0 (66)     | 0.77 (9,149)  | 0.1 (1,781)  | 1
QB [34]              | 71.9         | 14 (70 ns)   | 1.0 (69)     | 0.95 (11,193) | 0.1 (1,771)  | 1
LogicNets JSC-M [64] | 70.6         | N/A          | 0 (0)        | 1.22 (14,428) | 0.02 (440)   | 1
LogicNets JSC-L [64] | 71.8         | 5 (13 ns)    | 0 (0)        | 3.21 (37,931) | 0.03 (810)   | 1
BP-DSP-RF=2 [43]     | 76.3         | 21 (105 ns)  | 2.6 (175)    | 0.47 (5,504)  | 0.13 (3,036) | 2
MetaML-αq=1% [35]    | 75.6         | 9 (45 ns)    | 0.7 (50)     | 0.57 (6,698)  | N/A          | 1
MetaML-αq=4% [35]    | 72.8         | 8 (40 ns)    | 0.2 (23)     | 0.57 (7,224)  | N/A          | 1
SymbolNet [65]       | 71.          | 2 (10 ns)    | <0.1 (3)     | 0.01 (177)    | <0.01 (109)  | 1
HGQ-1                | 76.4         | 6 (30 ns)    | 0.50 (34)    | 0.53 (6,236)  | 0.05 (1,253) | 1
HGQ-2                | 75.9         | 4 (20 ns)    | 0.09 (6)     | 0.27 (3,162)  | 0.02 (550)   | 1
HGQ-3                | 75.0         | 4 (20 ns)    | 0.07 (5)     | 0.13 (1,540)  | 0.02 (370)   | 1
HGQ-4                | 73.9         | 3 (15 ns)    | 0.00 (0)     | 0.05 (565)    | 0.01 (140)   | 1
HGQ-5                | 72.5         | 2 (10 ns)    | 0.00 (0)     | 0.04 (468)    | 0.01 (131)   | 1
HGQ-6                | 71.0         | 2 (10 ns)    | 0.00 (0)     | 0.02 (256)    | 0.00 (66)    | 1
HGQ-c1               | 76.3         | 8 (40 ns)    | 0.26 (18)    | 0.50 (5,899)  | 0.09 (2,072) | 1
HGQ-c2               | 74.2         | 3 (15 ns)    | 0.00 (0)     | 0.06 (678)    | 0.01 (172)   | 1
The HGQ trained models, HGQ-1 through HGQ-6, are taken from the same training run in which β is gradually increased. The model is initialized with 2 fractional bits for the activations, and a bitwidth of 2 excluding the sign bit for the weights. This model is fully unrolled, and per-parameter quantization is applied. Throughout the training process of 300,000 epochs, β is gradually increased from 10⁻⁶ to 10⁻⁴. Due to the model's compact size, the entire training completes in ∼4 hours on a modern consumer GPU with a batch size of 33,200.
As shown in Figure III and Table I, the HGQ approach outperforms all previous works on quantized neural networks by significant margins, both in terms of model accuracy and resource usage. Depending on the working point, HGQ may reduce the resource consumption by 50% to up to 95% while maintaining the same accuracy. When working with a lower accuracy requirement, HGQ can also achieve resource consumption similar to an optimized symbolic classifier.
We also studied the performance of HGQ models trained with fixed β values. In Figure III and Table I, these correspond to HGQ-c1 and HGQ-c2, which are trained with fixed β's of 2.1×10⁻⁶ and 1.2×10⁻⁵, respectively. Both models are trained for 5,000 epochs with the same batch size. By comparing with the aforementioned HGQ models, we observe that models trained with either a constant or an increasing β value achieve a comparable balance between accuracy and resource consumption. This suggests that a lengthy training process with a gradually increasing β value is not always necessary for HGQ to obtain optimal trade-offs between accuracy and resource efficiency.
Fig. IV. Accuracy versus resource consumption (LUT + 55×DSP) of the SVHN classifier models. Note that models with different DSP and LUT consumption could land on the same point on this plot due to taking a linear combination of DSPs and LUTs.
C. SVHN Classifier

We also benchmark HGQ on a computer vision task and compare it to previous state-of-the-art works [62], [43] on real-time inference. We make use of the SVHN dataset [67], which consists of 32×32 RGB images of house numbers taken from Google Street View, and the task is to classify the digit in the center of the image into one of ten classes. The architecture of the model is a LeNet-like [68] convolution-dense network taken from [62], and the exact model architecture is shown in
Figure VII in the extended data.

TABLE II
Accuracy, resource usage, latency, and initiation intervals of the SVHN classifier models. Reported resource usage for HGQ models is after place-and-route with an AMD® Virtex® UltraScale+™ XCVU9P FPGA. HGQ models outperform the baseline models both in accuracy and resource consumption while maintaining comparable latency.

Model            | Accuracy (%) | Latency (cc)    | DSP (%)       | LUT (%)         | FF (%)        | BRAM (%)      | II (cc)
BP 14-bit [62]   | 93.          | 1,035 (5.18 μs) | 48.85 (3,341) | 12.27 (145,089) | 2.77 (65,482) | 3.08 (66.5)   | 1,030
Q 7-bit [62]     | 94.          | 1,034 (5.17 μs) | 2.56 (175)    | 12.77 (150,981) | 1.51 (35,628) | 3.10 (67.0)   | 1,029
QP 7-bit [62]    | 94.          | 1,035 (5.18 μs) | 2.54 (174)    | 9.40 (111,152)  | 1.38 (32,554) | 3.10 (67.0)   | 1,030
AQ [62]          | 88.          | 1,059 (5.30 μs) | 1.05 (72)     | 4.06 (48,027)   | 0.64 (15,242) | 1.48 (32.5)   | 1,029
AQP [62]         | 88.          | 1,059 (5.30 μs) | 1.02 (70)     | 3.28 (38,795)   | 0.63 (14,802) | 1.39 (30.5)   | 1,029
BP-DSP-RF=3 [43] | 92.          | ? (43.58 μs)    | 17.76 (1,215) | 5.01 (59,279)   | 1.97 (46,584) | 35.88 (1,550) | 35.88
HGQ-1            | 93.9         | 1,050 (5.25 μs) | 0.85 (58)     | 5.87 (69,407)   | 1.18 (27,853) | 1.48 (32.0)   | 1,029
HGQ-2            | 93.1         | 1,061 (5.31 μs) | 0.44 (30)     | 4.00 (47,314)   | 0.87 (20,582) | 1.30 (28.0)   | 1,029
HGQ-3            | 91.9         | 1,058 (5.29 μs) | 0.22 (15)     | 3.39 (40,032)   | 0.76 (18,087) | 1.09 (23.5)   | 1,029
HGQ-4            | 90.9         | 1,059 (5.30 μs) | 0.19 (13)     | 2.91 (34,435)   | 0.73 (17,261) | 1.04 (22.5)   | 1,029
HGQ-5            | 89.9         | 1,056 (5.28 μs) | 0.15 (10)     | 2.60 (30,766)   | 0.64 (15,205) | 0.97 (21.0)   | 1,029
HGQ-6            | 88.8         | 1,056 (5.28 μs) | 0.09 (6)      | 2.37 (27,982)   | 0.62 (14,736) | 0.97 (21.0)   | 1,029
We summarize the performance and resource usage of all models we compared in Table II and visualize them in Figure IV. In the table and figure, AutoQKeras Pruned (AQP), AutoQKeras (AQ), QKeras Pruned 7-bit (QP 7-bit), QKeras 7-bit (Q 7-bit), and Baseline Pruned (BP 14-bit) are taken from [62]. All of these models except BP are trained quantization-aware with QKeras. In particular, AQP, QP, and BP are pruned to a sparsity of 50% iteratively with a magnitude-based method during training. AQP and AQ are heterogeneously quantized models, where the quantization configurations are optimized through AutoQKeras's hyperparameter tuner with Gaussian Process. BP-DSP-RF=3 is cited from [43], where the network is implemented in QKeras with a reuse factor of three, and the trade-off between accuracy and DSP usage is formulated as a knapsack problem to perform optimal pruning.
The HGQ trained models, HGQ-1 through HGQ-6, are taken from a single training run during which the β value is gradually increased. For training, we initialize the model with 6 fractional bits for the activations, and a bitwidth of 6 for the weights excluding the sign bit. The β value is systematically increased from 10⁻⁷ to 10⁻⁴ over approximately 12,000 epochs. Completing this training process requires ∼10 hours on a modern consumer GPU with a batch size of 2,048.
As this model is too large to fit on-chip if fully unrolled, we use the stream IO implementation in hls4ml. This partitions the convolutional layers into smaller blocks of individual kernel operations (i.e., partitioned by rows in the im2col algorithm [69]) and computes them one at a time at inference time [62]. Due to limitations of the current implementation, intra-layer heterogeneous activation quantization cannot be utilized with stream IO. Hence, while the weights are quantized at the per-parameter granularity, activations are quantized in layer-wise blocks. Nevertheless, HGQ still outperforms both baselines by a considerable margin of up to 40% in resource savings while maintaining similar accuracy and latency.
D. Muon Tracker

For this task, we compare the resolution, latency, and on-chip resource consumption of the HGQ trained models to the models presented in [63] on a regression task proposed in the same work. The task involves predicting the incidence angle of a simulated muon track in a particle detector. The inputs are one 3×50 and two 3×50 binary-valued arrays, representing the hits recorded in three detector stations. The output is a single scalar value representing the angle in milliradians. We evaluate the network's performance in terms of resolution, defined as the root-mean-square of the angle's reconstruction errors. Following the same approach as in [63], we exclude outliers where the absolute error is greater than 30 milliradians. The architecture of the model is a multistage neural network taken from the original work, and is shown in Figure VIII in the extended data.
The results, including the performance and resource consumption of the models trained with HGQ and the models proposed in the original work, are presented in Table III and visualized in Figure V. The Quantized with * fractional bits (Qf*) models presented in [63] are all trained quantization-aware with QKeras using manually tuned parameters, where * stands for the number of fractional bits used for all network parameters.
The HGQ trained models, HGQ-1 through HGQ-6, are taken from a single training run during which the β value is gradually increased. We initialize the model with 6 fractional bits for the activations, and a bitwidth of 6 excluding the sign bit for the weights. The model is fully unrolled, and the quantization is applied at the per-parameter granularity. The β value is systematically increased from 3×10⁻⁶ to 6×10⁻⁴ over approximately 600,000 epochs, which takes ∼16 hours on a modern consumer GPU with a batch size of 16,384.
The HGQ models consistently outperform the baseline models, with a reduction in resource consumption of 40∼50%, while achieving the same or better resolution with comparable latency.
TABLE III
Resolution, resource consumption, latency, and initiation intervals of the Muon Tracker models. The resource usage reported for HGQ models is after place-and-route with an AMD® Virtex® UltraScale+™ XCVU13P FPGA. HGQ models outperform the baseline models in both resolution and resource consumption for this task while maintaining comparable latency.

Model    | Resolution (mrad) | Latency (cc)  | DSP (%)       | LUT (%)       | FF (%)       | BRAM (%)    | II (cc)
Qf8 [63] | 1.95              | 17 (106.3 ns) | 14.34 (1,762) | 2.19 (37,867) | 0.24 (8,443) | 1.40 (37.5) | 1
Qf7 [63] | 1.97              | 11 (68.8 ns)  | 11.30 (1,389) | 2.02 (34,848) | 0.16 (5,433) | 1.40 (37.5) | 1
Qf6 [63] | 2.04              | 13 (81.3 ns)  | 2.64 (324)    | 3.16 (54,638) | 0.19 (6,525) | 1.40 (37.5) | 1
Qf5 [63] | 2.15              | 11 (68.8 ns)  | 0.72 (88)     | 2.32 (40,039) | 0.10 (3,419) | 1.40 (37.5) | 1
Qf4 [63] | 2.45              | 10 (62.5 ns)  | 0.20 (24)     | 1.65 (28,526) | 0.09 (2,954) | 1.40 (37.5) | 1
Qf3 [63] | 2.78              | 9 (56.3 ns)   | 0.02 (2)      | 1.25 (21,682) | 0.06 (2,242) | 1.40 (37.5) | 1
HGQ-1    | 1.95              | 11 (68.8 ns)  | 4.25 (522)    | 2.28 (39,413) | 0.17 (6,043) | 0.93 (25.0) | 1
HGQ-2    | 2.00              | 11 (68.8 ns)  | 1.25 (154)    | 1.99 (34,460) | 0.15 (5,263) | 0.93 (25.0) | 1
HGQ-3    | 2.09              | 12 (75.0 ns)  | 0.55 (68)     | 1.44 (24,941) | 0.14 (4,677) | 1.40 (37.5) | 1
HGQ-4    | 2.20              | 13 (81.3 ns)  | 0.33 (41)     | 1.25 (21,557) | 0.14 (4,699) | 1.40 (37.5) | 1
HGQ-5    | 2.39              | 10 (62.5 ns)  | 0.22 (27)     | 0.98 (16,918) | 0.07 (2,484) | 1.40 (37.5) | 1
HGQ-6    | 2.63              | 12 (75.0 ns)  | 0.08 (10)     | 0.77 (13,306) | 0.10 (3,429) | 0.93 (25.0) | 1
Fig. V. Resolution versus resource consumption (LUT + 55×DSP) of the muon tracking models. Note that models with different DSP and LUT consumption could land on the same point on this plot as a result of taking the linear combination of DSPs and LUTs.
VI. CONCLUSION AND FUTURE WORK

In this work, we present HGQ, a novel method to optimize quantized neural networks for real-time applications on FPGAs and possibly also ASICs. The HGQ approach enables the optimization of the quantization bitwidths at arbitrary granularity, up to the per-parameter level, through a gradient-based approach that is conscious of both resource usage and loss minimization.
By maximally leveraging the ability of the specialized hardware to perform fully heterogeneous computations, we are able to minimize the resource consumption of the models while maintaining the model performance. In particular, our findings show that HGQ achieves up to a 95% reduction in resource consumption compared to leading compression techniques on FPGAs without performance degradation. We further demonstrate that a single training session with HGQ is sufficient to explore a broad spectrum of trade-offs between model performance and resource utilization, efficiently recovering the Pareto frontier, thereby rendering the model optimization process both more efficient and effective. Moreover, we introduce EBOPs, a metric providing an accurate estimation of the final on-chip resource consumption of a model as a linear combination of LUTs and DSPs, allowing for efficient software-hardware co-design.
To facilitate adoption, we have developed a user-friendly library that simplifies the application of this method. The library offers an easy-to-use interface for defining and training quantized neural networks with our method. Through interfacing with hls4ml, HGQ provides bit-accurate conversions from software to FPGA firmware models without the need for manual intervention, significantly simplifying and streamlining the workflow from training to deployment.
We look forward to developing new neural-network-based triggers for the CERN LHC experiments with the HGQ+hls4ml workflow for the upcoming data-taking period. With the increased hardware efficiency, we hope to enable more complex models to be deployed on the trigger system, which could lead to more accurate trigger decisions. For future improvements of this method, we hope to develop a differentiable latency estimator for the models. Though lower bitwidths generally result in lower latencies, this relation does not hold in some cases, e.g., when the HLS backend switches between DSP- and LUT-based arithmetic implementations. Also, we would like to explore the possibility of having separate LUT and DSP consumption estimators, as the resource constraints on the two are not always interchangeable, depending on the specific application.
VII. CODE AVAILABILITY

We have made our library publicly available under the Apache 2.0 license at https://www.github.com/calad0i/HGQ. The scripts to reproduce the results in this paper are also available at https://www.github.com/calad0i/HGQ-demos under the Apache 2.0 license.
VIII. DATA AVAILABILITY

The data used for training and evaluation in this work are all publicly available datasets. The jet tagging dataset is available at https://dx.doi.org/10.5281/zenodo.2603255. The SVHN dataset is available at http://ufldl.stanford.edu/housenumbers/. The muon tracking dataset is available at https://dx.doi.org/10.57967/hf/2084. Results shown in this work can be reproduced using the code available at https://www.github.com/calad0i/HGQ-demos.