FERMILAB-PUB-24-0213-CMS
CaltechAUTHORS:10.7907/hq8jd-rhg30
Gradient-based Automatic Mixed Precision Quantization
for Neural Networks On-Chip
Chang Sun,1,2 Thea K. Årrestad,1 Vladimir Loncar,3,4 Jennifer Ngadiuba,5 and Maria Spiropulu2
1 ETH Zurich (Zurich, Switzerland)
2 California Institute of Technology (CA, USA)
3 Massachusetts Institute of Technology (MA, USA)
4 Institute of Physics Belgrade (Belgrade, Serbia)
5 Fermi National Accelerator Laboratory (IL, USA)
E-mail: chsun@cern.ch
Model size and inference speed at deployment time are major challenges in many deep learning applications. A promising strategy to overcome these challenges is quantization. However, straightforward uniform quantization to very low precision can result in significant accuracy loss. Mixed-precision quantization, based on the idea that certain parts of the network can accommodate lower precision without compromising performance compared to other parts, offers a potential solution. In this work, we present High Granularity Quantization (HGQ), an innovative quantization-aware training method that fine-tunes the per-weight and per-activation precision by making the bitwidths optimizable through gradient descent. This approach enables ultra-low latency and low power neural networks on hardware capable of performing arithmetic operations with an arbitrary number of bits, such as FPGAs and ASICs. We demonstrate that HGQ can outperform existing methods by a substantial margin, achieving resource reduction by up to a factor of 20 and latency improvement by a factor of 5 while preserving accuracy.
I. INTRODUCTION
Edge computing has significantly increased the importance of real-time deep neural network inference on specialized hardware [1]. While the typical latency threshold for real-time inference applications is O(1) ms [2–4], certain domains require sub-microsecond inference times. At the CERN Large Hadron Collider (LHC) [5], detectors generate hundreds of terabytes of data every second from proton-proton collisions occurring every 25 nanoseconds. This enormous data throughput is reduced by the trigger, a hardware system filtering data in real time at the same rate. This detector subsystem determines the fate of each collision event – whether it should be preserved for offline processing or discarded – with a decision-making latency ceiling at a few microseconds [6, 7]. The trigger's accuracy is vital to retain only the interesting events for physics studies, thereby managing the downstream bandwidth effectively by reducing the data rate by two orders of magnitude. The system consists of O(1000) field-programmable gate arrays (FPGAs), where several algorithms run in parallel on each FPGA. As a result, resources are scarce and the spatial complexity of each algorithm needs to be minimal. In anticipation of the LHC's upgrade to the High Luminosity-LHC (HL-LHC) [8], which will increase data rates and complexity by a factor of 1-2, machine learning techniques are being actively explored to enhance the speed and accuracy of the algorithms in the future trigger system [6, 7].
system [6, 7]. However, integrating demanding models
under such strict resource and latency constraints with-
out compromising performance is a hurdle. To satisfy
the latency requirements, neural networks on FPGAs
for LHC physics experiments are usually fully unrolled
– all arithmetic operations are done by different compo-
nents in the circuit without overlapping – and pipelined
to minimize the latency and maximize the throughput
at the cost of higher resource consumption. Efforts in
recent years have focused on algorithmic efficiency, with
strategies ranging from the design of compact networks
to weight pruning and quantization [9, 10].
Quantization is a model compression technique that
converts model parameters into lower-precision formats,
resulting in some performance degradation in exchange
for a smaller model size and/or faster inference. To quan-
tize a neural network, one can either reduce the pre-
cision of its parameters after training or train the net-
work directly with low precision parameters. These two
approaches are referred to as post-training quantization
(PTQ) and quantization-aware training (QAT), respec-
tively. While PTQ is computationally cheaper to perform
in general, it usually induces a more significant loss in
performance compared to QAT under the same compres-
sion ratio. To aim for the best possible trade-off between
model performance and resource consumption, we follow
the QAT approach.
In this work, we introduce high-granularity quantization (HGQ), a novel QAT method that optimizes the quantization bitwidths during training using gradients, which enables models to be quantized at arbitrary granularity. In contrast to existing methods, where bitwidths for network parameters are optimized in predefined, structured blocks, HGQ provides more granular control over which parameters share the same bitwidth.
For models deployed with fully unrolled implementations
like the ones used in the trigger systems, every param-
eter in the network may have its unique bitwidth. We
illustrate the key difference between the HGQ method
and the conventional block-wise quantization methods in
Figure I. Optimizing the bitwidths at higher granularity
allows HGQ to find better trade-offs between the model
performance and resource consumption. Furthermore, by
optimizing these individual bitwidths alongside the net-
work using gradient descent, the need for including the
bitwidths as hyperparameters to be optimized with iter-
ative trainings is eliminated. Depending on the specific
task, we demonstrate that HGQ has the potential to out-
perform other model compression methods and achieve
resource reduction by up to a factor of 20, and latency
improvement by a factor of 5 while preserving the model
performance.
A functional HGQ library has been developed with TensorFlow [11] and Keras [12], and we have released it as free and open-source software. The Vivado/Vitis® FPGA back-ends are supported through integration with hls4ml [13] – a software tool designed to facilitate the conversion of machine learning algorithms into hardware designs, which is specifically optimized for ultra-low latency deployment on FPGAs and application-specific integrated circuits (ASICs) [14] through High-Level Synthesis (HLS). The HGQ library guarantees an exact correspondence between the software and firmware models, provided that no numeric overflow occurs and intermediate values are exactly representable by the floating-point datatype used in emulation.
The work presented here makes the following contributions:
- We present a new algorithm for obtaining surrogate gradients for the quantization bitwidths, derived from both the loss function and the estimated model resource consumption, enabling full gradient-based optimization of bitwidths;
- We propose a new metric named Effective Bit Operations (EBOPs) for accurate estimation of a model's on-chip resource consumption;
- We enable heterogeneous quantization of a specific model at arbitrary granularity up to the per-parameter level, aiming to minimize hardware resource usage while preserving the performance. This approach automatically includes sparse pruning of the network parameters as their bitwidths reach zero;
- We have made the HGQ library easily accessible online [15] and user-friendly: a simple drop-in replacement of the Keras layers makes it straightforward for users to transform Keras models into their corresponding heterogeneously quantized versions;
- We have added support for HGQ-trained models in the hls4ml tool, which converts these pre-trained quantized models into highly-parallel FPGA firmware with HLS. We ensure bit-level consistency between the software model and the corresponding firmware, making the library safe and easy to use for non-experts;
- Compared to other state-of-the-art model compression methods targeting ultra-low latency applications, we demonstrate a resource reduction of up to 95% and an improvement in latency of up to 5-fold by using HGQ, all while maintaining accuracy.
II. RELATED WORK
Quantization is a widely adopted method for com-
pressing deep neural networks (DNNs) for implement-
ing them on specialized hardware devices such as FP-
GAs or ASICs.
Previous studies have utilized extremely low precision quantization, such as binary or ternary, across networks to enhance throughput and reduce latency. Binary quantization restricts parameters to α × {−1, 1} (or α × {0, 1} in some conventions), and ternary to α × {−1, 0, 1}, with α being a relatively high-precision scaling factor. Key examples include DoReFa-Net [16], ABC-net [17], BinaryConnect [18], XNOR-net [19], TWN [20], TTQ [21], and [22]. While these methods can achieve high model compression ratios, they come at the cost of substantially reduced model performance compared to the corresponding floating-point baselines. Using the same principles as binary networks, several studies have moved to multi-bit network designs that represent weights through binary bases and values, highlighted in works like [17, 23–27].
Many studies have investigated heterogeneous quan-
tization with layer/channel-specific precision to lessen
the performance loss due to quantization. In particular,
HAQ [28] and AutoQ [29] utilize reinforcement learning
to find optimal bitwidth configurations. HAWQ, HAWQ-
V2, PyHessian, Q-BERT, and OBQ [30–34] focus on
optimizing bitwidths with the second-order approxima-
tions of the loss function around the unquantized optimal
weights. DNAS [35] and AutoQKeras [36] optimize the
bitwidths and network architecture simultaneously. IN
particular, DNAS uses stochastic sampling to obtain a
subnetwork from a super network, and AutoQKeras em-
ploys gradient-free methods like Gaussian Process, Hy-
perband, and stochastic search for hyperparameter op-
timizations. Similarly, Meta-ML [37] applies iterative
optimizations to various hyperparameters, including the
bitwidths, weight pruning strategy, and model architec-
ture. FILM-QNN [38] optimizes weight and activation’s
bitwidths in a manner conducive to hardware efficiency
for convolutional neural networks. For each convolutional
layer, it categorizes filters into two groups of lower and
higher precision based on the anticipated loss of perfor-
mance due to quantizing each filter, and arranges them
to utilize the on-board multipliers on FPGAs efficiently.
FIG. I. An illustration of the HGQ method on a dense network. Activations and weights of the network are shown as circles and lines, with the thickness indicating the corresponding bitwidth. A line/circle is dropped when the corresponding value is pruned. Top left: baseline network with high precision throughout. Top right: a layer-wise heterogeneously quantized network, e.g., trained with QKeras. Bottom right: a network that is both layer-wise heterogeneously quantized and unstructured pruned. Bottom left: a network trained with HGQ at maximum granularity: each weight and activation has its unique bitwidth. When a bitwidth reaches zero, the corresponding value is effectively pruned.

Heterogeneous quantization at sub-layer/channel granularity has also been studied by other works. RVQuant [39], BitsandBytes [40], SpQR [41], and SqueezeLLM [42] offload a small fraction of outlier weights to higher-precision formats to mitigate the performance degradation due to quantization. These works primarily target weight size reduction for larger models, rather than efficient inference on specialized hardware like FPGAs or ASICs.
Pruning is another technique used to compress neural
networks. It involves the removal of weights in a network
that have minimal impact on the overall performance.
This concept was first introduced in [43], and first applied
to neural networks in [44]. The removal of weights is
sometimes formulated by pruning masks – binary valued
tensors that are multiplied with the weights to zero out
the pruned ones.
Depending on how the pruned weights are arranged,
pruning can be categorized as structured or unstruc-
tured pruning. Structured pruning removes weights in
specific blocks or following certain patterns, usually in
a hardware-friendly manner to speed up the inference,
as in [45–50]. On the other hand, unstructured pruning
targets individual weights for the best compression ratio,
as in [51–55]. Semi-structured pruning targeting specific
hardware accelerators also exists, as in [50, 56].
On the methodology side, [34, 44, 48, 50] use the Hessian of the loss function to determine the optimal pruning mask and weight compensations post-training. [47] formulates pruning mask creation as an optimal transport problem and then relaxes it to be differentiable for training. [49, 55] directly use trainable pruning masks that are optimized along with the weights. [52, 53] iteratively remove weights with small magnitudes during training. [54] optimizes the pruning mask by solving it as a constrained optimization problem during training with the stochastic Frank-Wolfe algorithm. With a similar objective to this work, [45] solves the post-training pruning problem with constraint programming to reduce the network's on-chip resource usage.
In this work, we consider pruning as a special form
of quantization, where the pruned weights are quantized
with zero bits. In this way, pruning is automatically done
by optimizing the quantization bitwidths during training.
Closely related to this work, the QKeras [36] framework aims to train and optimize neural networks for deployment on FPGAs and ASICs. QKeras is developed on top of Keras and leverages hls4ml [57] for hardware deployment. It enables training and optimization of neural networks with hardware-friendly fixed-point numbers for both weights and activations. AutoQKeras, a feature within QKeras, enables automatic adjustment of quantization settings for each layer using gradient-free approaches.
Brevitas [58] serves as the PyTorch [59] equivalent of QKeras, and is commonly used in tandem with the FINN and FINN-R frameworks from Xilinx Research [60, 61] for deployment on AMD® FPGAs.
III. HIGH GRANULARITY QUANTIZATION
In this work, we introduce High Granularity Quantization (HGQ), a novel quantization approach with the unique capability of optimizing the bitwidths in a quantized neural network at arbitrarily fine granularity – up to the per-parameter level. At the same time, it provides an accurate on-chip resource usage estimation, and simultaneously optimizes the accuracy and resource usage of the network in a hardware/software co-design fashion. We begin this section by outlining the fundamentals of quantization and quantization-aware training. Then, we introduce a way to accurately estimate the on-chip resource consumption of a model. Subsequently, we introduce an innovative gradient-based technique for auto-tuning the bitwidths during training. A comprehensive explanation of the HGQ method and its algorithm follows.
A. Quantization
Quantization is a map, henceforth referred to as $f_q$, from the set of real numbers $\mathbb{R}$ to a discrete subset $Q \equiv \{q_i \mid q_{i+1} > q_i\} \subset \mathbb{R}$. For hardware efficiency, we ensure that quantized weights and activations are represented as fixed-point numbers, a common practice in hardware for numerical representation. A fixed-point number can be understood as an integer scaled by a power of two. It is characterized by its bitwidth (total number of bits) and the number of bits allocated for the integer part. The inclusion of the sign bit in the integer part for signed numbers varies by convention. In this context, we adhere to the convention used in AMD® Vivado/Vitis® HLS, which includes the sign bit in the integer part if present. We denote the bitwidth $b \in \mathbb{N}^+$ with $i \in \mathbb{Z}$ bits dedicated to the integer part, and define $f \equiv b - i$ as the number of fractional bits. For a signed fixed-point number, its representable range is $[-2^{i-1},\ 2^{i-1} - 2^{-f}]$ with a step size of $2^{-f}$. For an unsigned fixed-point number, the range is $[0,\ 2^{i} - 2^{-f}]$ with the same step size.
One way of quantizing a real number into a signed fixed-point number, fixed<b,i>, can be expressed as

$$
f_q(x) = \Big(\big([x \cdot 2^{f}] + 2^{b-1}\big) \bmod 2^{b} - 2^{b-1}\Big) \cdot 2^{-f}
= \begin{cases}
[x \cdot 2^{f}] \cdot 2^{-f}, & \text{if } x \in [-2^{i-1},\ 2^{i-1} - 2^{-f}] \\
\text{overflow}, & \text{otherwise},
\end{cases} \qquad (1)
$$

where $[x] \equiv \lfloor x + \varepsilon \rfloor$ with some $\varepsilon \in [0, 1)$ and $f \equiv b - i$. Note that setting $\varepsilon = 1/2$ recovers conventional round-to-nearest rounding with midpoint round-up.
Similarly to the signed case, for an unsigned fixed-point number denoted as ufixed<b,i>, a quantization procedure can be expressed as

$$
f_q(x) = \big([x \cdot 2^{f}] \bmod 2^{b}\big) \cdot 2^{-f}
= \begin{cases}
[x \cdot 2^{f}] \cdot 2^{-f}, & \text{if } x \in [0,\ 2^{i} - 2^{-f}] \\
\text{overflow}, & \text{otherwise}.
\end{cases} \qquad (2)
$$
In Eqs. (1) and (2), "overflow" refers to the case in which the value to be quantized exceeds the representable range of the fixed-point number, which then causes a cyclical wrap of the number to the opposite end of the range. Although a quantization function could be designed to adjust values outside the permissible range to the closest valid value (i.e., clipping them into the range), this approach is avoided in our work to reduce resource and latency overhead. Instead, by selecting an optimal set of quantization parameters, we ensure that all numbers produced during inference fall into the representable range to avoid overflow.
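To make the wrap-around behavior concrete, here is a minimal NumPy sketch (our own illustration, not code from the HGQ library) of the signed quantizer in Eq. (1) with ε = 1/2:

import numpy as np

def quantize_signed(x, b, i, eps=0.5):
    """Signed fixed<b,i> quantizer of Eq. (1) with wrap-around on overflow."""
    f = b - i                                        # number of fractional bits
    step = 2.0 ** -f
    n = np.floor(x / step + eps)                     # [x * 2^f] with rounding offset eps
    n = (n + 2 ** (b - 1)) % 2 ** b - 2 ** (b - 1)   # wrap into [-2^(b-1), 2^(b-1) - 1]
    return n * step

# Example: fixed<4,2> represents [-2.0, 1.75] in steps of 0.25.
print(quantize_signed(1.6, b=4, i=2))  # 1.5  (in range, rounded to the grid)
print(quantize_signed(1.9, b=4, i=2))  # -2.0 (rounds to 2.0, overflows and wraps)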
In our approach, we track only the number of fractional bits $f$ of the fixed-point numbers during training for quantization. Before deploying to hardware, we estimate the required number of integer bits $i$ to avoid overflow. This task is trivial for weights, as their values are fixed after training. For intermediate accumulator and activation values, we employ a calibration dataset to gauge the extremes (both maximum and minimum) the values might assume. This process involves running the dataset through the network and logging the extreme quantized values ($v^{q}_{\min}$, $v^{q}_{\max}$), from which we can determine the necessary integer bitwidth without the sign bit, $\tilde{i}$, using

$$
\tilde{i} = \max\!\big(\lfloor \log_2 |v^{q}_{\max}| \rfloor + 1,\ \lceil \log_2 |v^{q}_{\min}| \rceil\big) \qquad (3)
$$

and obtain the integer bitwidth $i$ by adding back the sign bit when necessary: $i = \tilde{i} + 1$ for signed fixed-point numbers, and $i = \tilde{i}$ for unsigned fixed-point numbers.
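A minimal sketch of this calibration step (our own illustration; the function and variable names are not from the HGQ library) could compute Eq. (3) from the logged extremes as follows:

import numpy as np

def integer_bits(v_q, signed=True):
    """Integer bitwidth from calibration extremes, following Eq. (3).
    v_q: quantized values logged while running the calibration dataset."""
    v_max, v_min = float(np.max(v_q)), float(np.min(v_q))
    terms = []
    if v_max != 0:
        terms.append(np.floor(np.log2(abs(v_max))) + 1)
    if v_min != 0:
        terms.append(np.ceil(np.log2(abs(v_min))))
    i_tilde = max(terms) if terms else 0         # integer bits without the sign bit
    return int(i_tilde) + (1 if signed else 0)   # add the sign bit back if needed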
By ensuring the calibration dataset accurately reflects the input data distribution the network will encounter after deployment, we can avoid overflows at inference time. For extra safety, one may add extra margins to the computed ranges to account for potential outliers in the input data. This method thus eliminates the need to consider the representable ranges of the given quantizer during the training phase, and the quantization function during training can now be expressed as

$$
f_q(x) = [x \cdot 2^{f}] \cdot 2^{-f} = \lfloor x \cdot 2^{f} + \varepsilon \rfloor \cdot 2^{-f}. \qquad (4)
$$

Without loss of generality, we assume $\varepsilon = 1/2$ for the rest of this section and recover the conventional midpoint round-up rounding. This assumption does not affect any of the conclusions drawn in this work.
B. Quantization-Aware Training
Quantization-aware training (QAT) trains neural networks by applying quantization directly during the training phase. Previous works, e.g., [36], demonstrate that QAT significantly mitigates the performance degradation caused by post-training quantization. In this work, we adopt the same QAT scheme utilized in [36] for our HGQ method. Specifically, we employ the straight-through estimator (STE) [62] for quantization of weights and activations, which quantizes the values during the forward pass while acting as an identity for computing the gradients in the backward pass.
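As a minimal TensorFlow sketch (our own illustration, not the QKeras or HGQ implementation), the STE can be combined with the quantizer of Eq. (4) so that the forward pass is quantized while the gradient with respect to the input passes through unchanged:

import tensorflow as tf

def ste_quantize(x, f):
    """Quantize x to f fractional bits (Eq. (4), epsilon = 1/2) with a
    straight-through estimator: forward pass quantized, gradient identity."""
    scale = 2.0 ** f
    x_q = tf.floor(x * scale + 0.5) / scale
    return x + tf.stop_gradient(x_q - x)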
C. FPGA resource consumption estimation
A common metric for estimating on-chip resource usage in FPGAs is Bit Operations (BOPs), proposed in [63]. BOPs quantify the resource consumption by counting the number of bits involved in all operations performed during the network's forward pass. For two numbers declared with bitwidths $b_i$ and $b_j$, the number of BOPs is $b_i \cdot b_j$ for a multiplication operation, and the resultant number's bitwidth for an addition operation. While BOPs can be a good resource indicator in many cases, it falls short in accurately reflecting resource consumption for unrolled neural networks on specialized hardware. The major discrepancies arise from the following two points:

1. Declaring a constant as a fixed-point number of $b$ bits does not necessarily mean that all $b$ bits are used. For instance, a weight of 0.5 in an 8-bit fixed-point format only uses 1 bit instead of 8 bits, and counting it as 8 bits in the BOPs computation leads to an inaccurate resource usage estimation.

2. BOPs tend to overestimate the resource consumption of accumulation operations compared to multiplications. Generally, most of the multiplication operations in neural networks are between a fixed constant and a variable as part of vector dot products. Consider a single multiplication involving two numbers of $b_i$ and $b_j$ bits, where the first number is a constant: when unrolled, this operation is often decomposed on hardware into an accumulation of $(b_i - 1)$ shifted numbers, each of $b_j$ bits. By the BOPs definition, this would be counted as approximately $b_j \cdot (b_i - 1) + b_i^2$ operations in accumulation, which is much greater than $b_i \cdot b_j$ in general.
To address this discrepancy and offer a more precise estimation of on-chip resource usage, we propose a novel metric, Effective Bit Operations (EBOPs). For computing EBOPs, the bitwidth used for constants is not the declared bitwidth, but the number of bits enclosed by non-zero bits in binary form. For instance, a weight represented as 001xx1000 will be counted as 4 bits instead of 8 bits. This approach ensures that the resource consumption is not overestimated by the declared bitwidth. If multiple weights share the same multiplier (e.g., with partial unrolling), the bitwidth of that weight group is defined by the number of bits enclosed by the most and least significant non-zero bits in that weight group. For simplicity, we consider only the absolute values of parameters when computing the bitwidths.

To address the second issue, we let the accumulation of $N$ shifted numbers, each of $b$ bits, be counted as $N \cdot b$ EBOPs. As a result, the EBOPs contributed by a multiplication inside an accumulation chain (e.g., inside a vector dot product) is still the product of the operands' bitwidths, as the accumulation of the resultant number is already implicitly counted.

Hence, EBOPs effectively count only the BOPs conducted during multiplicative processes in a network, with the modified bitwidth definition. Let $M = \{\{i, j\}_n\}$ be the set of all multiplication operations between operands with bitwidths $b_i$ and $b_j$. The total number of EBOPs can then be expressed as

$$
\mathrm{EBOPs} = \sum_{\{i,j\} \in M} b_i \cdot b_j. \qquad (5)
$$

Experimental findings validate EBOPs as a reliable estimator of on-chip resource consumption, which closely mirrors a linear combination of look-up table (LUT) and digital signal processor (DSP) usage. Detailed results are discussed in Section V. To get an accurate resource estimation from EBOPs, one should only include operations that will be executed in parallel. For instance, different inputs fed to the same multiplier through a buffer should be counted only once. Additionally, this estimation does not include overhead from non-multiplication-accumulation processes (e.g., buffers, logic switches, array indexing). For a complete resource usage estimation, one needs to estimate these separately by other means and add the additional resource consumption to the EBOPs estimation.
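To illustrate how EBOPs could be tallied for a single fully unrolled dense layer, the following NumPy sketch (our own simplified illustration, not the HGQ implementation) counts, for each weight, the bits enclosed by its non-zero bits and multiplies them by the bitwidth of the corresponding input:

import numpy as np

def effective_bits(w, f):
    """Bits enclosed by the most and least significant non-zero bits of |w|,
    assuming w is representable on a grid of step 2^-f (f fractional bits)."""
    mant = int(abs(w) * 2 ** f + 0.5)    # integer mantissa of the weight
    if mant == 0:
        return 0                         # zero weight: effectively pruned
    return mant.bit_length() - (mant & -mant).bit_length() + 1

def dense_ebops(weights, f_w, input_bits):
    """EBOPs of a fully unrolled dense layer following Eq. (5):
    sum over all weight-input pairs of effective weight bits x input bitwidth."""
    weights = np.asarray(weights)        # shape (n_in, n_out)
    input_bits = np.asarray(input_bits)  # per-input bitwidths, shape (n_in,)
    total = 0
    for i in range(weights.shape[0]):
        for j in range(weights.shape[1]):
            total += effective_bits(weights[i, j], f_w) * input_bits[i]
    return total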
D. Gradient-based optimization of bitwidths
To obtain a fully unrolled quantized neural network with minimal on-chip resource usage, we want the ability to optimize the bitwidth of each individual weight and activation. However, as the number of bitwidths to be optimized would then exceed the number of trainable parameters in the original network, we propose the use of a gradient-based method to handle this vast parameter space. Nonetheless, direct optimization of these bitwidths via gradients is not possible due to their discreteness and the lack of gradients on them. Therefore, we address two main issues: a) making the discrete bitwidths optimizable with a gradient; and b) estimating surrogate gradients for these bitwidths.
1. Optimize discrete bitwidths with gradient
The first issue can be addressed by treating the discrete bitwidths similarly to the discrete weights in a quantized network. In particular, we store the numbers of fractional bits in floating point, and apply the STE to them as is done for the weights during training. We follow the STE implementation used in QKeras:

$$
\mathrm{ste}(x) = x + \mathrm{sg}([x] - x), \qquad (6)
$$

where the stop-gradient operation $\mathrm{sg}: \mathbb{R} \to \mathbb{R}$ acts as an identity function in the forward pass and as a zero function in the backward pass. In this way, the bitwidths can be optimized if they have gradients attached to them.
2. Surrogate gradient for bitwidths
To address the second issue, we first consider some parameter $x$ (e.g., a weight or activation) in the network and its corresponding quantizer $f_q(\cdot)$. If the number is quantized with $f$ fractional bits, its associated quantization error $\delta_f$ can be expressed as follows:

$$
\delta_f \equiv x - f_q(x) = x - [x \cdot 2^{f}] \cdot 2^{-f}. \qquad (7)
$$

During training, we assume $x$ to be a random variable following some smooth distribution $\mathcal{D}_x$. We further assume that the variance of $\mathcal{D}_x$ is significantly larger than the quantization error $\delta_f$, in such a way that one can view the quantization error's distribution as a uniform distribution:

$$
\delta_f \sim \mathrm{Uniform}(-2^{-f-1},\ 2^{-f-1}). \qquad (8)
$$
Let the loss of the network be $\mathcal{L}$, and express the gradient of $f$ with respect to $\mathcal{L}$ as

$$
\frac{\partial \mathcal{L}}{\partial f} = \frac{\partial \mathcal{L}}{\partial \delta_f} \cdot \frac{\partial \delta_f}{\partial f}. \qquad (9)
$$

In this expression, the first term $\partial \mathcal{L} / \partial \delta_f$ can be obtained trivially with backpropagation. The second term $\partial \delta_f / \partial f$ is not well-defined, as $f$ can only take integer values for a properly defined quantizer and thus has no gradient. As a solution to this, we propose a surrogate gradient method that assigns a gradient to $f$ only on integer values.
We now express the loss as a function of the weights $\theta$ and all the quantization errors $\delta$: $\mathcal{L}(\theta, \delta)$. We further assume that the loss function is sensitive to the magnitude of the quantization errors, but not to their signs, i.e., $\mathcal{L}(\theta, |\delta|)$, with $|\delta|$ being the element-wise absolute value of $\delta$. For a parameter $x \sim \mathcal{D}_x$ to be quantized with $f \in \mathbb{Z}$ fractional bits, the corresponding absolute quantization error is $|\delta_f| \equiv |x - f_q^{f}(x)| \sim \mathrm{Uniform}(0,\ 2^{-f-1})$. By increasing $f$ by one, we obtain the absolute quantization error $|\delta_{f+1}|$ as a function of $f$ and $|\delta_f|$:

$$
|\delta_{f+1}| = \begin{cases}
|\delta_f|, & |\delta_f| \le 2^{-f-2} \\
2^{-f-1} - |\delta_f|, & |\delta_f| > 2^{-f-2}.
\end{cases} \qquad (10)
$$
A straightforward way to obtain the gradient of $|\delta_f|$ with respect to $f$ is to use the finite difference approximation

$$
\frac{\partial |\delta_f|}{\partial f} \leftarrow |\delta_{f+1}| - |\delta_f|. \qquad (11)
$$

However, as the absolute quantization error is bounded by a geometric sequence of $2^{-f-1}$, using a linear difference for the approximation may be suboptimal. Instead, we use the following heuristic expression to approximate the gradient, which recovers Eq. (11) in the limit $|\delta_{f+1}| \to |\delta_f|$:

$$
\frac{\partial |\delta_f|}{\partial f} \leftarrow \log\frac{|\delta_{f+1}|}{|\delta_f|} \cdot |\delta_f|. \qquad (12)
$$
Expressing the ratio of $|\delta_{f+1}|$ to $|\delta_f|$ as a function of $|\delta_f|$, we have

$$
\frac{|\delta_{f+1}|}{|\delta_f|} = \begin{cases}
1, & |\delta_f| \le 2^{-f-2} \\
\dfrac{2^{-f-1}}{|\delta_f|} - 1, & |\delta_f| > 2^{-f-2}.
\end{cases} \qquad (13)
$$

Though one may obtain a surrogate gradient by combining Eq. (12) and Eq. (13), using the local relation between $|\delta_{f+1}|$ and $|\delta_f|$ expressed in Eq. (13) would lead to a loss (gradient) landscape for $f$ with extensive high-frequency components that is hard to optimize. To mitigate this issue, we smooth out the loss (gradient) landscape by taking the expectation of the first term of Eq. (12) over $|\delta_f| \sim \mathrm{Uniform}(0,\ 2^{-f-1})$:

$$
\mathbb{E}_{|\delta_f|}\!\left[\log\frac{|\delta_{f+1}|}{|\delta_f|}\right] = -\log 2. \qquad (14)
$$
(14)
By substituting Eq. (14) into Eq. (12), and add a
sign(
δ
f
) term on both hand sides, we have
∂δ
f
∂f
←−
log 2
·
δ
f
.
(15)
Hence, the forward pass of the quantizer, with respect
to one input value
x
and its fractional bitwidth
f
, can be
expressed as in Algorithm 1. The backward pass is the
auto-differentiation of the forward pass with the stop-
gradient operations.
Algorithm 1: Quantizer forward pass
Data: x: the input value; f_fp: the fractional bitwidth, stored in floating point
Result: x_q: the differentiable, quantized value of x with fractional bitwidth f
  f ← ste(f_fp);
  x_q ← sg([x · 2^f] · 2^(−f));
  δ ← sg(x − x_q);
  δ ← sg(δ + ln 2 · f · δ) − ln 2 · f · δ;
  x_q ← x − δ;
  return x_q
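A minimal TensorFlow rendering of Algorithm 1 (our own sketch, not the HGQ library code; midpoint round-up, i.e. ε = 1/2, is assumed) could look as follows:

import numpy as np
import tensorflow as tf

sg = tf.stop_gradient  # identity in the forward pass, zero gradient in the backward pass

def hgq_quantize(x, f_fp):
    """Differentiable quantizer following Algorithm 1.
    x:    input tensor to quantize
    f_fp: fractional bitwidth stored as a trainable float tensor"""
    f = f_fp + sg(tf.round(f_fp) - f_fp)         # f <- ste(f_fp)
    scale = 2.0 ** f
    x_q = sg(tf.floor(x * scale + 0.5) / scale)  # x_q <- sg([x * 2^f] * 2^-f)
    delta = sg(x - x_q)                          # quantization error, no gradient attached
    ln2 = tf.constant(np.log(2.0), dtype=x.dtype)
    # Attach the surrogate gradient of Eq. (15): d(delta)/df = -ln2 * delta
    delta = sg(delta + ln2 * f * delta) - ln2 * f * delta
    return x - delta                             # forward value equals x_q; STE w.r.t. x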
As quantization results in higher loss values in general, the gradients propagated from the loss function to the bitwidths tend to increase them. To optimize for on-chip resource usage and latency, we introduce regularization terms that encourage smaller bitwidths.
The EBOPs metric introduced in Section III C provides a good resource estimate. However, as it involves non-differentiable bit-counting for the weights and requires the min/max of the intermediate values in the network to be known, it cannot be used directly during training. Instead, we use $\widehat{\mathrm{EBOPs}}$, an approximate form of EBOPs computed with estimated bitwidths, as the regularization term during training. In particular, we use $\max(i + f, 0)$ as the bitwidths for both weights and biases when evaluating $\widehat{\mathrm{EBOPs}}$ during training.

To evaluate the integer bitwidth without the sign bit, $\tilde{i}$, during training for an activation's bitwidth, we utilize the min/max values realized by the corresponding activations within the same epoch, and evaluate $\tilde{i}$ with Eq. (3). For the weights, $\tilde{i}$ is also evaluated with Eq. (3), but with the min/max values being the minimum and maximum of the weights corresponding to it. With $f$ directly available during training, we can evaluate the approximate bitwidths and compute $\widehat{\mathrm{EBOPs}}$ at each training step. Indeed, $\widehat{\mathrm{EBOPs}}$ is an upper bound of EBOPs if the min/max values used are accurate, as $f$ serves as an upper bound of the actual number of fractional bits enclosed by non-zero bits.

$\widehat{\mathrm{EBOPs}}$ is incorporated into the loss function as a regularization term with a coefficient $\beta \in \mathbb{R}^+$ to balance the trade-off between the model performance and on-chip resource usage. Moreover, as there are values in networks that are not involved in any multiplicative operations, such as the last layer's outputs or inputs to non-linear activations, we apply an additional L1 regularization with a coefficient $\gamma \in \mathbb{R}^+$ to the bitwidths to keep them from growing indefinitely and consuming excessive resources. Hence, the final loss function is given by

$$
\mathcal{L} = \mathcal{L}_{\mathrm{base}} + \beta \cdot \widehat{\mathrm{EBOPs}} + \gamma \cdot \mathrm{L1}_{\mathrm{norm}}, \qquad (16)
$$

with the surrogate gradients from the loss function directly attached to the bitwidths as described in Algorithm 1.

As all additional gradients introduced in this section apply only to the bitwidths, the loss landscape of the network's weights remains unperturbed compared to that of networks with static quantization parameters.
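As a compact illustration (our own sketch; the HGQ library assembles these terms internally), the loss of Eq. (16) combines the task loss with the two regularization terms:

def hgq_loss(base_loss, ebops_hat, bitwidth_l1, beta, gamma):
    # Eq. (16): task loss plus resource-aware regularization on the bitwidths.
    return base_loss + beta * ebops_hat + gamma * bitwidth_l1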
3. Gradient for bitwidths with multiple parameters
Denote by a parameter group, $g$, the collection of parameters sharing the same bitwidth. In experiments, we noticed that if we increase the size of a parameter group while keeping the same $\beta$, the corresponding bitwidth is more likely to collapse to zero. To mitigate this, we normalize the gradient from the regularization terms on $f$ by $1/\sqrt{\|g\|}$, based on empirical observations. Here, $\|g\|$ denotes the number of parameters in $g$. This normalization makes the optimization more stable with respect to the size of the parameter groups.
4. Connection to Pruning
From Eq. (4), it can be observed that the quantized value is constantly zero if $-\varepsilon \cdot 2^{-f} \le x < (1 - \varepsilon) \cdot 2^{-f}$, or equivalently, $|x| < 2^{-f-1}$ when $\varepsilon = 1/2$. As $f \in \mathbb{Z}$, a sufficiently small $f$ will cause the corresponding parameters in the network to be constantly zero, which is equivalent to having those parameters pruned. Assigning a distinct bitwidth to each parameter group in the network through HGQ thus automatically prunes the network during training in a way that takes both model performance and resource consumption into account. When the granularity for quantization is set to per-parameter, fully unstructured pruning is performed automatically.
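For example, with $\varepsilon = 1/2$ and $f = 1$, Eq. (4) maps every parameter with $|x| < 2^{-2} = 0.25$ to exactly zero, so such a weight contributes no multiplication logic and no EBOPs, i.e., it is effectively pruned.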
Listing 1. Keras model example

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inp = Input((16,))
out = Dense(64, activation='relu')(inp)
out = Dense(32, activation='relu')(out)
out = Dense(32, activation='relu')(out)
out = Dense(5, activation='linear')(out)
keras_model = Model(inp, out)
Listing 2. HGQ model example

from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from HGQ import HQuantize, HDense

beta = 3e-6  # example EBOPs regularization strength
inp = Input((16,))
out = HQuantize(name='inp_q', beta=beta)(inp)
out = HDense(64, activation='relu', beta=beta)(out)
out = HDense(32, activation='relu', beta=beta)(out)
out = HDense(32, activation='relu', beta=beta)(out)
out = HDense(5, activation='linear', beta=beta)(out)
hgq_model = Model(inp, out)
IV. THE HIGH GRANULARITY
QUANTIZATION FRAMEWORK
The HGQ algorithm is available as a user-friendly Python library similar to QKeras. It functions as an advanced quantization API built on top of Keras, while leveraging hls4ml for the downstream model deployment on chips. This framework facilitates automatic conversion of Keras models into hls4ml models, ensuring bit accuracy as per the specifications of a user-defined dataset without requiring any further manual configuration.

HGQ is engineered to carry out automatic quantization on all compatible layers according to the EBOPs regularization factor, $\beta$, and the L1 regularization factor, $\gamma$. This approach eliminates the necessity for users to fine-tune quantization parameters for individual modules or undergo multiple training cycles to identify the best quantization scheme.

The HGQ framework provides drop-in replacements for the most commonly used Keras layers, making it straightforward to rewrite a standard Keras model to
an HGQ model with minimal adjustments. For instance, as demonstrated in Listings 1 and 2, converting a Keras model to its HGQ counterpart primarily involves substituting existing layers with their HGQ alternatives, along with the inclusion of an additional layer to quantize the input values. The HGQ framework provides two categories of layers: Heterogeneous (H-) layers, which accept an additional parameter, beta, to manage the layer's resource usage regularization strength based on EBOPs, and Passive (P-) layers, which serve to relay metadata without performing quantization. The H- layers also allow for layer-specific kernel and activation quantizer configurations for more fine-grained control. Though manual bitwidth configuration should not be required in most cases, the user may still opt to specify the bitwidths for specific layers if necessary.
Beyond quantization-aware training, the framework introduces a convenient intermediate model representation, the "proxy model", for converting a trained Keras model to an hls4ml project. This feature accommodates both HGQ and QKeras models, automating the creation and enforcement of hls4ml's quantization configurations for precise conversions. Furthermore, the proxy model facilitates bit-accurate emulation of the compiled hls4ml model, aiding in debugging and validating the hls4ml model's performance before deployment. As this emulation correctly models the overflow behavior of the fixed-point numbers, it remains accurate in case of overflows due to limited bitwidths. However, when the intermediate values are quantized with a high bitwidth, the emulation may have errors at the machine-epsilon level due to the use of floating-point numbers in the emulation.
V. RESULTS
To evaluate the performance of the HGQ method, we train and evaluate models on two classification tasks – one from a physics experiment and one from computer vision – and one regression task from a physics experiment: jet tagging at the LHC [36], SVHN digit classification [64], and muon tracking at the LHC [65], respectively.

To demonstrate the trade-off between the accuracy (or resolution for the regression task) and resource usage of the models, we methodically adjusted the $\beta$ factor for each task during training to map out the Pareto fronts. For each training run, we initialize all layers with a notably small $\beta$, which is then gradually increased through the training. Meanwhile, we kept the $\gamma$ value fixed at 2.e-6 for all experiments to avert the risk of diverging bitwidths for some parameters.

After each epoch, we record the validation accuracy (or resolution) and $\widehat{\mathrm{EBOPs}}$, and keep all model checkpoints that are on the Pareto front defined by these two metrics. Post-training, we use the entire training and validation sets as the calibration dataset to determine the required bitwidths and evaluate the exact EBOPs for all checkpointed models. Subsequently, we compute the test accuracy (or resolution) for all the models, and then obtain their on-chip resource consumption after performing the place-and-route phase with Vivado/Vitis®.
A. Resource Estimation via EBOPs
We first demonstrate that EBOPs is a good estimator of on-chip resource consumption. We consider the following types of major resources on an AMD® FPGA chip: flip-flops (FFs, sometimes referred to as registers), LUTs, DSPs, and on-board memories (BRAMs and URAMs). When designing an unrolled neural network for ultra-low latency applications like the hardware triggers of LHC experiments, the limiting resources are usually either LUTs or DSPs. Empirically, for models synthesized with Vivado/Vitis® HLS, operations involving larger bitwidths are more likely to consume DSPs, while operations with smaller bitwidths are more likely to consume LUTs. In our experiments, we observed that EBOPs roughly predict a linear combination of the LUT and DSP consumption, namely, EBOPs ≈ LUT + 55 × DSP, for models synthesized with parallel IO, i.e., intermediate values in the model are directly wired between layers/modules with no extra buffer in between.

In Figure II, we demonstrate this relationship between EBOPs and the actual on-chip resource consumption. Data points shown in this figure are from the models presented later in this section for the aforementioned three tasks. Although the relationship is not exact, we can still make a reasonable estimation of the resource usage based on EBOPs. This linear relation also suggests that treating one DSP as approximately 55 LUTs could be a practical approximation when comparing resource usage across different models. It is important to note that EBOPs primarily account for vector dot product-like operations between constants and variables. Therefore, if other kinds of operations significantly contribute to the on-chip resource consumption, EBOPs will underestimate the overall resource consumption. For instance, the SVHN classifier models shown in Figure II, synthesized with stream IO, which requires additional buffers for intermediate values, have higher actual resource consumption than EBOPs predicts.
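For example, for the jet tagging model HGQ-1 in Table I, this combined metric evaluates to 6,236 + 55 × 34 ≈ 8.1k, the quantity plotted on the vertical axis of Figure II.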
B. Jet Classification at the LHC
We conducted a comparison of the classification accuracy, latency, and on-chip resource utilization of models trained with HGQ against various quantized models from earlier research on this task.

We use the dataset from [68] to classify jets – collimated showers of particles from quark and gluon decays at collider physics experiments – into five classes based on their originating particle: single quark (q), single gluon (g), W and Z bosons decaying to two quarks, and top (t) quark decaying to two quarks and a heavier bottom quark.
TABLE I. Accuracy, resource consumption, latency, and initiation intervals (IIs) of the jet tagging models. Resources reported for HGQ models are after place-and-route on an AMD® Virtex® UltraScale+ XCVU9P FPGA. HGQ models outperform the baseline models by a large margin in accuracy, resource consumption, and latency.

Model | Accuracy (%) | Latency (cc) | DSP (%) | LUT (%) | FF (%) | II (cc)
BF [36] | 74.4 | 9 (45 ns) | 56.0 (1,826) | 4.09 (48,321) | 0.8 (20,132) | 1
BP [36] | 74.8 | 14 (70 ns) | 7.7 (526) | 1.49 (17,577) | 0.4 (10,548) | 1
BH [36] | 73.2 | 14 (70 ns) | 1.3 (88) | 1.34 (15,802) | 0.3 (8,108) | 1
Q6 [36] | 74.8 | 11 (55 ns) | 1.8 (124) | 3.36 (39,782) | 0.3 (8,128) | 1
QE [36] | 72.3 | 11 (55 ns) | 1.0 (66) | 0.77 (9,149) | 0.1 (1,781) | 1
QB [36] | 71.9 | 14 (70 ns) | 1.0 (69) | 0.95 (11,193) | 0.1 (1,771) | 1
LogicNets JSC-M [66] | 70.6 | N/A | 0 (0) | 1.22 (14,428) | 0.02 (440) | 1
LogicNets JSC-L [66] | 71.8 | 5 (13 ns) | 0 (0) | 3.21 (37,931) | 0.03 (810) | 1
BP-DSP-RF=2 [45] | 76.3 | 21 (105 ns) | 2.6 (175) | 0.47 (5,504) | 0.13 (3,036) | 2
MetaML-αq=1% [37] | 75.6 | 9 (45 ns) | 0.7 (50) | 0.57 (6,698) | N/A | 1
MetaML-αq=4% [37] | 72.8 | 8 (40 ns) | 0.2 (23) | 0.57 (7,224) | N/A | 1
SymbolNet [67] | 71. | 2 (10 ns) | <0.1 (3) | 0.01 (177) | <0.01 (109) | 1
HGQ-1 | 76.4 | 6 (30 ns) | 0.50 (34) | 0.53 (6,236) | 0.05 (1,253) | 1
HGQ-2 | 75.9 | 4 (20 ns) | 0.09 (6) | 0.27 (3,162) | 0.02 (550) | 1
HGQ-3 | 75.0 | 4 (20 ns) | 0.07 (5) | 0.13 (1,540) | 0.02 (370) | 1
HGQ-4 | 73.9 | 3 (15 ns) | 0.00 (0) | 0.05 (565) | 0.01 (140) | 1
HGQ-5 | 72.5 | 2 (10 ns) | 0.00 (0) | 0.04 (468) | 0.01 (131) | 1
HGQ-6 | 71.0 | 2 (10 ns) | 0.00 (0) | 0.02 (256) | 0.00 (66) | 1
HGQ-c1 | 76.3 | 8 (40 ns) | 0.26 (18) | 0.50 (5,899) | 0.09 (2,072) | 1
HGQ-c2 | 74.2 | 3 (15 ns) | 0.00 (0) | 0.06 (678) | 0.01 (172) | 1
FIG. II. The relationship between EBOPs and the post place-and-route resource consumption (LUT + 55 × DSP) for the jet classifier, SVHN classifier, and muon tracker models. Data points shown in this figure are from models presented later in this section for the three tasks. EBOPs roughly predict a linear combination of the LUT and DSP consumption for models synthesized with parallel IO.
The inputs for each jet are 16 scalar values representing physics-motivated high-level features. The model architecture employed is based on the full-precision baseline model described in the original work [36], which is a 4-layer fully connected neural network.
FIG. III. Accuracy versus resource consumption (LUT + 55 × DSP) of the jet tagging models. Note that models with different DSP and LUT usage could land on the same point on this plot due to the linear combination of DSPs and LUTs.
We summarize the performance and resource usage of all models we compared in Table I and visualize them in Figure III. The following models are cited from [36]: Baseline Full (BF), Baseline Pruned (BP), Baseline Heterogeneous (BH), Quantized 6-bit (Q6), AutoQKeras Energy Optimized (QE), and AutoQKeras Bits Optimized (QB). All of these models, except for BF and BP, are trained quantization-aware. Hyperparameter optimization with a Gaussian Process is applied to the AutoQKeras models to achieve low resource consumption. LogicNets JSC-M and JSC-L are cited from [66], where the networks are co-designed to use on-chip LUTs efficiently. BP-DSP-RF=2 [45] is a neural network implemented in QKeras with a reuse factor (i.e., how many times a multiplier logic unit may be used for inferencing one sample) of two, which is pruned to reduce DSP usage while preserving accuracy by formulating the trade-off as a knapsack problem. For MetaML-αq=1% and MetaML-αq=4% [37], iterative searches through model architectures and quantization/pruning configurations are performed to achieve better accuracy-resource trade-offs. SymbolNet [67] leverages a gradient-based method for neural symbolic regression. It also uses an adaptive dynamic pruning scheme to reduce on-chip resource consumption while maintaining accuracy.
The HGQ trained models, HGQ-1 through HGQ-6, are taken from the same training run in which $\beta$ is gradually increased. The model is initialized with 2 fractional bits for the activations, and a bitwidth of 2 excluding the sign bit for the weights. This model is fully unrolled, and per-parameter quantization is applied. Throughout the training process of 300,000 epochs, $\beta$ is gradually increased from $10^{-6}$ to $10^{-4}$. Due to the model's compact size, the entire training completes in about 4 hours on a modern consumer GPU with a batch size of 33,200.
As shown in Figure III and Table I, the HGQ approach outperforms all previous works on quantized neural networks by significant margins, both in terms of model accuracy and resource usage. Depending on the working point, HGQ may reduce the resource consumption by 50% to 95% while maintaining the same accuracy. When working with a lower accuracy requirement, HGQ can also achieve resource consumption similar to that of an optimized symbolic classifier.

We also studied the performance of HGQ models trained with fixed $\beta$ values. In Figure III and Table I, these correspond to HGQ-c1 and HGQ-c2, which are trained with fixed $\beta$'s of 2.1e-6 and 1.2e-5, respectively. Both models are trained for 5,000 epochs with the same batch size. By comparing with the aforementioned HGQ models, we observe that models trained with either a constant or an increasing $\beta$ value achieve a comparable balance between accuracy and resource consumption. This suggests that a lengthy training process with a gradually increasing $\beta$ value is not always necessary for HGQ to obtain optimal trade-offs between accuracy and resource efficiency.
C. SVHN Classifier
We also benchmark HGQ on a computer vision task and compare it to previous state-of-the-art works [45, 64] on real-time inference. We make use of the SVHN dataset [69], which consists of 32×32 RGB images of house numbers taken from Google Street View; the task is to classify the digit in the center of the image into one of ten classes. The architecture of the model is a LeNet-like [70] convolutional-dense network taken from [64].

FIG. IV. Accuracy versus resource consumption (LUT + 55 × DSP) of the SVHN classifier models. Note that models with different DSP and LUT consumption could land on the same point on this plot due to taking a linear combination of DSPs and LUTs.
We summarize the performance and resource usage of all models we compared in Table II and visualize them in Figure IV. In the table and figure, AutoQKeras Pruned (AQP), AutoQKeras (AQ), QKeras Pruned 7-bit (QP 7-bit), QKeras 7-bit (Q 7-bit), and Baseline Pruned (BP 14-bit) are taken from [64]. All of these models except BP are trained quantization-aware with QKeras. In particular, AQP, QP, and BP are pruned to a sparsity of 50% iteratively with a magnitude-based method during training. AQP and AQ are heterogeneously quantized models, where the quantization configurations are optimized through AutoQKeras's hyperparameter tuner with a Gaussian Process. BP-DSP-RF=3 is cited from [45], where the network is implemented in QKeras with a reuse factor of three, and the trade-off between accuracy and DSP usage is formulated as a knapsack problem to perform optimal pruning.
The HGQ trained models, HGQ-1 through HGQ-6, are taken from a single training run during which the $\beta$ value is gradually increased. For training, we initialize the model with 6 fractional bits for the activations, and a bitwidth of 6 excluding the sign bit for the weights. The $\beta$ value is systematically increased from $10^{-7}$ to $10^{-4}$ over approximately 12,000 epochs. Completing this training process requires about 10 hours on a modern consumer GPU with a batch size of 2,048.
TABLE II. Accuracy, resource usage, latency, and initiation intervals of the SVHN classifier models. Reported resource usage for HGQ models is after place-and-route on an AMD® Virtex® UltraScale+ XCVU9P FPGA. HGQ models outperform the baseline models in both accuracy and resource consumption while maintaining comparable latency.

Model | Accuracy (%) | Latency (cc) | DSP (%) | LUT (%) | FF (%) | BRAM (%) | II (cc)
BP 14-bit [64] | 93. | 1,035 (5.18 μs) | 48.85 (3,341) | 12.27 (145,089) | 2.77 (65,482) | 3.08 (66.5) | 1,030
Q 7-bit [64] | 94. | 1,034 (5.17 μs) | 2.56 (175) | 12.77 (150,981) | 1.51 (35,628) | 3.10 (67.0) | 1,029
QP 7-bit [64] | 94. | 1,035 (5.18 μs) | 2.54 (174) | 9.40 (111,152) | 1.38 (32,554) | 3.10 (67.0) | 1,030
AQ [64] | 88. | 1,059 (5.30 μs) | 1.05 (72) | 4.06 (48,027) | 0.64 (15,242) | 1.48 (32.5) | 1,029
AQP [64] | 88. | 1,059 (5.30 μs) | 1.02 (70) | 3.28 (38,795) | 0.63 (14,802) | 1.39 (30.5) | 1,029
BP-DSP-RF=3 [45] | 92. | ? (43.58 μs) | 17.76 (1,215) | 5.01 (59,279) | 1.97 (46,584) | 35.88 (1,550) | 35.88
HGQ-1 | 93.9 | 1,050 (5.25 μs) | 0.85 (58) | 5.87 (69,407) | 1.18 (27,853) | 1.48 (32.0) | 1,029
HGQ-2 | 93.1 | 1,061 (5.31 μs) | 0.44 (30) | 4.00 (47,314) | 0.87 (20,582) | 1.30 (28.0) | 1,029
HGQ-3 | 91.9 | 1,058 (5.29 μs) | 0.22 (15) | 3.39 (40,032) | 0.76 (18,087) | 1.09 (23.5) | 1,029
HGQ-4 | 90.9 | 1,059 (5.30 μs) | 0.19 (13) | 2.91 (34,435) | 0.73 (17,261) | 1.04 (22.5) | 1,029
HGQ-5 | 89.9 | 1,056 (5.28 μs) | 0.15 (10) | 2.60 (30,766) | 0.64 (15,205) | 0.97 (21.0) | 1,029
HGQ-6 | 88.8 | 1,056 (5.28 μs) | 0.09 (6) | 2.37 (27,982) | 0.62 (14,736) | 0.97 (21.0) | 1,029
As this model is too large to fit on-chip if fully unrolled, we use the stream IO implementation in hls4ml. This partitions the convolutional layers into smaller blocks by individual kernel operations (i.e., partitioned by rows in the im2col algorithm [71]) and computes them one at a time at inference time [64]. Due to limitations of the current implementation, intra-layer heterogeneous activation quantization cannot be utilized with stream IO. Hence, while the weights are quantized at per-parameter granularity, activations are quantized in layer-wise blocks. Nevertheless, HGQ still outperforms both baselines by a considerable margin of up to 40% in resource savings while maintaining similar accuracy and latency.
D. Muon Tracker
For this task, we compare the resolution, latency, and on-chip resource consumption of the HGQ trained models to the models presented in [65] on a regression task proposed in the same work. The task involves predicting the incidence angle of a simulated muon track in a particle detector. The inputs are one 3×50 and two 3×50 binary-valued arrays, representing the hits recorded in three detector stations. The output is a single scalar value representing the angle in milliradians. We evaluate the network's performance in resolution, defined by the root mean square of the angle's reconstruction errors. Following the same approach as in [65], we exclude outliers where the absolute error is greater than 30 milliradians. The architecture of the model is a multistage neural network taken from the original work.
The results, including the performance and resource consumption of the models trained with HGQ and the models proposed in the original work, are presented in Table III and visualized in Figure V. The Quantized with * fractional bits (Qf*) models presented in [65] are all trained quantization-aware with QKeras using manually tuned parameters, where * stands for the number of fractional bits used for all network parameters.

The HGQ trained models, HGQ-1 through HGQ-6, are taken from a single training run during which the $\beta$ value is gradually increased. We initialize the model with 6 fractional bits for the activations, and a bitwidth of 6 excluding the sign bit for the weights. The model is fully unrolled, and quantization is applied at per-parameter granularity. The $\beta$ value is systematically increased from 3.e-6 to 6.e-4 over approximately 600,000 epochs, which takes about 16 hours on a modern consumer GPU with a batch size of 16,384.

The HGQ models consistently outperform the baseline models with a reduction in resource consumption of 40-50%, while achieving the same or better resolution with comparable latency.
FIG. V. Resolution versus resource consumption (LUT + 55 × DSP) of the muon tracking models. Note that models with different DSP and LUT consumption could land on the same point on this plot as a result of taking the linear combination of DSPs and LUTs.
TABLE III. Resolution, resource consumption, latency, and initiation intervals of the muon tracker models. The resource usage reported for HGQ models is after place-and-route on an AMD® Virtex® UltraScale+ XCVU13P FPGA. HGQ models outperform the baseline models in both accuracy and resource consumption for this task while maintaining comparable latency.

Model | Resolution (mrad) | Latency (cc) | DSP (%) | LUT (%) | FF (%) | BRAM (%) | II (cc)
Qf8 [65] | 1.95 | 17 (106.3 ns) | 14.34 (1,762) | 2.19 (37,867) | 0.24 (8,443) | 1.40 (37.5) | 1
Qf7 [65] | 1.97 | 11 (68.8 ns) | 11.30 (1,389) | 2.02 (34,848) | 0.16 (5,433) | 1.40 (37.5) | 1
Qf6 [65] | 2.04 | 13 (81.3 ns) | 2.64 (324) | 3.16 (54,638) | 0.19 (6,525) | 1.40 (37.5) | 1
Qf5 [65] | 2.15 | 11 (68.8 ns) | 0.72 (88) | 2.32 (40,039) | 0.10 (3,419) | 1.40 (37.5) | 1
Qf4 [65] | 2.45 | 10 (62.5 ns) | 0.20 (24) | 1.65 (28,526) | 0.09 (2,954) | 1.40 (37.5) | 1
Qf3 [65] | 2.78 | 9 (56.3 ns) | 0.02 (2) | 1.25 (21,682) | 0.06 (2,242) | 1.40 (37.5) | 1
HGQ-1 | 1.95 | 11 (68.8 ns) | 4.25 (522) | 2.28 (39,413) | 0.17 (6,043) | 0.93 (25.0) | 1
HGQ-2 | 2.00 | 11 (68.8 ns) | 1.25 (154) | 1.99 (34,460) | 0.15 (5,263) | 0.93 (25.0) | 1
HGQ-3 | 2.09 | 12 (75.0 ns) | 0.55 (68) | 1.44 (24,941) | 0.14 (4,677) | 1.40 (37.5) | 1
HGQ-4 | 2.20 | 13 (81.3 ns) | 0.33 (41) | 1.25 (21,557) | 0.14 (4,699) | 1.40 (37.5) | 1
HGQ-5 | 2.39 | 10 (62.5 ns) | 0.22 (27) | 0.98 (16,918) | 0.07 (2,484) | 1.40 (37.5) | 1
HGQ-6 | 2.63 | 12 (75.0 ns) | 0.08 (10) | 0.77 (13,306) | 0.10 (3,429) | 0.93 (25.0) | 1
VI. CONCLUSION AND FUTURE WORK
In this work, we present HGQ, a novel method to optimize quantized neural networks for real-time applications on FPGAs and possibly also ASICs. The HGQ approach enables the optimization of the quantization bitwidths at arbitrary granularity, up to the per-parameter level, through a gradient-based approach that is conscious of both resource usage and loss minimization. By maximally leveraging the ability of the specialized hardware to perform fully heterogeneous computations, we are able to minimize the resource consumption of the models while maintaining the model performance. In particular, our findings show that HGQ achieves up to a 95% reduction in resource consumption compared to leading compression techniques on FPGAs without performance degradation. We further demonstrate that a single training session with HGQ is sufficient to explore a broad spectrum of trade-offs between model performance and resource utilization, efficiently recovering the Pareto frontier, thereby rendering the model optimization process both more efficient and effective. Moreover, we introduce EBOPs, a metric providing an accurate estimation of the final on-chip resource consumption of a model as a linear combination of LUTs and DSPs, allowing for efficient software-hardware co-design.

To facilitate adoption, we have developed a user-friendly library that simplifies the application of this method. The library offers an easy-to-use interface for defining and training quantized neural networks with our method. Through interfacing with hls4ml, HGQ provides bit-accurate conversions from software to FPGA firmware models without the need for manual intervention, significantly simplifying and streamlining the workflow from training to deployment.

We look forward to developing new neural-network-based triggers for the CERN LHC experiments with the HGQ+hls4ml workflow for the upcoming data-taking period. With the increased hardware efficiency, we hope to enable more complex models to be deployed on the trigger system, which could lead to more accurate trigger decisions. For future improvements of this method, we hope to develop a differentiable latency estimator for the models. Though lower bitwidths generally result in lower latencies, this relation does not hold in some cases, e.g., when the HLS backend switches between DSP- and LUT-based arithmetic implementations. We would also like to explore the possibility of having separate LUT and DSP consumption estimators, as the resource constraints on the two are not always interchangeable depending on the specific application.
VII. ACKNOWLEDGEMENTS
C.S. is partially supported by the Caltech Danny Koh
grad fellowship. C.S. acknowledges partial support from
G ̈unther Dissertori. C.S. and M.S. acknowledge partial
support from the U.S. Department of Energy (DOE), Of-
fice of Science, Office of High Energy Physics grant DE-
SC0011925. T.
̊
A. is supported by the Swiss National Sci-
ence Foundation Grant No. PZ00P2
201594. J.N., M.S.,
and C.S. are partially supported by the U.S. Department
of Energy (DOE), Office of Science, Office of High En-
ergy Physics “Designing efficient edge AI with physics
phenomena” Project (DE-FOA-0002705). J.N. is par-
tially supported by the AI2050 program at Schmidt Fu-
tures (Grant G-23-64934). V.L. is supported by the NSF
Institute for Accelerated AI Algorithms for Data-Driven
Discovery (A3D3), under the NSF grant #PHY-2117997.
[1] Singh, R. & Gill, S. S. Edge AI: A survey. Internet of Things and Cyber-Physical Systems 3, 71–92 (2023). URL https://www.sciencedirect.com/science/article/pii/S2667345223000196.
[2] Niu, W. et al. GRIM: A general, real-time deep learning inference framework for mobile devices based on fine-grained structured weight sparsity. IEEE Trans. Pattern Anal. Mach. Intell. 44, 6224–6239 (2022). URL https://doi.org/10.1109/TPAMI.2021.3089687.
[3] Huang, K. & Gao, W. Real-time neural network inference on extremely weak devices: agile offloading with explainable AI. In Proceedings of the 28th Annual International Conference on Mobile Computing And Networking, MobiCom '22, 200–213 (Association for Computing Machinery, New York, NY, USA, 2022). URL https://doi.org/10.1145/3495243.3560551.
[4] Yang, Y. et al. StreamVC: Real-time low-latency voice conversion (2024). URL https://google-research.github.io/seanet/stream_vc/.
[5] The LHC Study Group. The Large Hadron Collider, Conceptual Design. Tech. Rep., CERN/AC/95-05 (LHC), Geneva (1995).
[6] The CMS Collaboration. The Phase-2 Upgrade of the CMS Level-1 Trigger. Tech. Rep., CERN, Geneva (2020). URL https://cds.cern.ch/record/2714892. Final version.
[7] The ATLAS Collaboration. Technical Design Report for the Phase-II Upgrade of the ATLAS TDAQ System. Tech. Rep., CERN, Geneva (2017). URL https://cds.cern.ch/record/2285584.
[8] Zurbano Fernandez, I. et al. High-Luminosity Large Hadron Collider (HL-LHC): Technical design report. CERN Yellow Reports: Monographs 10/2020 (2020).
[9] Menghani, G. Efficient deep learning: A survey on making deep learning models smaller, faster, and better. ACM Computing Surveys 55, 1–37 (2021). URL https://api.semanticscholar.org/CorpusID:235446458.
[10] Li, Z., Li, H. & Meng, L. Model compression for deep neural networks: A survey. Computers 12 (2023). URL https://www.mdpi.com/2073-431X/12/3/60.
[11] Abadi, M. et al. TensorFlow: Large-scale machine learning on heterogeneous systems (2015). URL https://www.tensorflow.org/. Software available from tensorflow.org.
[12] Chollet, F. et al. Keras. https://keras.io (2015).
[13] Duarte, J. et al. Fast inference of deep neural networks in FPGAs for particle physics. Journal of Instrumentation 13, P07027 (2018). URL https://doi.org/10.1088/1748-0221/13/07/p07027.
[14] https://github.com/fastmachinelearning/hls4ml.
[15] https://github.com/calad0i/HGQ.
[16] Zhou, S. et al. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR abs/1606.06160 (2016). URL http://arxiv.org/abs/1606.06160. 1606.06160.
[17] Lin, X., Zhao, C. & Pan, W. Towards accurate binary convolutional neural network. In Guyon, I. et al. (eds.) Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Inc., 2017). URL https://proceedings.neurips.cc/paper_files/paper/2017/file/b1a59b315fc9a3002ce38bbe070ec3f5-Paper.pdf.
[18] Courbariaux, M., Bengio, Y. & David, J. BinaryConnect: Training deep neural networks with binary weights during propagations. CoRR abs/1511.00363 (2015). URL http://arxiv.org/abs/1511.00363. 1511.00363.
[19] Rastegari, M., Ordonez, V., Redmon, J. & Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Leibe, B., Matas, J., Sebe, N. & Welling, M. (eds.) Computer Vision – ECCV 2016, 525–542 (Springer International Publishing, Cham, 2016).
[20] Li, F., Liu, B., Wang, X., Zhang, B. & Yan, J. Ternary weight networks (2022). 1605.04711.
[21] Zhu, C., Han, S., Mao, H. & Dally, W. J. Trained ternary quantization (2017). 1612.01064.
[22] He, Z. & Fan, D. Simultaneously optimizing weight and quantizer of ternary neural network using truncated Gaussian approximation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11430–11438 (2018).
[23] Xu, C. et al. Alternating multi-bit quantization for recurrent neural networks. CoRR abs/1802.00150 (2018). URL http://arxiv.org/abs/1802.00150. 1802.00150.
[24] Guo, Y., Yao, A., Zhao, H. & Chen, Y. Network sketching: Exploiting binary structure in deep CNNs. CoRR abs/1706.02021 (2017). URL http://arxiv.org/abs/1706.02021. 1706.02021.
[25] Zhang, D., Yang, J., Ye, D. & Hua, G. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. CoRR abs/1807.10029 (2018). URL http://arxiv.org/abs/1807.10029. 1807.10029.
[26] Qu, Z., Zhou, Z., Cheng, Y. & Thiele, L. Adaptive loss-aware quantization for multi-bit networks. CoRR abs/1912.08883 (2019). URL http://arxiv.org/abs/1912.08883. 1912.08883.
[27] Chang, S.-E. et al. Mix and match: A novel FPGA-centric deep neural network quantization framework. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 208–220 (2021).
[28] Wang, K., Liu, Z., Lin, Y., Lin, J. & Han, S. Hardware-centric AutoML for mixed-precision quantization. International Journal of Computer Vision 128, 2035–2048 (2020). URL https://doi.org/10.1007/s11263-020-01339-6.
[29] Lou, Q., Guo, F., Kim, M., Liu, L. & Jiang, L. AutoQ: Automated kernel-wise neural network quantization. In International Conference on Learning Representations (2020). URL https://openreview.net/forum?id=rygfnn4twS.
[30] Dong, Z., Yao, Z., Gholami, A., Mahoney, M. W. & Keutzer, K. HAWQ: Hessian aware quantization of neural networks with mixed-precision. CoRR abs/1905.03696 (2019). URL http://arxiv.org/abs/1905.03696. 1905.03696.
[31] Dong, Z. et al. HAWQ-V2: Hessian aware trace-weighted quantization of neural networks. CoRR abs/1911.03852 (2019). URL http://arxiv.org/abs/1911.03852. 1911.03852.
[32] Yao, Z., Gholami, A., Keutzer, K. & Mahoney, M. W. PyHessian: Neural networks through the lens of the Hessian. 2020 IEEE International Conference on Big Data (Big Data), 581–590 (2019). URL https://api.semanticscholar.org/CorpusID:209376531.
[33] Choi, J. et al. Bridging the accuracy gap for 2-bit quantized neural networks (QNN). CoRR abs/1807.06964 (2018). URL http://arxiv.org/abs/1807.06964. 1807.06964.
[34] Frantar, E., Singh, S. P. & Alistarh, D. Optimal brain compression: a framework for accurate post-training quantization and pruning. In Proceedings of the 36th International Conference on Neural Information Processing