Gradient-based Automatic Per-Weight Mixed Precision Quantization for Neural Networks On-Chip

Chang Sun¹,², Thea K. Årrestad¹, Vladimir Loncar³,⁴, Jennifer Ngadiuba⁵, Maria Spiropulu²

¹ ETH Zürich, Zürich, Switzerland
² California Institute of Technology, Pasadena, CA, USA
³ Massachusetts Institute of Technology, Cambridge, MA, USA
⁴ Institute of Physics Belgrade, Serbia
⁵ Fermi National Accelerator Laboratory, Batavia, IL, USA

Email: {chang.sun, thea.aarrestad, vladimir.loncar, jennifer.ngadiuba, maria.spiropulu}@cern.ch
*: Corresponding author
Abstract—Model size and inference speed at deployment time are major challenges in many deep learning applications. A promising strategy to overcome these challenges is quantization. However, a straightforward uniform quantization to very low precision can result in significant accuracy loss. Mixed-precision quantization, based on the idea that certain parts of the network can accommodate lower precision without compromising performance compared to other parts, offers a potential solution. In this work, we present High Granularity Quantization (HGQ), an innovative quantization-aware training method that fine-tunes the per-weight and per-activation precision automatically for ultra-low latency and low power neural networks to be deployed on FPGAs. We demonstrate that HGQ can outperform existing methods by a substantial margin, achieving resource reduction by up to a factor of 20 and latency improvement by a factor of 5 while preserving accuracy.
I. INTRODUCTION
Edge computing has significantly increased the importance
of real-time deep neural network (DNN) inference on spe-
cialized hardware [1]. The typical latency threshold for real-time inference is O(1) ms [2], [3], [4]. Nevertheless, certain domains require sub-microsecond inference times. At the CERN Large Hadron Collider (LHC), detectors generate tens of terabytes of data every second from collisions occurring every 25 nanoseconds. This data throughput is managed by a real-time selection system, the trigger. This system determines the fate of each collision event - whether it should be preserved for analysis or discarded - with a decision-making latency ceiling of a few microseconds [5], [6]. The trigger's precision is vital to retain only interesting events, thereby managing the bandwidth effectively and reducing the event rate significantly. The system consists of O(1000) field programmable gate arrays (FPGAs) mounted on custom boards. Several algorithms run in parallel on each FPGA. As a result, resources are scarce and the memory footprint of each algorithm should be minimal. In anticipation of the LHC's upgrade to the High Luminosity-LHC (HL-LHC) [7], which will increase the collision rate by a factor of 2-3 compared to the current one [5], [6], machine learning techniques are being explored to enhance the speed and accuracy of the computational tasks in the hardware trigger.
However, integrating demanding models - when resource
consumption and latency are strictly limited - without compro-
mising performance is a hurdle. Efforts in recent years have
focused on algorithmic efficiency, with strategies ranging from
the design of compact networks to weight pruning and quan-
tization [8], [9]. Quantization converts model parameters into
lower-precision formats, causing some loss in performance.
Although post-training quantization is computationally cheaper to perform, it generally incurs a significant loss in performance compared to the full-precision baseline. To mitigate this, quantization-aware training has been proposed, which adheres to a fixed numerical precision throughout training and thereby reduces the performance degradation.
To satisfy the latency requirements, neural networks on FPGAs for LHC physics experiments are usually fully unrolled and pipelined - all arithmetic operations are performed by different components in the circuit without overlapping, maximizing throughput and minimizing latency. To exploit this property, recent research [10], [11] has suggested that applying varying levels of quantization to different layers could further optimize accuracy against computational costs.
In this paper, we introduce the high-granularity quantization (HGQ) method, which allows models to be trained quantization-aware at arbitrary granularity: in contrast to the QAT library QKeras, where weights and activations are processed in layerwise blocks for quantization, HGQ enables weights and activations within one layer to have different bitwidths. For a fully unrolled implementation, we can allow every weight and activation to have its own unique bitwidth. We illustrate the key difference between the HGQ method and conventional block-wise quantization methods in Fig. I. Optimizing quantization parameters at higher granularity allows HGQ to find a better trade-off between model accuracy and resource consumption. Furthermore, by optimizing these
individual bitwidths alongside the network using gradient
descent, the need for training the network multiple times to
search for a favorable quantization bitwidth for each block of
the network could also be eliminated.
When multiplication operations in neural networks primar-
ily involve low-bitwidth operands implemented with look-up
tables (LUTs), HGQ could demonstrate a substantial reduction
in on-chip resource consumption by eliminating unnecessary
computations without compromising performance. Depending
on the specific task, we demonstrate that HGQ has the
potential to outperform
AutoQKeras
and achieve resource
reduction by up to a factor of 20, and latency improvement
by a factor of 5 while preserving accuracy.
A functional HGQ framework has been developed us-
ing Tensorflow and Keras, and we have open-sourced it as
free software. The Vivado/Vitis FPGA back-end is supported
through integration with
hls4ml
. The library guarantees an exact correspondence between the software and firmware models, provided that no numeric overflow occurs and intermediate values are representable by float32. The work presented in this paper makes the following contributions:
• We present a new algorithm for obtaining surrogate gradients of parameter bitwidths, from both the loss function and the estimated model resource consumption, enabling full gradient-based optimization of bitwidths;
• We enable heterogeneous quantization of a specific model at arbitrary granularity up to the per-parameter level, aiming to minimize hardware resource usage while preserving high accuracy. This approach naturally includes sparse pruning of network parameters by setting their bitwidth to zero, further reducing resource cost;
• We have made this method easily available online in an easy-to-use library, called HGQ¹, where simple drop-in replacement of Tensorflow Keras layers makes it straightforward for users to transform Keras models into their equivalent deep heterogeneously quantized versions, which are trained quantization aware;
• We have added support for quantized HGQ models in the hls4ml library, which converts these pre-trained quantized models into highly-parallel FPGA firmware for ultra low-latency inference. Using HGQ in combination with hls4ml ensures exact bit-level accuracy between the HGQ software model and the corresponding firmware model, making the library safe and easy to use for non-experts;
• We propose a new metric called Effective Bit Operations (EBOPs) for a more accurate estimation of on-chip resource consumption;
• We demonstrate a resource reduction of up to 95% and a 5-fold improvement in latency, all while maintaining accuracy compared to other state-of-the-art methods.
¹ https://github.com/calad0i/HGQ

II. RELATED WORK
Network compression has been shown to be an effective way to reduce the computational cost of neural networks on FPGAs. Quantization is a widely adopted method for
compressing deep neural networks (DNNs) for implementing
them on hardware devices such as FPGAs or ASICs. Previous
studies have utilized low precision quantization, such as binary or ternary, across networks to enhance throughput and reduce latency. Binary quantization restricts weights to α × {−1, 1}, and ternary to α × {−1, 0, 1}, with α as a scaling factor. Key examples include DoReFa Net [12], ABC-net [13], Binaryconnect [14], XNOR-net [15], TWN [16], TTQ [17], and [18]. These methods achieve high compression but at the cost of reduced performance compared to standard floating-point networks. Using binary network principles, several studies have moved to multi-bit network designs that represent numbers through binary bases and values, highlighted in works like [19], [20], [13], [21], [22]. Mix&Match [23], in particular, uses power-of-two bases for better hardware compatibility.
Many studies have investigated heterogeneous quantization with layer-specific precision to lessen the performance loss due to quantization. In particular, HAQ [24] utilizes reinforcement learning to find the best bitwidth configuration. HAWQ, HAWQ-V2, PyHessian, and Q-BERT [25], [26], [27], [28] focus on optimizing bitwidths through hessian-aware techniques. DNAS [29] and AutoQKeras [10] optimize bitwidths and network architecture simultaneously, with DNAS using stochastic sampling from a super network and AutoQKeras employing gradient-free methods like Gaussian Process, Hyperband, and stochastic search for hyperparameter optimization. Similarly, Meta-ML [30] applies iterative optimization to various hyperparameters, including bitwidths, weight pruning, and model architectures.
Some works, like RVQuant [31], BitsandBytes [32], and
SpQR [33], have investigated heterogeneous quantization
down to the sub-layer level, offloading outlier weights to
higher precision formats primarily for model compression
for large models rather than significant performance gains
on FPGAs. AutoQ [34] utilizes reinforcement learning to
optimize bitwidths for kernel weights and activations. A study
more aligned with ours is the recent FILM-QNN [35], which
optimizes weight and activation precision in a manner con-
ducive to hardware efficiency. It categorizes convolution layer
filters into groups of low and high precision, assigning them
based on anticipated quantization loss for each filter.
Pruning is another technique used to compress neural net-
works, enhancing their speed during hardware inference. This
method involves removing weights that have minimal impact
on the overall accuracy of the network. This concept was
first introduced in [36], and was applied to neural networks
in [37]. Pruning can be categorized as structured, involving
the removal of weights in specific blocks (as in [38], [39],
[40]), or unstructured, targeting individual weights (as in
[41], [42], [43], [44], [45], [40]). In this work, we consider
pruning as a form of quantization where pruned weights are
effectively quantized to zero bits. The QKeras [10] framework, like ours, aims to train and optimize neural networks for deployment on FPGAs. QKeras is developed on top of Tensorflow Keras [46] and leverages hls4ml [47] for FPGA deployment.
Fig. I.
Overview of the HGQ method, showing activations (circles) and weights (lines) with thickness indicating bitwidth. Connections are dropped when
weight or activation values are constantly zero. Top left: baseline network with high precision throughout. Top right: network quantized layer-wise, e.g.,
using QKeras. Bottom right: network both quantized layer-wise and pruned. Bottom left: network quantized using HGQ, applying more detailed quantization
and assigning high bitwidths only where needed, on a per-weight and per-activation basis. This approach reduces resource use by maximally utilizing the FPGA's heterogeneous computation.
It specializes in training and optimizing neural networks, allowing for the use of arbitrary precision fixed-point numbers for both weights and activations. AutoQKeras, a feature within QKeras, enables automatic tuning of the quantization settings for each layer using a gradient-free approach. This can lead to significant compression, including the use of binary or ternary networks [11]. Typically, hls4ml is employed as the backend for deploying on FPGAs. Brevitas [48] serves as the PyTorch [49]
equivalent of
Qkeras
, commonly paired with the
FINN
and
FINN-R
frameworks from AMD Research [50], [51] for
deploying on AMD FPGAs.
III. HIGH GRANULARITY QUANTIZATION
In this paper, we introduce High Granularity Quantization
(HGQ). This is a novel quantization approach that allows for
up to individual precision levels within a single layer, offering
the unique capability for each parameter in a network to have
its own bitwidth. We begin this section by outlining the fun-
damentals of quantization and Quantization-Aware Training
(QAT). Subsequently, we introduce an innovative gradient-
based technique for auto-tuning the quantization bitwidth
during training. A comprehensive explanation of the HGQ
method and its algorithm follows. This approach is designed
to improve the accuracy-resource/latency balance compared
to previously studied block-wise heterogeneous quantization
methods in neural networks.
A. Quantization
Quantization is a map, henceforth referred to as f_q, that transforms a real number into a finite set of discrete values, mapping from the set of real numbers ℝ to a discrete subset Q ≡ {q_i | q_{i+1} > q_i} ⊂ ℝ. For hardware efficiency, we ensure
that quantized weights and activations are represented as fixed-
point numbers, a common practice in hardware for numerical
representation. A fixed-point number is essentially an integer
scaled by a predefined factor, typically powers of two. It is
characterized by its bitwidth (total number of bits) and the
number of bits allocated for the integer portion. The inclusion
of the sign bit in the integer part, for signed numbers, varies
by convention. In this context, we adhere to the convention
used in Xilinx® Vivado®/Vitis® HLS, which includes the sign bit in the integer part if present. Following this convention, a fixed-point number has b ∈ ℕ⁺ total bits, of which i ∈ ℤ are dedicated to the integer part. We define f as the number of fractional bits, f ≡ b − i. For signed numbers, the representable range is [−2^{i−1}, 2^{i−1} − 2^{−f}], with a step size of 2^{−f}. For unsigned numbers, the range is [0, 2^i − 2^{−f}], sharing the same step size.
One way of quantizing a real number into a fixed-point format, fixed<b,i>, can be expressed with a rounding function as follows:

  f_q(x) = \left(\left([x \cdot 2^f] + 2^{b-1}\right) \bmod 2^b - 2^{b-1}\right) \cdot 2^{-f}
         = \begin{cases} [x \cdot 2^f] \cdot 2^{-f}, & \text{if } x \in [-2^{i-1},\, 2^{i-1} - 2^{-f}] \\ \text{overflow}, & \text{otherwise,} \end{cases}   (1)

where [x] ≡ ⌊x + ε⌋ with ε ∈ [0, 1) and f ≡ b − i. Note that setting ε = 1/2 applies conventional rounding to the nearest integer. In this context, "overflow" implies that a value exceeds the representable limits of the fixed-point format, causing a cyclical wrap to the opposite end of the range. Although a quantization function could be designed to adjust values outside the permissible range to the closest valid value (for instance, by clipping them to the range limits), this approach is intentionally not used in our work to avoid resource overhead. By judiciously selecting the quantization range, we ensure that overflow does not occur.
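For illustration only, a minimal NumPy sketch of such a signed fixed<b,i> quantizer with wrap-around overflow follows; the function name, the round-half-up choice ε = 1/2, and the modular-wrap formulation are our own assumptions for this sketch, not code from the HGQ library.

import numpy as np

def quantize_signed(x, b, i, eps=0.5):
    """Quantize x to signed fixed<b,i>: b total bits, i integer bits
    (sign bit included), f = b - i fractional bits. Out-of-range
    values wrap around, mimicking hardware overflow."""
    f = b - i
    n = np.floor(x * 2.0**f + eps)                # rounding with offset eps
    n = (n + 2**(b - 1)) % 2**b - 2**(b - 1)      # wrap into [-2^(b-1), 2^(b-1)-1]
    return n * 2.0**-f

# Example: fixed<4,2> covers [-2, 1.75] in steps of 0.25
print(quantize_signed(np.array([0.3, 1.6, 2.1]), b=4, i=2))
# -> [ 0.25  1.5  -2.  ]   (2.1 exceeds the range and wraps)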
For an unsigned fixed-point number, denoted as ufixed<b,i>, the quantization function is described below, using the same terminology:

  f_q(x) = \left([x \cdot 2^f] \bmod 2^b\right) \cdot 2^{-f}   (2)
         = \begin{cases} [x \cdot 2^f] \cdot 2^{-f}, & \text{if } x \in [0,\, 2^i - 2^{-f}] \\ \text{overflow}, & \text{otherwise.} \end{cases}   (3)
In our approach, we only track the number of fractional bits of the fixed-point number during training. Before deploying the network for hardware synthesis (e.g., into HLS projects), we calculate the required number of integer bits to avoid overflow. This task is trivial for weights, as they are constants after training. For intermediate accumulator and activation values, we employ a calibration dataset to gauge the extreme values (both maximum and minimum) the values might assume. This process involves running the dataset through the network and logging the extreme quantized values, v^q_max and v^q_min. Given the fixed-point number's range of [−2^{i−1}, 2^{i−1} − 2^{−f}], we can determine the necessary integer bitwidth i using

  i = \max\left(\left\lfloor \log_2 |v^q_{\max}| \right\rfloor + 1,\ \left\lceil \log_2 |v^q_{\min}| \right\rceil\right).   (4)
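A sketch of this calibration step in NumPy is shown below. The function name is illustrative and not part of the HGQ API, and the guards against zero-valued extremes are our own addition; it simply evaluates Eq. (4) on the observed quantized extremes.

import numpy as np

def integer_bits_from_calibration(quantized_activations):
    """Integer bitwidth i needed to cover the extreme quantized values
    observed on a calibration set, following Eq. (4)."""
    v_max = np.max(quantized_activations)
    v_min = np.min(quantized_activations)
    i_pos = np.floor(np.log2(np.abs(v_max))) + 1 if v_max != 0 else 0
    i_neg = np.ceil(np.log2(np.abs(v_min))) if v_min != 0 else 0
    return int(max(i_pos, i_neg))

# Example: quantized activations observed in [-3.0, 5.25]
print(integer_bits_from_calibration(np.array([-3.0, 0.5, 5.25])))  # -> 3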
By ensuring the calibration dataset accurately reflects the in-
put data distribution the network will encounter in deployment,
we can guarantee that overflow will not occur. For extra safety,
one may also add margin to the computed range to account for
potential outliers in the input data. This method eliminates the
need to consider the representational range during the training
phase. Therefore, the quantization function during training can
be expressed as:
  f_q(x) = [x \cdot 2^f] \cdot 2^{-f} = \lfloor (x + \varepsilon) \cdot 2^f \rfloor \cdot 2^{-f}.   (5)

Without loss of generality, we assume ε = 1/2 for the rest of this paper. This choice does not affect any of the results or conclusions drawn in this work.
B. Quantization-Aware Training
Quantization-aware training (QAT) trains neural networks
by applying quantization directly during the training pass.
Previous work, e.g., [10], demonstrates that QAT significantly reduces the accuracy loss typically caused by quantization. In this work, we adopt the QAT method utilized in [10] as the foundational technique for our HGQ method. Specifically, we employ the straight-through estimator (STE) [52] for weight and activation quantization, which quantizes the values during the forward pass while acting as an identity for computing the gradients in the backward pass. This strategy maintains a good balance between effective quantization and overhead during training.
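To illustrate the STE mechanics, the following is a generic TensorFlow sketch (our own, not the HGQ implementation): the forward pass returns the quantized value of Eq. (5), while the backward pass treats the quantizer as the identity with respect to x.

import tensorflow as tf

def ste_quantize(x, f):
    """Quantize x to f fractional bits; forward pass is quantized,
    backward pass treats the quantizer as the identity (STE)."""
    scale = 2.0 ** f
    xq = tf.floor(x * scale + 0.5) / scale       # round-half-up, Eq. (5)
    return x + tf.stop_gradient(xq - x)          # value: xq, local gradient w.r.t. x: 1

x = tf.Variable([0.30, -1.27])
with tf.GradientTape() as tape:
    y = tf.reduce_sum(ste_quantize(x, f=3) ** 2)
print(tape.gradient(y, x).numpy())  # the quantizer contributes an identity gradient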
C. FPGA resource consumption
A common metric for estimating on-chip resource usage in FPGAs is Bit Operations (BOPs) [53]. BOPs quantify resource usage on the FPGA by counting the number of bit operations performed during the network's inference. For two numbers with bitwidths b_i and b_j, the number of BOPs is b_i · b_j for a multiplication operation and max(b_i, b_j) + 1 for an addition operation. However, BOPs fall short in accurately reflecting resource consumption in many cases for a fully unrolled neural network on an FPGA.
This discrepancy arises from the fact that the multiplication operations are usually between a fixed constant and a variable. In particular, for an unrolled implementation on hardware:
1) Declaring a constant in fixed-point format of b bits does not necessarily mean that all b bits are used. For instance, a weight of 0.5 in an 8-bit fixed-point format only uses 1 bit instead of 8 bits, and counting it as 8 in BOPs leads to an overestimation of resource usage.
2) BOPs tend to double count an accumulation operation that follows directly after a multiplication operation between a constant and a variable: multiplication between a constant and a variable can be implemented as a series of additions of shifted values of the variable. The operation count for a single multiplication involving b_i and b_j bitwidths thus becomes either b_i · (b_j − 1) or (b_i − 1) · b_j. In scenarios involving multiplication-accumulation, the bit operations are approximated as b_i · b_j.
To address this discrepancy and offer a more precise estimation of on-chip resource usage, we propose a novel metric, Effective Bit Operations (EBOPs).
The bitwidth used for constants in EBOPs is not the declared bitwidth, but the number of bits that are enclosed by non-zero values. For instance, a weight represented as 001xx1000 will be counted as 4 bits instead of 8 bits. This approach ensures that the resource estimation is not overestimated due to the declared bitwidth.
To address the second issue, EBOPs quantify only the cumulative BOPs conducted during multiplicative processes in a network. Let M = {(i, j)_n} be the set of all multiplication operations between operands with bitwidths b_i and b_j. The total number of EBOPs can then be expressed as:

  \mathrm{EBOPs} = \sum_{(i,j) \in M} b_i \cdot b_j.   (6)

Bit operations in accumulation processes are intentionally omitted here, under the assumption that they are implicitly included within the EBOPs framework, which avoids the second issue mentioned above.
Experimental findings validate EBOPs as a reliable estimator for on-chip resource consumption, closely mirroring a linear combination of LUT and DSP usage. Detailed results are discussed in Sec. V. To get an accurate resource estimation from EBOPs, one should include only operations that will be executed in parallel. For instance, different inputs fed to the same multiplier through a FIFO buffer should be counted only once (e.g., the implementation of convolutions in hls4ml in general). Additionally, this estimation does not include overhead from non-multiplicative processes (e.g., buffers used in hls4ml's io_stream implementation). Note, though, that it is feasible to estimate these separately by other means and add the additional overhead to the final result.
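As an illustration, EBOPs for a fully unrolled dense layer can be tallied from the per-input activation bitwidths and the effective bitwidths of the weights. The sketch below uses our own helper names and a simplified effective-bit rule; it is not the accounting code of the library.

import numpy as np

def effective_bits(w, total_bits=8, frac_bits=4):
    """Bits enclosed by the most- and least-significant non-zero bits of the
    fixed-point representation of |w| (0 if w quantizes to 0)."""
    n = int(round(abs(w) * 2**frac_bits)) % 2**total_bits
    if n == 0:
        return 0
    msb = n.bit_length() - 1
    lsb = (n & -n).bit_length() - 1
    return msb - lsb + 1

def dense_ebops(weights, act_bits, total_bits=8, frac_bits=4):
    """EBOPs of a dense layer: sum of b_i * b_j over every
    (input activation, weight) multiplication, as in Eq. (6)."""
    ebops = 0
    for i, b_act in enumerate(act_bits):          # one bitwidth per input
        for w in weights[i]:                      # weights fanning out of input i
            ebops += b_act * effective_bits(w, total_bits, frac_bits)
    return ebops

weights = np.array([[0.5, -0.375], [0.0, 1.25]])  # shape (n_in, n_out)
print(dense_ebops(weights, act_bits=[6, 4]))      # -> 30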
D. Gradient-based optimization of bitwidths
To obtain a fully-unrolled quantized neural network with
minimum resource- or area-usage on-chip, we want each
weight and activation bitwidth to be individually optimized.
However, in this way, the number of bitwidth parameters could
exceed the number of trainable parameters in the original
network. The only feasible approach to managing such a
vast parameter space is through gradient-based optimization.
Nonetheless, direct optimization of these discrete bitwidth
values via gradients is not possible due to the absence of a
direct gradient path from the loss function to the bitwidths.
Therefore, we address two main issues:
a)
make the discrete
bitwidths optimizable with a gradient; and
b)
estimate a
surrogate gradient for these bitwidths.
1) Optimize discrete bitwidths with gradient:
The first issue
can be straightforwardly addressed by treating the discrete
bitwidths similar to the discrete weights in a quantized net-
work. In particular, we apply the straight-through estimator
(STE) to real-numbered bitwidths as it is done for the weights,
and we follow the STE implementation used in QKeras:
  \mathrm{ste}(x) = x + \mathrm{sg}([x] - x),   (7)

where sg : ℝ → ℝ is the identity function that detaches the gradient from the enclosed expression. In this way, the bitwidths can be optimized if they have gradients attached. Continuous values for the bitwidths are stored, and they are only rounded to integers as needed during forward passes. During backward passes, the rounding operations default to the identity.
2) Surrogate gradient for bitwidths: To address the second issue, we first consider some parameter x (e.g., a weight or activation) in the network and its corresponding quantizer f_q(·). If we require that the quantized number has at most f fractional bits, its associated quantization error δ_f can be expressed as follows with ε = 1/2:

  \delta_f \equiv x - f_q(x) = x - [x \cdot 2^f] \cdot 2^{-f}.   (8)

During training, we assume x to be a random variable following a certain smooth distribution D_x. We further assume that the variance of D_x is significantly larger than the quantization error δ_f, in such a way that one can approximate the quantization error as a uniform distribution:

  \delta_f \sim \mathrm{Uniform}(-2^{-f-1},\ 2^{-f-1}).   (9)
Let the loss of the network be L, and express the gradient of f with respect to L as

  \frac{\partial \mathcal{L}}{\partial f} = \frac{\partial \mathcal{L}}{\partial \delta} \cdot \frac{\partial \delta}{\partial f}.   (10)

In this expression, the first term ∂L/∂δ can be obtained trivially with backpropagation. The second term ∂δ/∂f is not well-defined, as f can only take integer values for a properly defined quantizer and thus has no gradient. To address this issue, we propose a surrogate gradient method that assigns a gradient to f only on integer values.
We now express the loss as a function of the weights θ and all the quantization errors δ, L(θ, δ). To obtain the surrogate gradient of f, we assume that the loss function is sensitive to the magnitude of the quantization error, but not its sign: L(θ, |δ|).
For a parameter x ∼ D_x with f ∈ ℤ floating bits to be quantized, the corresponding absolute quantization error is |δ_f| ≡ |x − f_q^f(x)| ∼ Uniform(0, 2^{−f−1}). By increasing f by one, we obtain the absolute quantization error |δ_{f+1}| as a function of f and |δ_f|:

  |\delta_{f+1}| = \begin{cases} |\delta_f|, & |\delta_f| \le 2^{-f-2} \\ 2^{-f-1} - |\delta_f|, & |\delta_f| > 2^{-f-2}. \end{cases}   (11)
We can then obtain the gradient of |δ_f| with respect to f using the finite difference approximation:

  \frac{\partial |\delta_f|}{\partial f} \leftarrow |\delta_{f+1}| - |\delta_f|.   (12)

However, as the absolute quantization error is bounded by a geometric sequence in 2^{−f−1}, using a linear difference for the approximation is suboptimal. Instead, we use the following heuristic expression to approximate the gradient, which recovers Eq. (12) in the limit |δ_{f+1}| → |δ_f|:

  \frac{\partial |\delta_f|}{\partial f} \leftarrow |\delta_f| \cdot \log \frac{|\delta_{f+1}|}{|\delta_f|}.   (13)
Expressing the ratio of |δ_{f+1}| and |δ_f| as a function of |δ_f|, we have

  \frac{|\delta_{f+1}|}{|\delta_f|} = \begin{cases} 1, & |\delta_f| \le 2^{-f-2} \\ \dfrac{2^{-f-1}}{|\delta_f|} - 1, & |\delta_f| > 2^{-f-2}. \end{cases}   (14)

One may obtain a gradient surrogate by combining Eq. (13) and Eq. (14). However, using the local relation between |δ_{f+1}| and |δ_f| expressed in Eq. (14) could lead to a loss landscape for f with extensive high-frequency components that is hard to optimize. To mitigate this issue and smooth out the loss landscape, we take the expectation of the logarithmic factor in Eq. (13) over |δ_f|:

  \mathbb{E}_{|\delta_f|}\left[\log \frac{|\delta_{f+1}|}{|\delta_f|}\right] = -\log 2.   (15)
Substituting Eq. (15) into Eq. (13) and multiplying both sides by sign(δ_f), we obtain the surrogate gradient for f:

  \frac{\partial \delta_f}{\partial f} \leftarrow -\log 2 \cdot \delta_f.   (16)
Hence, the forward pass of the quantizer, with respect to one input value x and its float bitwidth f, can be expressed as in Algorithm 1, and the backward pass is the auto-differentiation of the forward pass with the stop-gradient operations.

Algorithm 1: Quantizer forward pass
Data: x: the input value; f: the float bitwidth
Result: x_q: the quantized value of x with float bitwidth f
  f ← ste(f);
  x_q ← sg([x · 2^f] · 2^{−f});
  δ ← sg(x − x_q);                           // standard STE-based quantization
  δ ← sg(δ + ln 2 · f · δ) − ln 2 · f · δ;   // attach the gradient of f to δ
  x_q ← x − δ;                               // attach the gradients of f and x to x_q
return x_q
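A minimal TensorFlow rendering of Algorithm 1 is sketched below for illustration; it is our own reconstruction and the HGQ library's internal implementation may differ. The surrogate gradient of Eq. (16) is attached to f via the stop-gradient trick, while x keeps the usual STE gradient.

import tensorflow as tf

def hgq_quantizer(x, f):
    """Forward: STE quantization of x with (rounded) f fractional bits.
    Backward: x receives the identity gradient; f receives the surrogate
    gradient d(delta)/d(f) = -ln(2) * delta from Eq. (16)."""
    f_int = f + tf.stop_gradient(tf.round(f) - f)          # ste(f)
    scale = 2.0 ** f_int
    xq = tf.stop_gradient(tf.floor(x * scale + 0.5) / scale)
    delta0 = tf.stop_gradient(x - xq)                       # detached quantization error
    ln2 = tf.math.log(2.0)
    # Same forward value as delta0, but d(delta)/d(f) = -ln(2) * delta0
    delta = tf.stop_gradient(delta0 + ln2 * f_int * delta0) - ln2 * f_int * delta0
    return x - delta                                        # gradients of x and f attached

x = tf.constant([0.30, -1.27])
f = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = tf.reduce_sum(hgq_quantizer(x, f) ** 2)
print(tape.gradient(y, f).numpy())  # non-zero surrogate gradient on the bitwidth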
Quantization often results in higher loss values, causing
the gradients from the loss function to propagate to the
bitwidths and increase them. To address this, we introduce a
regularization term to prevent the bitwidths from growing too
large. We use EBOPs as this regularization term, incorporating
it into the loss function with a regularization coefficient
β
to
balance accuracy against on-chip resource usage. Moreover,
for network values not involved in multiplicative operations
(such as last-layer outputs or inputs to non-linear activations),
we apply an L-1 regularization with a coefficient
γ
to the
bitwidths, preventing them from expanding unnecessarily. The
final loss function is given by

  \mathcal{L} = \mathcal{L}_{\mathrm{base}} + \beta \cdot \mathrm{EBOPs} + \gamma \cdot \mathrm{L1}_{\mathrm{norm}},   (17)

with gradients attached to the bitwidths.
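Schematically, and with illustrative names only (this is a sketch of Eq. (17), not the HGQ API, which wires the regularization terms up internally), the combined objective can be written as:

import tensorflow as tf

def total_loss(task_loss, ebops, passive_bitwidths, beta=1e-6, gamma=2e-6):
    """Eq. (17): task loss plus resource regularization. `ebops` is assumed
    to be computed from the (differentiable) bitwidth variables, and
    `passive_bitwidths` are bitwidths of values not entering multiplications."""
    l1 = tf.reduce_sum(tf.abs(passive_bitwidths))
    return task_loss + beta * ebops + gamma * l1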
As all additional gradients introduced in this section are
directly added to the bitwidths, the loss landscape of the
network’s weights remains unperturbed compared to that of
networks with static quantization parameters. Consequently, it
eliminates the alterations to the loss landscape which are intro-
duced by regularization-based quantization methods, like [54],
[55].
3) Gradient for bitwidths with multiple parameters: In experiments, we noticed that when the parameter group size increases while keeping the same β, the corresponding bitwidth is more likely to collapse to zero, causing training to break down. We hypothesize that this effect is due to the non-uniformity of the gradients contributed by the parameters in the group. To mitigate it, we normalize the gradient on f_i by 1/√‖g_i‖, based on empirical observations. Here, ‖g_i‖ denotes the number of elements in g_i. The quantizer's forward pass with respect to a parameter group g_i is described in Algorithm 2 for training, and by Eq. (5) for inference. The backward pass is derived from the forward pass automatically.
Algorithm 2: Quantizer forward pass for a parameter group
Data: g_i: the i-th parameter group; f_i: the number of floating bits for the i-th parameter group
Result: (g_q)_i: the quantized parameters of the i-th parameter group
  if g_i is a group of weights then
      f_i ← ste(f_i);
  else
      f_i ← f_i;
  end
  (g_q)_i ← empty list;
  N ← ‖g_i‖;
  forall v_j in g_i do
      v_q ← sg([v_j · 2^{f_i}] · 2^{−f_i});
      δ ← sg(v_j − v_q);
      δ ← sg(δ + ln 2 · f_i · δ/√N) − ln 2 · f_i · δ/√N;
      v_{q,j} ← v_j − δ;
      append v_{q,j} to (g_q)_i;
  end
return (g_q)_i

4) Connection to Pruning: From Eq. (5), it is observable that the quantized value defaults to zero whenever −ε · 2^{−f} ≤ x < (1 − ε) · 2^{−f}. Given that f can take both positive and negative values, a sufficiently small f with ε > 0 will cause certain parameters in the network to always be zero. This is equivalent to pruning those parameters. By assigning a distinct bitwidth to each parameter group in the network, HGQ automatically prunes the network during training in an unstructured manner, taking both the loss and the resource consumption into account.
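As a concrete numerical illustration of this pruning behaviour (our own example, using the training-time quantizer of Eq. (5) with ε = 1/2): once f is small enough that the whole weight falls inside the rounding dead-zone, the quantized weight is exactly zero and the corresponding multiplication disappears from the circuit.

import numpy as np

def quantize(x, f, eps=0.5):
    """Training-time quantizer of Eq. (5): keep f fractional bits."""
    return np.floor(x * 2.0**f + eps) * 2.0**-f

w = 0.10
for f in [3, 1, 0, -1]:
    print(f, quantize(w, f))
# f=3 -> 0.125; f=1 -> 0.0; f=0 -> 0.0; f=-1 -> 0.0
# For f <= 1 this weight is effectively pruned: it quantizes to zero
# because |w| < (1 - eps) * 2**-f.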
Listing 1. HGQ model example.

from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from HGQ import HQuantize, HDense

# r: the EBOPs regularization factor
inp = Input((16,))
out = HQuantize(name='inp_q')(inp)
out = HDense(64, activation='relu', bops_reg_factor=r)(out)
out = HDense(32, activation='relu', bops_reg_factor=r)(out)
out = HDense(32, activation='relu', bops_reg_factor=r)(out)
out = HDense(5, activation='linear', bops_reg_factor=r)(out)
hgq_model = Model(inp, out)
IV. THE HIGH GRANULARITY QUANTIZATION FRAMEWORK
The HGQ algorithm is available as a user-friendly Python library, similar to QKeras [10]. It functions as a sophisticated quantization API built on top of Keras [56], utilizing hls4ml [47] for deployment.

Listing 2. Keras model example.

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inp = Input((16,))
out = Dense(64, activation='relu')(inp)
out = Dense(32, activation='relu')(out)
out = Dense(32, activation='relu')(out)
out = Dense(5, activation='linear')(out)
keras_model = Model(inp, out)

Additionally, this framework
facilitates automatic conversion of a
tensorflow.keras
model into a
hls4ml
model, ensuring bit-accuracy as per
the specifications of a dataset defined by the user, without
requiring manual intervention.
Following the methodology of QKeras, HGQ encompasses
most of the layers available in
hls4ml
, and adheres to the
design principles of Keras. HGQ is engineered to carry out
automatic quantization on all compatible layers according to
the EBOPs regularization factor,
β
. This approach eliminates
the necessity for users to fine-tune quantization parameters
for individual modules or undergo multiple training cycles to
identify the best quantization scheme.
HGQ provides drop-in replacement for the most commonly
used Keras layers, making it straightforward to transition from
a standard Keras model to an HGQ model with minimal adjust-
ments. For instance, as demonstrated in listing 2, converting to
an HGQ model primarily involves substituting existing layers
with their HGQ alternatives, as shown in listing 1, along with
the inclusion of an additional layer to quantize inputs. Within
the HGQ framework, there are two categories of layers: Het-
erogeneous (
H-
) layers, which accept an additional parameter,
beta, to manage the layer's resource usage regularization based on EBOPs, and Passive (
P-
) layers, which serve to
relay metadata without performing quantization. The
H-
layers
also allow for layer-specific kernel and pre-activation quantizer
configurations (
kq_config
and
paq_config
, respectively)
for more precise control over quantization behaviors.
HGQ simplifies the process by auto-learning optimal quanti-
zation parameters during training, thus mostly freeing the user
from manually specifying bit widths and scaling factors for
each layer. Instead, users are primarily concerned with setting
the EBOPs’ regularization factor, although manual parameter
adjustment is still an option.
Beyond quantization-aware training, HGQ introduces a con-
venient intermediate layer or proxy models for transitioning a
trained Keras model to
hls4ml
. This feature accommodates
both
HGQ
and
QKeras
models, automating the creation of
hls4ml
configurations for precise conversions. Furthermore,
the proxy model facilitates bit-accurate emulation of the com-
piled
hls4ml
model, aiding in debugging and validating the
hls4ml
model’s performance through development, even in
cases of overflow within the constraints of available bit width,
up to the accuracy permitted by
float32
precision used in
the emulation.
A. Resource Consumption Surrogate
We consider five types of major resources on a Xilinx FPGA chip: LUTs, DSPs, FFs, BRAMs, and URAMs. When fitting an unrolled neural network for ultra low latency applications, like the L1 triggers at the LHC, the limiting resources are usually either LUTs or DSPs. Empirically, operations involving larger bitwidths are more likely to consume DSPs, while operations with smaller bitwidths are more likely to consume LUTs. During our experiments, we observed that EBOPs roughly predict a linear combination of the LUT and DSP consumption, namely EBOPs ≈ LUTs + 55 × DSPs, for models synthesized with io_type=io_parallel in hls4ml, where intermediate values are directly wired between layers/modules.
In Fig. II, we demonstrate the relationship between EBOPs and the actual on-chip resource consumption, represented by post place-and-route LUTs + 55 × DSPs, synthesized with Xilinx Vivado 2020.1/Vitis 2023.2 for the models shown in this work. Although the relationship is not exact, we can still make a reasonable estimation of resource usage based on EBOPs, even during training. This suggests that treating one DSP as equivalent to approximately 55 LUTs could be a practical approximation for comparing resource usage across different models, although this may not hold universally. It is important to note that EBOPs primarily account for operations involving a multiplication-accumulation of one constant and one variable. Therefore, if operations other than these significantly contribute to the consumption of on-chip resources, the EBOPs-based estimation might not be reliable. For instance, in the SVHN classifier model synthesized with io_type=io_stream, the resource usage of the FIFO buffers is not factored in, leading to an underestimation of the total resource consumption as predicted by EBOPs.
V. RESULTS
To evaluate the performance of the HGQ framework, we train and evaluate models on a classification, a computer vision, and a regression task: jet tagging at the LHC [10], SVHN digit classification [57], and muon tracking [58], respectively.
To demonstrate the trade-off between accuracy (or resolution) and resource usage, we methodically adjusted the β factor for each task during training to map out various optimal points on the accuracy (resolution) versus resource consumption Pareto front. This process involved starting with a notably low β value and incrementally raising it through the training, capturing all models that lie on the Pareto front defined by validation accuracy (or resolution) and estimated resource consumption via EBOPs. Meanwhile, we kept the γ value fixed at 2 × 10⁻⁶ for all experiments to avert the risk of layers diverging in bitwidths. Post-training, we reassess the models on the test set, providing details on accuracy or resolution based on c-synthesis,
and detailing resource consumption after the place-and-route phase.

Fig. II. The relationship between EBOPs and resource consumption estimated by LUTs + 55 × DSPs. EBOPs roughly predict a linear combination of the LUT and DSP consumption for models synthesized with io_type=io_parallel. Models shown in this figure are from the three tasks described in Section V. The relationship is not exact, but indicates that one DSP is roughly equivalent to 55 LUTs when comparing the resource consumption of different models.
A. Jet Classification at the LHC
We conducted a comparison of the accuracy, latency, and
on-chip resource utilization of models trained with HGQ
against various quantized models from earlier research.
We use the dataset from [59]. This dataset is for classifying
jets, a kind of particle shower produced by high-energy
particles at the LHC experiments, into five classes based on
their originating particles: quark (q), gluon (g), W boson, Z
boson, and top (t) jets. The inputs for each jet are 16 scaler
values representing physics-motivated high-level features. The
model architecture employed is based on the full precision
baseline model described in the original work [10], which is
a 4-layer fully connected neural network. The exact model
architecture is shown in Fig. VI in extended data.
The results are summarized in Table I. In the table, the
following models are cited from [10]: BF, BP, BH, Q6, QE,
and QB. In this work, various techniques such as Quantization-
Aware Training (QAT), pruning, and automated parameter
optimization using a Gaussian Process (for the QE and QB
models) were used to achieve low resource consumption.
LogicNets JSC-M and JSC-L are cited from [60], where the
networks are designed to use on-chip LUTs efficiently. BP-
DSP-RF=2 is cited from [38], where the network is imple-
mented in QKeras with a reuse factor of two, and pruned
in a DSP-aware fashion to reduce the resource consumption.
MetaML-α_q=1% and MetaML-α_q=4% are cited from [30].
These two networks went through iterative architecture search,
quantization, and pruning for better accuracy-resource trade-
off. SymbolNet is cited from [61], which leverages a gradient-
based method for optimal symbolic expression searching. It
also uses an adaptive dynamic pruning scheme to reduce on-
chip resource consumption while maintaining accuracy.
As shown in Fig. III and Tab. I, the HGQ models outperform
the baseline models by a significant margin both in terms of
model accuracy and resource usage. Depending on the working point, HGQ models reduce the resource consumption by between 50% and up to 95% while maintaining the same accuracy. At lower accuracy requirements, they can also achieve resource consumption similar to that of an optimized symbolic regressor.
The HGQ trained models, HGQ 1 through 8, are taken from the same training run with β ramping up during training. These models were initially set to use 2 bits for representing the activations' floating-point part and 2 bits in total for the weights. Throughout the training process, which spanned roughly 300,000 epochs, β was gradually increased from 10⁻⁶ to 10⁻⁴. Due to the models' compact size, this entire training process could be completed in just a few hours on a standard consumer-grade GPU.
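The β schedule itself can be implemented generically. The sketch below uses hypothetical names and is not the HGQ API: it assumes the regularization coefficient is exposed as a tf.Variable that the loss term reads at every step, and ramps it exponentially between two values with a standard Keras callback.

import tensorflow as tf

beta_var = tf.Variable(1e-6, trainable=False)  # read by the regularization term

class BetaRamp(tf.keras.callbacks.Callback):
    """Exponentially interpolate beta from beta_start to beta_end over training."""
    def __init__(self, beta_start, beta_end, total_epochs):
        super().__init__()
        self.beta_start, self.beta_end, self.total = beta_start, beta_end, total_epochs

    def on_epoch_begin(self, epoch, logs=None):
        t = min(epoch / max(self.total - 1, 1), 1.0)
        beta_var.assign(self.beta_start * (self.beta_end / self.beta_start) ** t)

# Usage: model.fit(..., callbacks=[BetaRamp(1e-6, 1e-4, total_epochs=300_000)])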
We also studied the performance of models trained with fixed, non-zero β values. In Fig. III and Tab. I, these correspond to HGQ-c1, trained with a fixed β of 2.1 × 10⁻⁶, and HGQ-c2, using a β of 1.2 × 10⁻⁵. Each is trained for 5,000 epochs (lasting a few minutes on a consumer-grade GPU). Models trained with either a constant or an incrementally increasing β value are capable of achieving a comparable balance between accuracy and resource consumption. From this, we conclude that the method of progressively ramping up β throughout the training process is effective in generating a collection of models that represent an optimal compromise between accuracy and resource efficiency, situating them favorably on the accuracy-resource Pareto frontier.
B. SVHN Classifier
We also benchmark HGQ on a computer vision task and compare it to previous state-of-the-art work [57], [38] on the SVHN dataset [62]. The SVHN dataset consists of 32 × 32 RGB images of house numbers taken from Google Street View. The task is to classify the digit in the center of the image into one of ten classes. The architecture of the model is a LeNet-like [63] convolution-dense network directly taken from [57]. The exact model architecture is shown in Fig. VII in the extended data.
These results are summarized in Tab. II and Fig. IV. In the table, the following models are taken from [57]: AQP, AQ, QP 7-bit, Q 7-bit, and BP 14-bit. All models are quantized, and pruning to a sparsity of 50% is applied to AQP, QP, and BP. AQP and AQ are heterogeneously quantized models, where the quantization configuration is obtained using a Gaussian Process hyperparameter optimization, and they are trained quantization aware with QKeras. QP 7-bit, Q 7-bit, and BP 14-bit are homogeneously quantized models trained quantization aware with QKeras.
TABLE I
Resource consumption and latency of the jet tagging models. Resources reported for HGQ models are after place & route. In this task, HGQ models outperform the baseline models by a large margin in both accuracy and resource consumption. The LGQ model is not high-granularity quantized, but only uses gradient-based bitwidth optimization.

| Model | Accuracy (%) | Latency (cc) | DSP (%) | LUT (%) | FF (%) | II (cc) |
| BF [10] | 74.4 | 9 (45 ns) | 56.0 (1,826) | 4.09 (48,321) | 0.8 (20,132) | - |
| BP [10] | 74.8 | 14 (70 ns) | 7.7 (526) | 1.49 (17,577) | 0.4 (10,548) | - |
| BH [10] | 73.2 | 14 (70 ns) | 1.3 (88) | 1.34 (15,802) | 0.3 (8,108) | - |
| Q6 [10] | 74.8 | 11 (55 ns) | 1.8 (124) | 3.36 (39,782) | 0.3 (8,128) | - |
| QE [10] | 72.3 | 11 (55 ns) | 1.0 (66) | 0.77 (9,149) | 0.1 (1,781) | - |
| QB [10] | 71.9 | 14 (70 ns) | 1.0 (69) | 0.95 (11,193) | 0.1 (1,771) | - |
| LogicNets JSC-M [60] | 70.6 | N/A | 0 (0) | 1.22 (14,428) | 0.02 (440) | - |
| LogicNets JSC-L [60] | 71.8 | 5 (13 ns) | 0 (0) | 3.21 (37,931) | 0.03 (810) | - |
| BP-DSP-RF=2 [38] | 76.3 | 21 (105 ns) | 2.6 (175) | 0.47 (5,504) | 0.13 (3,036) | 2 |
| MetaML-α_q=1% [30] | 75.6 | 9 (45 ns) | 0.7 (50) | 0.57 (6,698) | N/A | 1 |
| MetaML-α_q=4% [30] | 72.8 | 8 (40 ns) | 0.2 (23) | 0.57 (7,224) | N/A | 1 |
| SymbolNet [61] | 71 | 2 (10 ns) | <0.1 (3) | 0.01 (177) | <0.01 (109) | 1 |
| HGQ-1 | 76.4 | 6 (30 ns) | 0.50 (34) | 0.53 (6,236) | 0.05 (1,253) | 1 |
| HGQ-2 | 75.9 | 4 (20 ns) | 0.09 (6) | 0.27 (3,162) | 0.02 (550) | 1 |
| HGQ-3 | 75.0 | 4 (20 ns) | 0.07 (5) | 0.13 (1,540) | 0.02 (370) | 1 |
| HGQ-4 | 73.9 | 3 (15 ns) | 0.00 (0) | 0.05 (565) | 0.01 (140) | 1 |
| HGQ-5 | 72.5 | 2 (10 ns) | 0.00 (0) | 0.04 (468) | 0.01 (131) | 1 |
| HGQ-6 | 71.0 | 2 (10 ns) | 0.00 (0) | 0.02 (256) | 0.00 (66) | 1 |
| HGQ-c1 | 76.3 | 8 (40 ns) | 0.26 (18) | 0.50 (5,899) | 0.09 (2,072) | 1 |
| HGQ-c2 | 74.2 | 3 (15 ns) | 0.00 (0) | 0.06 (678) | 0.01 (172) | 1 |
Fig. III. Accuracy versus resource consumption of the jet tagging models.
Note that models with different DSP and LUT usage could land on the same
point on this plot due to the linear combination of DSPs and LUTs.
BP-DSP-RF=3 is cited from [38], where the network is implemented in QKeras with a reuse factor of three, and pruned in a DSP-aware fashion to reduce the resource consumption.
The HGQ trained models, HGQ 1 through 7, are taken from a single training run during which the β value was gradually increased. We initialize the models using 6 bits for the floating (fractional) part of the activations and 6 bits in total for the weights. The β value was systematically increased from 10⁻⁷ to 10⁻⁴ over approximately 12,000 epochs. Completing this training process required about 10 hours on a standard consumer-grade GPU.
This model is too large to fit on-chip fully unrolled; instead, we utilize io_stream in hls4ml to partition the convolutional layers into smaller blocks. The resource consumption is estimated as the sum of the resource consumption of each block. For this reason, intra-layer heterogeneous activation quantization cannot be utilized, and only inter-layer heterogeneous weight quantization is performed. Nevertheless, HGQ still outperforms both baselines by a considerable margin of up to 30% in resource savings while maintaining similar accuracy and latency.
C. Muon Tracker
We also compare the resolution, latency, and on-chip re-
source consumption from HGQ trained models to a regression
task proposed in [58].
This task involves predicting the polar angle of a simulated muon track in a simplified detector. The inputs are one 3 × 50 and two 3 × 50 binary-valued arrays, representing the hit patterns in three detector layers. The output is a single scalar value representing the polar angle of the track in milliradians. The architecture of the model is a multistage neural network taken from the original work [58]. The exact model architecture is available in Fig. VIII in the extended data.
The results are presented in Tab. III and Fig. V. The Qf
models presented in [58] are all trained quantization aware
with
QKeras
using manually tuned parameters.
The HGQ trained models, HGQ 1 through 7, are taken from a single training run during which the β value was gradually increased. We initialize the models using 6 bits for the floating (fractional) part of the activations and 6 bits in total for the weights. The β value was systematically increased from 3 × 10⁻⁶ to 6 × 10⁻⁴ over approximately 600,000 epochs. The complete run takes around 16 hours on a single consumer-grade GPU.
TABLE II
Resource usage and latency of the convolutional SVHN classifier models. Reported resource usage for HGQ models is after place & route. In this task, the HGQ-0.4 and HGQ-1.5 models outperform the baseline AQ and AQP models by a large margin in accuracy, while also using fewer resources.

| Model | Accuracy (%) | Latency (cc) | DSP (%) | LUT (%) | FF (%) | BRAM (%) | II (cc) |
| BF 14-bit [57] | 87 | 1,035 (5.18 μs) | 93.23 (6,377) | 19.36 (228,823) | 3.40 (80,278) | 3.08 (66.5) | 1,030 |
| BP 14-bit [57] | 93 | 1,035 (5.18 μs) | 48.85 (3,341) | 12.27 (145,089) | 2.77 (65,482) | 3.08 (66.5) | 1,030 |
| Q 7-bit [57] | 94 | 1,034 (5.17 μs) | 2.56 (175) | 12.77 (150,981) | 1.51 (35,628) | 3.10 (67.0) | 1,029 |
| QP 7-bit [57] | 94 | 1,035 (5.18 μs) | 2.54 (174) | 9.40 (111,152) | 1.38 (32,554) | 3.10 (67.0) | 1,030 |
| AQ [57] | 88 | 1,059 (5.30 μs) | 1.05 (72) | 4.06 (48,027) | 0.64 (15,242) | 1.48 (32.5) | 1,029 |
| AQP [57] | 88 | 1,059 (5.30 μs) | 1.02 (70) | 3.28 (38,795) | 0.63 (14,802) | 1.39 (30.5) | 1,029 |
| BP-DSP-RF=3 [38] | 92 | ? (43.58 μs) | 17.76 (1,215) | 5.01 (59,279) | 1.97 (46,584) | 35.88 (1,550) | 35.88 |
| HGQ-1 | 93.9 | 1,050 (5.25 μs) | 0.85 (58) | 5.87 (69,407) | 1.18 (27,853) | 1.48 (32.0) | 1,029 |
| HGQ-2 | 93.1 | 1,061 (5.31 μs) | 0.44 (30) | 4.00 (47,314) | 0.87 (20,582) | 1.30 (28.0) | 1,029 |
| HGQ-3 | 91.9 | 1,058 (5.29 μs) | 0.22 (15) | 3.39 (40,032) | 0.76 (18,087) | 1.09 (23.5) | 1,029 |
| HGQ-4 | 90.9 | 1,059 (5.30 μs) | 0.19 (13) | 2.91 (34,435) | 0.73 (17,261) | 1.04 (22.5) | 1,029 |
| HGQ-5 | 89.9 | 1,056 (5.28 μs) | 0.15 (10) | 2.60 (30,766) | 0.64 (15,205) | 0.97 (21.0) | 1,029 |
| HGQ-6 | 88.8 | 1,056 (5.28 μs) | 0.09 (6) | 2.37 (27,982) | 0.62 (14,736) | 0.97 (21.0) | 1,029 |
Fig. IV. Accuracy versus resource usage of the SVHN Classifier models.
Note that models with different DSP and LUT consumption could land on
the same point on this plot due to taking a linear combination of DSPs and
LUTs.
The HGQ models consistently outperform the baseline mod-
els with a
40
% reduction in resource consumption, while
maintaining the same resolution with comparable latency.
VI. CONCLUSION AND FUTURE WORK
In this work, we present HGQ, a novel method to optimize
quantized neural networks for real-time applications on Field-
Programmable Gate Arrays (FPGAs). Maximally leveraging
the ability of FPGAs to perform fully heterogeneous compu-
tation, we introduce a new algorithm for precisely determining
the optimal quantization precision for each weight and acti-
vation to minimize resource consumption without sacrificing
the accuracy of the original model. To facilitate adoption,
we have developed a user-friendly library that simplifies the application of this method. The HGQ approach enables the optimization of quantization bitwidths at arbitrary granularity, up to the individual parameter level, through a gradient descent approach that is conscious of both resource use and loss minimization. Additionally, the library offers an easy-to-use interface for defining quantized neural networks and training them with our method, as well as for deploying these networks on FPGAs by integrating with hls4ml.

Fig. V. Resolution versus resource consumption of the muon tracking models. Note that models with different DSP and LUT consumption could land on the same point on this plot as a result of taking the linear combination of DSPs and LUTs.
Our findings show that HGQ achieves up to a 95% reduction
in resource consumption compared to leading compression
techniques, without compromising performance. We further
demonstrate that a singular training session with HGQ is
sufficient to explore a broad spectrum of trade-offs between
performance and resource utilization, efficiently recovering
the Pareto frontier, thereby rendering the model optimization
process both more efficient and effective. Through its interface with hls4ml, HGQ provides a bit-accurate conversion from software to FPGA firmware models without the need for user interaction, significantly simplifying and streamlining the workflow from training to deployment. Moreover, we introduce EBOPs, a metric providing an accurate estimation of the final on-chip resource consumption as a linear combination of LUTs and DSPs. This estimation is available at training time, allowing for efficient software-hardware co-design.

TABLE III
Resource consumption and latency of the Muon Tracker models. The resource usage reported for HGQ models is after place & route. In this task, HGQ-1.25 outperforms the baseline Qf6 model in both accuracy and resource consumption, while HGQ-3.00 outperforms the baseline Qf5 model in both accuracy and resource consumption.

| Model | Resolution (mrad) | Latency (cc) | DSP (%) | LUT (%) | FF (%) | BRAM (%) | II (cc) |
| Qf8 [58] | 1.95 | 17 (106.3 ns) | 57.4 (1,762) | 8.8 (37,867) | 1.0 (8,443) | 5.6 (37.5) | 1 |
| Qf7 [58] | 1.97 | 11 (68.8 ns) | 45.2 (1,389) | 8.0 (34,848) | 0.6 (5,433) | 5.6 (37.5) | 1 |
| Qf6 [58] | 2.04 | 13 (81.3 ns) | 10.5 (324) | 12.6 (54,638) | 0.8 (6,525) | 5.6 (37.5) | 1 |
| Qf5 [58] | 2.15 | 11 (68.8 ns) | 2.9 (88) | 9.3 (40,039) | 0.4 (3,419) | 5.6 (37.5) | 1 |
| Qf4 [58] | 2.45 | 10 (62.5 ns) | 0.8 (24) | 6.6 (28,526) | 0.3 (2,954) | 5.6 (37.5) | 1 |
| Qf3 [58] | 2.78 | 9 (56.3 ns) | 0.0 (2) | 5.0 (21,682) | 0.3 (2,242) | 5.6 (37.5) | 1 |
| HGQ-1 | 1.95 | 11 (68.8 ns) | 17.0 (522) | 9.12 (39,413) | 0.70 (6,043) | 1.16 (25.0) | 1 |
| HGQ-2 | 2.00 | 11 (68.8 ns) | 5.01 (154) | 7.98 (34,460) | 0.61 (5,263) | 1.16 (25.0) | 1 |
| HGQ-3 | 2.09 | 12 (75.0 ns) | 2.21 (68) | 5.77 (24,941) | 0.54 (4,677) | 1.74 (37.5) | 1 |
| HGQ-4 | 2.20 | 13 (81.3 ns) | 1.33 (41) | 4.99 (21,557) | 0.54 (4,699) | 1.74 (37.5) | 1 |
| HGQ-5 | 2.39 | 10 (62.5 ns) | 0.88 (27) | 3.92 (16,918) | 0.29 (2,484) | 1.74 (37.5) | 1 |
| HGQ-6 | 2.63 | 12 (75.0 ns) | 0.33 (10) | 3.08 (13,306) | 0.40 (3,429) | 1.16 (25.0) | 1 |
In the future, we plan to extend support for more operations
and layers. We also aim to support other training back-ends,
such as PyTorch [49] and JAX [64]. Furthermore, we plan to include energy estimates as well as more fine-grained resource estimations in the library.
VII. CODE AVAILABILITY
We have made our library publicly available under the
Apache 2.0 license at https://www.github.com/calad0i/HGQ.
The scripts to reproduce the results in this paper are also avail-
able at https://www.github.com/calad0i/HGQ-demos under the
Apache 2.0 license.
To use this library, one needs a forked version of hls4ml available at https://www.github.com/calad0i/hls4ml#HGQ-integration. The fork will be merged into the main hls4ml repository in the future, and one may check https://github.com/fastmachinelearning/hls4ml/pull/914 for the pull request status.
VIII. DATA AVAILABILITY
The data used for training and evaluation in this work are all publicly available datasets. The jet tagging dataset is available at https://dx.doi.org/10.5281/zenodo.2603255. The SVHN dataset is available at http://ufldl.stanford.edu/housenumbers/. The muon tracking dataset is available at https://dx.doi.org/10.57967/hf/2084. Results shown in this work can be reproduced using the code available at https://www.github.com/calad0i/HGQ-demos.
IX. AUTHOR CONTRIBUTIONS
C.S. conceived, designed, and implemented the HGQ
method and library and performed the experiments. C.S. and
V.L. implemented HGQ support in hls4ml. C.S. and T.A.
wrote the manuscript. All authors reviewed and edited the
manuscript.
X. ACKNOWLEDGEMENTS
C.S. is supported by the Caltech Danny Koh grad fellowship. C.S. and M.S. acknowledge support from the U.S. Department of Energy (DOE), Office of Science, Office of High Energy Physics grant DE-SC0011925. T.Å. is supported by the Swiss National Science Foundation Grant No. PZ00P2_201594. J.N. is supported by the U.S. Department of Energy (DOE), Office of Science, Office of High Energy Physics "Designing efficient edge AI with physics phenomena" Project (DE-FOA-0002705). V.L. is supported by the NSF Institute for Accelerated AI Algorithms for Data-Driven Discovery (A3D3), under the NSF grant #PHY-2117997.
XI. COMPETING INTERESTS
The authors declare no competing interests.
REFERENCES
[1] Singh, R. & Gill, S. S. Edge ai: A survey.
Internet of Things and Cyber-
Physical Systems
3
, 71–92 (2023). URL https://www.sciencedirect.com/
science/article/pii/S2667345223000196.
[2] Niu, W.
et al.
Grim: A general, real-time deep learning inference
framework for mobile devices based on fine-grained structured weight
sparsity.
IEEE Trans. Pattern Anal. Mach. Intell.
44
, 6224–6239 (2022).
URL https://doi.org/10.1109/TPAMI.2021.3089687.
[3] Huang, K. & Gao, W. Real-time neural network inference on extremely
weak devices: agile offloading with explainable ai.
In
Proceedings
of the 28th Annual International Conference on Mobile Computing
And Networking
, MobiCom ’22, 200–213 (Association for Computing
Machinery, New York, NY, USA, 2022). URL https://doi.org/10.1145/
3495243.3560551.
[4] Yang, Y.
et al.
Streamvc: Real-time low-latency voice conversion (2024).
URL https://google-research.github.io/seanet/stream
vc/.
[5] The CMS Collaboration. The Phase-2 Upgrade of the CMS Level-1
Trigger. Tech. Rep., CERN, Geneva (2020). URL https://cds.cern.ch/
record/2714892. Final version.
[6] The ATLAS Collaboration. Technical Design Report for the Phase-II
Upgrade of the ATLAS TDAQ System. Tech. Rep., CERN, Geneva
(2017). URL https://cds.cern.ch/record/2285584.
[7] Zurbano Fernandez, I.
et al.
High-Luminosity Large Hadron Collider
(HL-LHC): Technical design report.
CERN Yellow Reports: Monographs
10/2020
(2020).
11
[8] Menghani, G. Efficient deep learning: A survey on making deep learning
models smaller, faster, and better.
ACM Computing Surveys
55
, 1 – 37
(2021). URL https://api.semanticscholar.org/CorpusID:235446458.
[9] Li, Z., Li, H. & Meng, L.
Model compression for deep neural
networks: A survey.
Computers
12
(2023). URL https://www.mdpi.
com/2073-431X/12/3/60.
[10] Coelho, C. N.
et al.
Automatic heterogeneous quantization of deep
neural networks for low-latency inference on the edge for particle
detectors.
Nature Machine Intelligence
3
, 675–686 (2021).
URL
https://doi.org/10.1038%2Fs42256-021-00356-5.
[11] Ngadiuba, J.
et al.
Compressing deep neural networks on fpgas to
binary and ternary precision with hls4ml.
Machine Learning: Science
and Technology
2
, 015001 (2020).
URL https://dx.doi.org/10.1088/
2632-2153/aba042.
[12] Zhou, S.
et al.
Dorefa-net: Training low bitwidth convolutional neural
networks with low bitwidth gradients.
CoRR
abs/1606.06160
(2016).
URL http://arxiv.org/abs/1606.06160. 1606.06160.
[13] Lin, X., Zhao, C. & Pan, W.
Towards accurate binary convolu-
tional neural network. In Guyon, I.
et al.
(eds.)
Advances in Neu-
ral Information Processing Systems
, vol. 30 (Curran Associates, Inc.,
2017). URL https://proceedings.neurips.cc/paper
files/paper/2017/file/
b1a59b315fc9a3002ce38bbe070ec3f5-Paper.pdf.
[14] Courbariaux, M., Bengio, Y. & David, J. Binaryconnect: Training deep neural networks with binary weights during propagations. CoRR abs/1511.00363 (2015). URL http://arxiv.org/abs/1511.00363. 1511.00363.
[15] Rastegari, M., Ordonez, V., Redmon, J. & Farhadi, A. Xnor-net: Imagenet classification using binary convolutional neural networks. In Leibe, B., Matas, J., Sebe, N. & Welling, M. (eds.) Computer Vision – ECCV 2016, 525–542 (Springer International Publishing, Cham, 2016).
[16] Li, F., Liu, B., Wang, X., Zhang, B. & Yan, J. Ternary weight networks (2022). 1605.04711.
[17] Zhu, C., Han, S., Mao, H. & Dally, W. J. Trained ternary quantization (2017). 1612.01064.
[18] He, Z. & Fan, D. Simultaneously optimizing weight and quantizer of ternary neural network using truncated gaussian approximation. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 11430–11438 (2018).
[19] Xu, C. et al. Alternating multi-bit quantization for recurrent neural networks. CoRR abs/1802.00150 (2018). URL http://arxiv.org/abs/1802.00150. 1802.00150.
[20] Guo, Y., Yao, A., Zhao, H. & Chen, Y. Network sketching: Exploiting binary structure in deep cnns. CoRR abs/1706.02021 (2017). URL http://arxiv.org/abs/1706.02021. 1706.02021.
[21] Zhang, D., Yang, J., Ye, D. & Hua, G. Lq-nets: Learned quantization for highly accurate and compact deep neural networks. CoRR abs/1807.10029 (2018). URL http://arxiv.org/abs/1807.10029. 1807.10029.
[22] Qu, Z., Zhou, Z., Cheng, Y. & Thiele, L. Adaptive loss-aware quantization for multi-bit networks. CoRR abs/1912.08883 (2019). URL http://arxiv.org/abs/1912.08883. 1912.08883.
[23] Chang, S.-E. et al. Mix and match: A novel fpga-centric deep neural network quantization framework. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 208–220 (2021).
[24] Wang, K., Liu, Z., Lin, Y., Lin, J. & Han, S. Hardware-centric automl for mixed-precision quantization. International Journal of Computer Vision 128, 2035–2048 (2020). URL https://doi.org/10.1007/s11263-020-01339-6.
[25] Dong, Z., Yao, Z., Gholami, A., Mahoney, M. W. & Keutzer, K. HAWQ: hessian aware quantization of neural networks with mixed-precision. CoRR abs/1905.03696 (2019). URL http://arxiv.org/abs/1905.03696. 1905.03696.
[26] Dong, Z. et al. HAWQ-V2: hessian aware trace-weighted quantization of neural networks. CoRR abs/1911.03852 (2019). URL http://arxiv.org/abs/1911.03852. 1911.03852.
[27] Yao, Z., Gholami, A., Keutzer, K. & Mahoney, M. W. Pyhessian: Neural networks through the lens of the hessian. 2020 IEEE International Conference on Big Data (Big Data) 581–590 (2019). URL https://api.semanticscholar.org/CorpusID:209376531.
[28] Choi, J. et al. Bridging the accuracy gap for 2-bit quantized neural networks (QNN). CoRR abs/1807.06964 (2018). URL http://arxiv.org/abs/1807.06964. 1807.06964.
[29] Wu, B. et al. Mixed precision quantization of convnets via differentiable neural architecture search. CoRR abs/1812.00090 (2018). URL http://arxiv.org/abs/1812.00090. 1812.00090.
[30] Que, Z. et al. Metaml: Automating customizable cross-stage design-flow for deep learning acceleration. In 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), 248–252 (2023).
[31] Park, E., Yoo, S. & Vajda, P. Value-aware quantization for training and inference of neural networks. In Ferrari, V., Hebert, M., Sminchisescu, C. & Weiss, Y. (eds.) Computer Vision – ECCV 2018, 608–624 (Springer International Publishing, Cham, 2018).
[32] Dettmers, T., Lewis, M., Shleifer, S. & Zettlemoyer, L. 8-bit optimizers via block-wise quantization. 9th International Conference on Learning Representations, ICLR (2022).
[33] Dettmers, T. et al. Spqr: A sparse-quantized representation for near-lossless llm weight compression (2023). 2306.03078.
[34] Lou, Q., Guo, F., Kim, M., Liu, L. & Jiang, L. Autoq: Automated kernel-wise neural network quantization. In International Conference on Learning Representations (2020). URL https://openreview.net/forum?id=rygfnn4twS.
[35] Sun, M. et al. Film-qnn: Efficient fpga acceleration of deep neural networks with intra-layer, mixed-precision quantization. Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (2022). URL https://doi.org/10.1145/3490422.3502364.
[36] Le Cun, Y., Denker, J. S. & Solla, S. A. Optimal brain damage. In Proceedings of the 2nd International Conference on Neural Information Processing Systems, NIPS'89, 598–605 (MIT Press, Cambridge, MA, USA, 1989).
[37] Hassibi, B., Stork, D. & Wolff, G. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, 293–299 vol.1 (1993).
[38] Ramhorst, B., Constantinides, G. A. & Loncar, V. Fpga resource-aware structured pruning for real-time neural networks (2023). 2308.05170.
[39] Meng, F. et al. Pruning filter in filter. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. & Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, 17629–17640 (Curran Associates, Inc., 2020). URL https://proceedings.neurips.cc/paper_files/paper/2020/file/ccb1d45fb76f7c5a0bf619f979c6cf36-Paper.pdf.
[40] Li, Y. et al. Differentiable transportation pruning (2023). 2307.08483.
[41] Zhang, S., Wang, M., Liu, S., Chen, P.-Y. & Xiong, J. Why lottery ticket wins? a theoretical perspective of sample complexity on sparse neural networks. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P. & Vaughan, J. W. (eds.) Advances in Neural Information Processing Systems, vol. 34, 2707–2720 (Curran Associates, Inc., 2021). URL https://proceedings.neurips.cc/paper_files/paper/2021/file/15f99f2165aa8c86c9dface16fefd281-Paper.pdf.
[42] Vischer, M. A., Lange, R. T. & Sprekeler, H. On lottery tickets and minimal task representations in deep reinforcement learning. In International Conference on Learning Representations (2022). URL https://openreview.net/forum?id=Fl3Mg_MZR-.
[43] Frankle, J. & Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (2019). URL https://openreview.net/forum?id=rJl-b3RcF7.
[44] Miao, L. et al. Learning pruning-friendly networks via frank-wolfe: One-shot, any-sparsity, and no retraining. In International Conference on Learning Representations (2022). URL https://openreview.net/forum?id=O1DEtITim_.
[45] Chijiwa, D., Yamaguchi, S. y., Ida, Y., Umakoshi, K. & INOUE, T. Pruning randomly initialized neural networks with iterative randomization. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P. & Vaughan, J. W. (eds.) Advances in Neural Information Processing Systems, vol. 34, 4503–4513 (Curran Associates, Inc., 2021). URL https://proceedings.neurips.cc/paper_files/paper/2021/file/23e582ad8087f2c03a5a31c125123f9a-Paper.pdf.
[46] Abadi, M. et al. TensorFlow: Large-scale machine learning on heterogeneous systems (2015). URL https://www.tensorflow.org/. Software available from tensorflow.org.
[47] Fahim, F. et al. hls4ml: An open-source codesign workflow to empower scientific low-power machine learning devices. CoRR abs/2103.05579 (2021). URL https://arxiv.org/abs/2103.05579. 2103.05579.
[48] Alessandro, Franco, G., nickfraser, Umuroglu, Y. & vfdev. Xilinx/brevitas: Release version 0.2.1 (2021). URL https://doi.org/10.5281/zenodo.4507794.
[49] Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. & Garnett, R. (eds.) Proceedings of the 33rd International Conference on Neural Information Processing Systems (Curran Associates Inc., Red Hook, NY, USA, 2019).
[50] Umuroglu, Y. et al. FINN: A framework for fast, scalable binarized neural network inference. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (ACM Press, 2017). 1612.07119.
[51] Blott, M. et al. FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks. ACM Trans. Reconfigurable Technol. Syst. 11 (2018). 1809.04570.
[52] Bengio, Y., Léonard, N. & Courville, A. C. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR abs/1308.3432 (2013). URL http://arxiv.org/abs/1308.3432. 1308.3432.
[53] Baskin, C. et al. UNIQ. ACM Transactions on Computer Systems 37, 1–15 (2019). URL https://doi.org/10.1145%2F3444943.
[54] Elthakeb, A. T. et al. Waveq: Gradient-based deep quantization of neural networks through sinusoidal adaptive regularization (2020). 2003.00146.
[55] Nguyen, H. D., Alexandridis, A. & Mouchtaris, A. Quantization aware training with absolute-cosine regularization for automatic speech recognition. In Interspeech (2020). URL https://api.semanticscholar.org/CorpusID:226203265.
[56] Chollet, F. et al. Keras. https://keras.io (2015).
[57] Aarrestad, T. et al. Fast convolutional neural networks on fpgas with hls4ml. Machine Learning: Science and Technology 2, 045015 (2021). URL https://dx.doi.org/10.1088/2632-2153/ac0ea1.
[58] Sun, C., Nakajima, T., Mitsumori, Y., Horii, Y. & Tomoto, M. Fast muon tracking with machine learning implemented in fpga. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 1045, 167546 (2023). URL http://dx.doi.org/10.1016/j.nima.2022.167546.
[59] Pierini, M., Duarte, J. M., Tran, N. & Freytsis, M. Hls4ml lhc jet dataset (150 particles) (2020). URL https://doi.org/10.5281/zenodo.3602260.
[60] Umuroglu, Y., Akhauri, Y., Fraser, N. J. & Blott, M. Logicnets: Co-designed neural networks and circuits for extreme-throughput applications. 2020 30th International Conference on Field-Programmable Logic and Applications (FPL) 291–297 (2020). URL https://doi.org/10.1109/FPL50879.2020.00055.
[61] Tsoi, H. F., Loncar, V., Dasu, S. & Harris, P. Symbolnet: Neural symbolic regression with adaptive dynamic pruning (2024). 2401.09949.
[62] Netzer, Y. et al. Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011).
[63] LeCun, Y. et al. Backpropagation applied to handwritten zip code recognition. Neural Computation 1, 541–551 (1989). URL https://api.semanticscholar.org/CorpusID:41312633.
[64] Bradbury, J. et al. JAX: composable transformations of Python+NumPy programs (2018). URL http://github.com/google/jax.
[65] Tange, O. Gnu parallel 20240122 ('frederik x') (2023). URL https://doi.org/10.5281/zenodo.10558745. GNU Parallel is a general parallelizer to run multiple serial command line programs in parallel without changing them.
XII. EXTENDED DATA
A. Architectures and networks