Published August 4, 2024 | Submitted
Discussion Paper Open

Gradient-based Automatic Per-Weight Mixed Precision Quantization for Neural Networks On-Chip

  • 1. California Institute of Technology
  • 2. ETH Zurich

Abstract

Model size and inference speed at deployment time are major challenges in many deep learning applications. A promising strategy to overcome these challenges is quantization. However, straightforward uniform quantization to very low precision can result in significant accuracy loss. Mixed-precision quantization, based on the idea that certain parts of the network can accommodate lower precision than others without compromising performance, offers a potential solution. In this work, we present High Granularity Quantization (HGQ), an innovative quantization-aware training method that automatically fine-tunes the per-weight and per-activation precision of ultra-low latency, low power neural networks to be deployed on FPGAs. We demonstrate that HGQ can outperform existing methods by a substantial margin, achieving resource reduction by up to a factor of 20 and latency improvement by a factor of 5 while preserving accuracy.
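The core idea in the abstract — giving each weight its own fixed-point precision and treating the bitwidth as a quantity to be optimized — can be illustrated with a minimal sketch. This is not the HGQ library API; the function names, the rounding scheme, and the bit-cost proxy below are illustrative assumptions. In HGQ-style training the per-weight bitwidths are made differentiable (e.g. via a straight-through estimator) and optimized jointly with the weights against a resource penalty.

```python
import numpy as np

def quantize(w, f):
    """Round each weight onto its own fixed-point grid with f fractional bits.

    With per-weight f, the quantization step for weight i is 2**-f[i]:
    weights that need fine resolution keep more bits, others keep fewer.
    """
    scale = 2.0 ** np.round(f)
    return np.round(w * scale) / scale

def bit_cost(f):
    """Crude proxy for on-chip resource usage: total fractional bits kept."""
    return np.sum(np.maximum(np.round(f), 0.0))

# Hypothetical weights and per-weight bitwidths (as if learned by training).
w = np.array([0.7312, -0.0481, 0.2504])
f = np.array([6.0, 2.0, 4.0])

qw = quantize(w, f)        # per-weight quantized values
err = np.abs(qw - w)       # per-weight quantization error
cost = bit_cost(f)         # total bits, the quantity a resource penalty would shrink
```

A gradient-based scheme would then trade `err` (accuracy) against `cost` (resources), lowering `f[i]` wherever the network tolerates coarser precision.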

Code Availability

We have made our library publicly available under the Apache 2.0 license at https://www.github.com/calad0i/HGQ. The scripts to reproduce the results in this paper are also available at https://www.github.com/calad0i/HGQ-demos under the Apache 2.0 license.

Data Availability

The data used for training and evaluation in this work are all publicly available datasets. The jet tagging dataset is available at https://dx.doi.org/10.5281/zenodo.2603255. The SVHN dataset is available at http://ufldl.stanford.edu/housenumbers/. The muon tracking dataset is available at https://dx.doi.org/10.57967/hf/2084. Results shown in this work can be reproduced using the code available at https://www.github.com/calad0i/HGQ-demos.

Acknowledgement

C.S. is partially supported by the Caltech Danny Koh grad fellowship. C.S. acknowledges partial support from Gunther Dissertori. C.S. and M.S. acknowledge partial support from the U.S. Department of Energy (DOE), Office of Science, Office of High Energy Physics grant DE-SC0011925. T.Å. is supported by the Swiss National Science Foundation Grant No. PZ00P2 201594. J.N., M.S., and C.S. are partially supported by the U.S. Department of Energy (DOE), Office of Science, Office of High Energy Physics "Designing efficient edge AI with physics phenomena" Project (DE-FOA-0002705). J.N. is partially supported by the AI2050 program at Schmidt Futures (Grant G-23-64934). V.L. is supported by the NSF Institute for Accelerated AI Algorithms for Data-Driven Discovery (A3D3), under the NSF grant #PHY-2117997.

Additional Information

C.S. conceived, designed, and implemented the HGQ method and library and performed the experiments. C.S. and V.C. implemented HGQ support in hls4ml. C.S. and T.Å. wrote the manuscript. All authors reviewed and edited the manuscript.

Conflict of Interest

The authors declare no competing interests.

Attached Files

Discussion paper on arXiv: 2405.00645v1.pdf
Paper submitted for publication: HGQ_NML.pdf

Files

2405.00645v2.pdf
Files (1.8 MB)

Additional details

Created:
August 14, 2024
Modified:
August 14, 2024