Gradient-based Automatic Per-Weight Mixed Precision Quantization for Neural Networks On-Chip
Abstract
Model size and inference speed at deployment time are major challenges in many deep learning applications. A promising strategy to overcome these challenges is quantization. However, straightforward uniform quantization to very low precision can result in significant accuracy loss. Mixed-precision quantization, based on the idea that certain parts of the network can accommodate lower precision without compromising performance relative to other parts, offers a potential solution. In this work, we present High Granularity Quantization (HGQ), an innovative quantization-aware training method that automatically fine-tunes the per-weight and per-activation precision for ultra-low latency, low power neural networks to be deployed on FPGAs. We demonstrate that HGQ can outperform existing methods by a substantial margin, achieving resource reduction by up to a factor of 20 and latency improvement by a factor of 5 while preserving accuracy.
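To illustrate the per-weight granularity idea (not the HGQ implementation itself), the sketch below quantizes each weight of a layer to its own fixed-point grid. The function name, shapes, and bitwidth values are illustrative assumptions; in quantization-aware training, a surrogate gradient (e.g., a straight-through estimator) would make the bitwidths themselves trainable.

```python
import numpy as np

def quantize(w, bits):
    """Round each weight to its own fixed-point grid.

    `bits` has the same shape as `w`: one fractional bitwidth per
    weight, which is the per-weight granularity idea. The LSB of
    weight i is 2**-bits[i]. (Illustrative sketch, not the HGQ API.)
    """
    scale = 2.0 ** bits
    return np.round(w * scale) / scale

w    = np.array([0.8731, -0.4012, 0.0625, -1.22])
bits = np.array([2, 4, 6, 3])  # hypothetical per-weight bitwidths

wq = quantize(w, bits)
# wq == [0.75, -0.375, 0.0625, -1.25]: each weight snapped to its own grid
print(wq)
```

During training, the rounding would be bypassed in the backward pass (straight-through), so the loss gradient can flow to both the weights and, with a suitable surrogate, the bitwidths, letting the optimizer trade accuracy against on-chip resources.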
Code Availability
We have made our library publicly available under the Apache 2.0 license at https://www.github.com/calad0i/HGQ. The scripts to reproduce the results in this paper are also available at https://www.github.com/calad0i/HGQ-demos under the Apache 2.0 license.
Data Availability
The data used for training and evaluation in this work are all publicly available datasets. The jet tagging dataset is available at https://dx.doi.org/10.5281/zenodo.2603255. The SVHN dataset is available at http://ufldl.stanford.edu/housenumbers/. The muon tracking dataset is available at https://dx.doi.org/10.57967/hf/2084. Results shown in this work can be reproduced
using the code available at https://www.github.com/calad0i/HGQ-demos.
Acknowledgement
C.S. is partially supported by the Caltech Danny Koh grad fellowship. C.S. acknowledges partial support from Gunther Dissertori. C.S. and M.S. acknowledge partial support from the U.S. Department of Energy (DOE), Office of Science, Office of High Energy Physics grant DE-SC0011925. T.Å. is supported by the Swiss National Science Foundation Grant No. PZ00P2 201594. J.N., M.S., and C.S. are partially supported by the U.S. Department of Energy (DOE), Office of Science, Office of High Energy Physics “Designing efficient edge AI with physics phenomena” Project (DEFOA0002705). J.N. is partially supported by the AI2050 program at Schmidt Futures (Grant G-23-64934). V.L. is supported by the NSF Institute for Accelerated AI Algorithms for Data-Driven Discovery (A3D3), under the NSF grant #PHY-2117997.
Additional Information
C.S. conceived, designed, and implemented the HGQ method and library and performed the experiments. C.S. and V.C. implemented HGQ support in hls4ml. C.S. and T.Å. wrote the manuscript. All authors reviewed and edited the manuscript.
Conflict of Interest
The authors declare no competing interests.
Attached Files
Discussion paper in ArXiv: 2405.00645v1.pdf
Paper submitted for publication: HGQ_NML.pdf
Additional details
- United States Department of Energy
- Accomplishments and Future Goals of Experimental Particle Physics at Caltech DE-SC0011925
- Swiss National Science Foundation
- PZ00P2 201594
- Office of High Energy Physics
- “Designing efficient edge AI with physics phenomena” Project DEFOA0002705
- Division of Physics
- PHY-2117997
- Submitted: 2024-05-01 (submitted paper)
- Updated: 2024-08-04 (updated paper)
- Publication Status: Submitted