H-ViT: hardware-friendly post-training quantization for efficient vision transformer inference
Jing Liu, Jiaqi Lai, Xiaodong Deng, Caigui Jiang, Nanning Zheng
Autonomous Intelligent Systems, 2025, Vol. 5, Issue 1: 32
Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision tasks. However, these models are memory-consuming and computation-intensive, which makes their deployment and efficient inference on edge devices challenging. Model quantization is a promising approach to reducing model complexity. Prior works have explored quantization algorithms tailored to ViTs but unfortunately retained floating-point (FP) scaling factors, which not only incur non-negligible re-quantization overhead but also prevent the quantized models from performing efficient integer-only inference. In this paper, we propose H-ViT, a dedicated post-training quantization scheme (e.g., symmetric uniform quantization and layer-wise quantization for both weights and part of the activations) that effectively quantizes ViTs with fewer Power-of-Two (PoT) scaling factors, thereby minimizing the re-quantization overhead and memory consumption. In addition, observing severe inter-channel variation in LayerNorm inputs and outputs, we propose a power-of-two quantization method that systematically reduces the resulting performance degradation without introducing hyper-parameters. Extensive experiments on multiple vision tasks with different model variants show that H-ViT with PoT scaling factors offers INT8 quantization performance comparable to (or even slightly higher than) its counterpart with floating-point scaling factors. For instance, we reach 78.43% top-1 accuracy with DeiT-S on ImageNet, and 51.6 box AP and 44.8 mask AP with Cascade Mask R-CNN (Swin-B) on COCO.
Vision Transformers / Post-training quantization / Power-of-Two scaling factors / Hardware deployment
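To make the benefit of PoT scaling factors concrete, the sketch below shows, in plain NumPy, how symmetric layer-wise quantization with a scale rounded to the nearest power of two lets the re-quantization of an integer accumulator be done with a bit shift instead of a floating-point multiply. This is a minimal illustration of the general technique under assumed INT8 settings, not the authors' H-ViT implementation; the helpers pot_scale and quantize and the toy tensors are hypothetical.

```python
import numpy as np

def pot_scale(x, num_bits=8):
    """Symmetric, layer-wise scaling factor snapped to the nearest power of two.

    Returns the PoT scale and its integer exponent, so later re-quantization
    can be expressed as a bit shift (illustrative helper, not from the paper).
    """
    qmax = 2 ** (num_bits - 1) - 1
    fp_scale = np.max(np.abs(x)) / qmax          # ordinary floating-point scale
    exponent = int(np.round(np.log2(fp_scale)))  # round exponent to nearest power of two
    return 2.0 ** exponent, exponent

def quantize(x, scale, num_bits=8):
    """Symmetric uniform quantization to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)   # stand-in for a weight matrix
a = rng.normal(size=(64,)).astype(np.float32)      # stand-in for an activation vector

s_w, e_w = pot_scale(w)
s_a, e_a = pot_scale(a)
acc = quantize(w, s_w).astype(np.int32) @ quantize(a, s_a).astype(np.int32)

# Re-quantize the INT32 accumulator to the output's INT8 domain.
# Because every scale is a power of two, acc * s_w * s_a / s_out collapses
# into a single arithmetic shift.
_, e_out = pot_scale(w @ a)
shift = e_out - (e_w + e_a)                        # positive here, i.e. a right shift
q_out = np.clip(acc >> shift, -128, 127).astype(np.int8)
```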