H-ViT: hardware-friendly post-training quantization for efficient vision transformer inference
Jing Liu, Jiaqi Lai, Xiaodong Deng, Caigui Jiang, Nanning Zheng
Autonomous Intelligent Systems, 2025, Vol. 5, Issue 1: 32
Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision tasks. However, these models are memory-consuming and computation-intensive, which makes their deployment and efficient inference on edge devices challenging. Model quantization is a promising approach to reducing model complexity. Prior works have explored quantization algorithms tailored to ViTs but unfortunately retained floating-point (FP) scaling factors, which not only incur non-negligible re-quantization overhead but also prevent the quantized models from performing efficient integer-only inference. In this paper, we propose H-ViT, a dedicated post-training quantization scheme (e.g., symmetric uniform quantization and layer-wise quantization for both weights and part of the activations) that effectively quantizes ViTs with fewer Power-of-Two (PoT) scaling factors, thereby minimizing the re-quantization overhead and memory consumption. In addition, observing severe inter-channel variation in LayerNorm inputs and outputs, we propose a power-of-two quantization method that systematically reduces the resulting performance degradation without introducing hyper-parameters. Extensive experiments on multiple vision tasks with different model variants show that H-ViT with PoT scaling factors offers INT8 quantization performance comparable to (or even slightly higher than) its counterpart with floating-point scaling factors. For instance, we reach 78.43% top-1 accuracy with DeiT-S on ImageNet, and 51.6 box AP and 44.8 mask AP with Cascade Mask R-CNN (Swin-B) on COCO.
Vision Transformers / Post-training quantization / Power-of-Two scaling factors / Hardware deployment
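To make the benefit of PoT scaling factors concrete, the sketch below shows, in plain NumPy, how symmetric layer-wise quantization with a scale rounded to the nearest power of two lets the re-quantization of an integer accumulator be done with a bit shift instead of a floating-point multiply. This is a minimal illustration of the general technique under assumed INT8 settings, not the authors' H-ViT implementation; the helpers pot_scale and quantize and the toy tensors are hypothetical.

```python
import numpy as np

def pot_scale(x, num_bits=8):
    """Symmetric, layer-wise scaling factor snapped to the nearest power of two.

    Returns the PoT scale and its integer exponent, so later re-quantization
    can be expressed as a bit shift (illustrative helper, not from the paper).
    """
    qmax = 2 ** (num_bits - 1) - 1
    fp_scale = np.max(np.abs(x)) / qmax          # ordinary floating-point scale
    exponent = int(np.round(np.log2(fp_scale)))  # round exponent to nearest power of two
    return 2.0 ** exponent, exponent

def quantize(x, scale, num_bits=8):
    """Symmetric uniform quantization to signed integers."""
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)   # stand-in for a weight matrix
a = rng.normal(size=(64,)).astype(np.float32)      # stand-in for an activation vector

s_w, e_w = pot_scale(w)
s_a, e_a = pot_scale(a)
acc = quantize(w, s_w).astype(np.int32) @ quantize(a, s_a).astype(np.int32)

# Re-quantize the INT32 accumulator to the output's INT8 domain.
# Because every scale is a power of two, acc * s_w * s_a / s_out collapses
# into a single arithmetic shift.
_, e_out = pot_scale(w @ a)
shift = e_out - (e_w + e_a)                        # positive here, i.e. a right shift
q_out = np.clip(acc >> shift, -128, 127).astype(np.int8)
```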