Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator

Yulong ZHAO, Chunzhi WU, Yizhuo WANG, Lufei ZHANG, Yaguang ZHANG, Wenyuan SHEN, Hao FAN, Hankang FANG, Yi QIN, Xin LIU

Front. Inform. Technol. Electron. Eng., 2025, 26(4): 605-622. DOI: 10.1631/FITEE.2400453


Abstract

Transformer models have become a cornerstone of various natural language processing (NLP) tasks. However, the substantial computational overhead incurred during inference remains a significant challenge, limiting their deployment in practical applications. In this study, we address this challenge by minimizing the inference overhead of transformer models using the controlling element on artificial intelligence (AI) accelerators. Our work is anchored by four key contributions. First, we conduct a comprehensive analysis of the overhead composition within the transformer inference process, identifying the primary bottlenecks. Second, we leverage the management processing element (MPE) of the Shenwei AI (SWAI) accelerator, implementing a three-tier scheduling framework that reduces the number of host-device launches to approximately 1/10 000 of that in the original PyTorch-GPU setup. Third, we introduce a zero-copy memory management technique based on segment-page fusion, which significantly reduces memory access latency and improves overall inference efficiency. Finally, we develop a fast model loading method that eliminates redundant computations during model verification and initialization, reducing the total loading time for large models from 22 128.31 ms to 1041.72 ms. Our contributions significantly enhance the optimization of transformer models, enabling more efficient and expedited inference on AI accelerators.
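
To make the second contribution concrete, the sketch below illustrates, in plain Python, the launch-batching idea behind a three-tier scheduling framework: operator commands are grouped into layer-level and model-level programs that a device-side controller (the MPE on SWAI, simulated here by an ordinary loop) replays after a single host launch, instead of the host issuing one launch per operator as in a stock PyTorch-GPU pipeline. The names Command, LayerProgram, and ModelProgram are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of three-tier launch batching. Tier 1: per-operator
# commands. Tier 2: per-layer programs. Tier 3: a whole-model program that
# the device-side controller replays after ONE host launch.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Command:
    """Tier 1: a single operator command (e.g., matmul, layernorm)."""
    name: str
    fn: Callable[[], None]


@dataclass
class LayerProgram:
    """Tier 2: all operator commands of one transformer layer."""
    commands: List[Command] = field(default_factory=list)


@dataclass
class ModelProgram:
    """Tier 3: the whole model, replayed on-device after one host launch."""
    layers: List[LayerProgram] = field(default_factory=list)

    def launch(self) -> int:
        # The single host->device launch; the device-side controller
        # (simulated here by a plain loop) walks the prebuilt program
        # without further host involvement.
        executed = 0
        for layer in self.layers:
            for cmd in layer.commands:
                cmd.fn()
                executed += 1
        return executed


def build_program(num_layers: int, ops_per_layer: int) -> ModelProgram:
    prog = ModelProgram()
    for i in range(num_layers):
        layer = LayerProgram()
        for j in range(ops_per_layer):
            layer.commands.append(Command(f"layer{i}_op{j}", lambda: None))
        prog.layers.append(layer)
    return prog


if __name__ == "__main__":
    prog = build_program(num_layers=24, ops_per_layer=12)
    n_ops = prog.launch()
    # A per-operator baseline would need n_ops host launches; the batched
    # program needs exactly one.
    print(f"{n_ops} operator commands executed with a single host launch")
```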
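
The zero-copy claim of the third contribution can be illustrated in a similar spirit. The sketch below shares one underlying buffer between a producer and a consumer through a segment table and memoryview slices, so a handoff is an address remap rather than a byte copy; the paper's segment-page fusion operates in the accelerator's address translation and is only loosely suggested by the bookkeeping here. SegmentTable and its methods are hypothetical names.

```python
# Minimal zero-copy sketch: one physical buffer is carved into named
# segments, and every consumer receives a memoryview into the SAME bytes.

class SegmentTable:
    """Maps named segments to (offset, length) ranges in one shared buffer."""

    def __init__(self, size: int):
        self._buf = bytearray(size)
        self._segments = {}
        self._next = 0

    def allocate(self, name: str, length: int) -> memoryview:
        # Reserve a contiguous segment; only bookkeeping, no data copied.
        offset = self._next
        self._segments[name] = (offset, length)
        self._next += length
        return memoryview(self._buf)[offset:offset + length]

    def view(self, name: str) -> memoryview:
        # A second consumer gets a view of the same bytes: zero-copy sharing.
        offset, length = self._segments[name]
        return memoryview(self._buf)[offset:offset + length]


if __name__ == "__main__":
    table = SegmentTable(1 << 20)
    producer = table.allocate("kv_cache", 16)
    producer[:4] = b"\x01\x02\x03\x04"
    consumer = table.view("kv_cache")
    assert consumer[:4] == b"\x01\x02\x03\x04"  # same memory, no copy made
    print("consumer observed producer's write without any copy")
```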
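
Finally, a minimal sketch of the fast-loading idea, assuming a flat binary weight file: verify the checksum once and cache the digest so later startups skip the redundant verification pass, and memory-map the weights so pages are faulted in lazily instead of being copied into a fresh buffer during initialization. The file layout, cache name, and load_weights helper are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical fast-loading sketch: one-time checksum caching plus mmap'd
# weights, so repeat startups avoid both the full read and the hash pass.

import hashlib
import json
import os

import numpy as np


def load_weights(path, shape, dtype=np.float32):
    """Map weights into memory, verifying the checksum only on first load."""
    cache = path + ".sha256.json"
    if not os.path.exists(cache):
        # First load: pay the full-file hash once and cache the digest.
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        with open(cache, "w") as f:
            json.dump({"sha256": digest}, f)
    # Memory-map the file: pages are faulted in on first use instead of
    # being copied into a freshly allocated buffer during initialization.
    return np.memmap(path, dtype=dtype, mode="r", shape=shape)


if __name__ == "__main__":
    np.zeros((1024, 1024), dtype=np.float32).tofile("demo_weights.bin")
    w = load_weights("demo_weights.bin", shape=(1024, 1024))
    print(w[0, :4])
```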

Keywords

Transformer inference optimization / Three-tier scheduling / Zero-copy memory management / Fast model loading

Cite this article

Yulong ZHAO, Chunzhi WU, Yizhuo WANG, Lufei ZHANG, Yaguang ZHANG, Wenyuan SHEN, Hao FAN, Hankang FANG, Yi QIN, Xin LIU. Minimizing transformer inference overhead using controlling element on Shenwei AI accelerator. Front. Inform. Technol. Electron. Eng., 2025, 26(4): 605-622. DOI: 10.1631/FITEE.2400453


RIGHTS & PERMISSIONS

Zhejiang University Press
