Accelerating BERT inference with GPU-efficient exit prediction

Lei LI; Chengyu WANG; Minghui QIU; Cen CHEN; Ming GAO; Aoying ZHOU

Front. Comput. Sci., 2024, Vol. 18, Issue 3: 183308. DOI: 10.1007/s11704-022-2341-9

Artificial Intelligence

RESEARCH ARTICLE

Abstract

BERT is a representative pre-trained language model that has drawn extensive attention for its significant improvements on downstream Natural Language Processing (NLP) tasks. Its complex architecture and massive number of parameters give BERT competitive performance but also make inference slow. To speed up BERT inference, FastBERT realizes adaptive inference with an acceptable drop in accuracy, based on knowledge distillation and the early-exit technique. However, several factors may limit the performance of FastBERT, such as a teacher classifier that is not knowledgeable enough, batch size shrinkage, and redundant computation in the student classifiers. To overcome these limitations, we propose a new BERT inference method with GPU-Efficient Exit Prediction (GEEP). GEEP uses a shared exit loss to reduce the training process of FastBERT from two steps to one, and makes the teacher classifier more knowledgeable by feeding it diverse Transformer outputs. In addition, we propose an exit layer prediction technique that uses a GPU hash table to store the token-level exit layer distribution and sorts test samples by their predicted exit layers. In this way, GEEP avoids batch size shrinkage and redundant computation in the student classifiers. Experimental results on twelve public English and Chinese NLP datasets demonstrate the effectiveness of the proposed approach. The source code of GEEP will be released to the public upon paper acceptance.
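The exit layer prediction idea can be illustrated with a minimal sketch. It is not the authors' implementation: an ordinary Python dictionary stands in for the GPU hash table, the function names are hypothetical, and the rule for combining token-level distributions into one prediction (summing counts and taking the most frequent layer, with ties going to the deeper layer) is an assumption, since the abstract does not specify it.

```python
from collections import Counter, defaultdict

NUM_LAYERS = 12  # assumption: depth of a BERT-base encoder


def build_token_table(train_samples):
    """Map each token to a Counter of exit layers seen for samples containing it.

    train_samples: iterable of (tokens, exit_layer) pairs, where exit_layer is
    the layer at which an early-exit model stopped for that sample.
    """
    table = defaultdict(Counter)
    for tokens, exit_layer in train_samples:
        for tok in set(tokens):  # count each token once per sample
            table[tok][exit_layer] += 1
    return table


def predict_exit_layer(tokens, table, default=NUM_LAYERS):
    """Predict the exit layer of a sample from its tokens.

    Assumed rule: sum the per-token counts and pick the most frequent layer;
    ties go to the deeper layer. Samples with no known token use `default`.
    """
    votes = Counter()
    for tok in tokens:
        if tok in table:
            votes.update(table[tok])
    if not votes:
        return default
    return max(votes, key=lambda layer: (votes[layer], layer))


def make_batches(samples, table, batch_size):
    """Sort samples by predicted exit layer, then cut fixed-size batches.

    Returns a list of (depth, sample_indices); depth is the deepest predicted
    exit layer in the batch, i.e. how many layers that batch would run.
    """
    scored = sorted(
        (predict_exit_layer(tokens, table), i) for i, tokens in enumerate(samples)
    )
    batches = []
    for start in range(0, len(scored), batch_size):
        chunk = scored[start:start + batch_size]
        depth = max(layer for layer, _ in chunk)
        batches.append((depth, [i for _, i in chunk]))
    return batches


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    train = [
        (["good", "movie"], 2),
        (["good", "plot"], 2),
        (["not", "sure", "about", "plot"], 8),
        (["not", "bad"], 8),
    ]
    table = build_token_table(train)
    test = [["not", "sure"], ["good", "movie"], ["unknown"], ["good", "plot"]]
    for depth, indices in make_batches(test, table, batch_size=2):
        print(depth, indices)
```

Because samples with similar predicted exit layers end up in the same batch, a batch can run a fixed number of layers at full size, which is the intuition behind avoiding batch size shrinkage and skipping student classifiers at intermediate layers.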

Keywords

BERT / FastBERT / inference acceleration / model distillation / early exit / text classification

Cite this article

Lei LI, Chengyu WANG, Minghui QIU, Cen CHEN, Ming GAO, Aoying ZHOU. Accelerating BERT inference with GPU-efficient exit prediction. Front. Comput. Sci., 2024, 18(3): 183308 DOI:10.1007/s11704-022-2341-9
