Towards optimized tensor code generation for deep learning on Sunway many-core processor

Mingzhen LI, Changxi LIU, Jianjin LIAO, Xuegui ZHENG, Hailong YANG, Rujun SUN, Jun XU, Lin GAN, Guangwen YANG, Zhongzhi LUAN, Depei QIAN

Front. Comput. Sci., 2024, Vol. 18, Issue 2: 182101. DOI: 10.1007/s11704-022-2440-7
Architecture
RESEARCH ARTICLE

Abstract

The flourishing of deep learning frameworks and hardware platforms has been demanding an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the existing deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. Meanwhile, the Sunway many-core processor renders itself a competitive candidate for its attractive computational power in both scientific computing and deep learning workloads. This paper combines these two trends. Specifically, we propose swTVM, which extends the original TVM to support ahead-of-time (AOT) compilation for architectures that require cross-compilation, such as Sunway. In addition, we leverage architectural features during compilation, such as the core group for massive parallelism, DMA for high-bandwidth memory transfer, and local device memory (LDM) for data locality, in order to generate efficient code for deep learning workloads on Sunway. The experimental results show that the code generated by swTVM achieves a 1.79× improvement in inference latency on average compared to the state-of-the-art deep learning framework on Sunway, across eight representative benchmarks. This work is the first attempt from the compiler perspective to bridge the gap between deep learning and the Sunway processor, with both productivity and efficiency in mind. We believe this work will encourage more people to embrace the power of deep learning and the Sunway many-core processor.
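To make the architectural mapping concrete, below is a minimal illustrative sketch, not code taken from swTVM, of the kind of slave-core (CPE) kernel such an AOT backend would emit. It uses the athread interface as commonly described in the Sunway SW26010 programming literature (slave.h, _MYID, athread_get/athread_put); exact signatures and reply-counter semantics vary across toolchain versions, and the names TILE, args_t, and vec_add_slave are hypothetical.

/* Illustrative slave-core (CPE) kernel in the style an AOT backend for
   Sunway might emit. Assumes the athread slave-side interface described
   in SW26010 programming guides; TILE, args_t, and vec_add_slave are
   hypothetical names. */
#include <slave.h>

#define TILE 512  /* elements per DMA tile, sized to fit in the 64 KB LDM */
#define NCPE 64   /* CPEs per core group */

typedef struct { float *a, *b, *c; int n; } args_t;

/* __thread_local places these buffers in the CPE's local device memory */
__thread_local float a_ldm[TILE];
__thread_local float b_ldm[TILE];
__thread_local float c_ldm[TILE];

void vec_add_slave(args_t *arg) {
    int id = _MYID;                  /* this CPE's index within the core group */
    int per_cpe = arg->n / NCPE;     /* static partition; assumes n % (NCPE*TILE) == 0 */
    for (int base = id * per_cpe; base < (id + 1) * per_cpe; base += TILE) {
        volatile int get_reply = 0;
        /* DMA the input tiles from main memory into LDM */
        athread_get(PE_MODE, arg->a + base, a_ldm, TILE * sizeof(float),
                    (void *)&get_reply, 0, 0, 0);
        athread_get(PE_MODE, arg->b + base, b_ldm, TILE * sizeof(float),
                    (void *)&get_reply, 0, 0, 0);
        while (get_reply != 2);      /* each completed transfer bumps the counter */

        for (int i = 0; i < TILE; i++)   /* compute entirely out of LDM */
            c_ldm[i] = a_ldm[i] + b_ldm[i];

        volatile int put_reply = 0;
        /* DMA the result tile back to main memory */
        athread_put(PE_MODE, c_ldm, arg->c + base, TILE * sizeof(float),
                    (void *)&put_reply, 0, 0);
        while (put_reply != 1);
    }
}

On the management processing element (MPE), the corresponding host code would launch this kernel across the 64 CPEs of a core group with athread_spawn and wait with athread_join; emitting both host and slave sources ahead of time is what allows them to be cross-compiled for Sunway.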

Keywords

Sunway processor / deep learning compiler / code generation / performance optimization

Cite this article

Mingzhen LI, Changxi LIU, Jianjin LIAO, Xuegui ZHENG, Hailong YANG, Rujun SUN, Jun XU, Lin GAN, Guangwen YANG, Zhongzhi LUAN, Depei QIAN. Towards optimized tensor code generation for deep learning on Sunway many-core processor. Front. Comput. Sci., 2024, 18(2): 182101. https://doi.org/10.1007/s11704-022-2440-7

Mingzhen Li is a PhD student in the School of Computer Science and Engineering, Beihang University, China. He is currently working on identifying performance opportunities for scientific applications. His research interests include deep learning systems, performance optimization, and code generation

Changxi Liu is a PhD student at the National University of Singapore, Singapore. He received his Master’s and Bachelor’s degrees in computer science from Beihang University, China. His research interests include simulation, compilers, high-performance computing, and computer architecture exploration

Jianjin Liao is a Master’s student in the School of Computer Science and Engineering, Beihang University, China. He is currently working on performance optimization for scientific applications and deep learning. His research interests include performance optimization and deep learning compilation optimization

Xuegui Zheng is a Master’s student in the School of Computer Science and Engineering, Beihang University, China. His research interests include compilers and performance optimization. He received his Bachelor’s degree in computer science and technology from Fuzhou University, China

Hailong Yang is an associate professor in the School of Computer Science and Engineering, Beihang University, China. He received his PhD from the School of Computer Science and Engineering, Beihang University, China, in 2014. His research interests include parallel and distributed computing, HPC, performance optimization, and energy efficiency

Rujun Sun is a PhD student in the State Key Laboratory of Mathematical Engineering and Advanced Computing, China. Her research interests include high performance computing, deep learning, and computation models

Jun Xu is a senior engineer in the Beijing Simulation Center of the Second Institute of CASIC, China. She received her PhD in computer science and technology from Zhejiang University, China, in 2011. Her research interest is the modeling and simulation of weapon equipment systems

Lin Gan is an assistant researcher in the Department of Computer Science and Technology at Tsinghua University, China, and the assistant director of the National Supercomputing Center in Wuxi, China. His research interests include high performance computing solutions based on hybrid platforms such as GPUs, FPGAs, and Sunway CPUs. He received his PhD in computer science from Tsinghua University, China. He received the 2016 ACM Gordon Bell Prize, was a 2017 ACM Gordon Bell Prize finalist, and received the 2018 IEEE-CS TCHPC Early Career Researchers Award for Excellence in HPC and the Most Significant Paper Award in 25 Years at FPL 2015, among others. He is a member of IEEE

Guangwen Yang is a professor in the Department of Computer Science and Technology at Tsinghua University, China, and the director of the National Supercomputing Center in Wuxi, China. His research interests include parallel algorithms, cloud computing, and earth system modeling. He received his PhD in computer science from Tsinghua University, China. He received the ACM Gordon Bell Prize in 2016 and 2017, and the Most Significant Paper Award in 25 Years at FPL 2015, among others. He is a member of IEEE

Zhongzhi Luan received his PhD from the School of Computer Science of Xi’an Jiaotong University, China. He is an Associate Professor of Computer Science and Engineering and the Assistant Director of the Sino-German Joint Software Institute (JSI) Laboratory at Beihang University, China. Since 2003, his research interests have included distributed computing, parallel computing, grid computing, HPC, and the new generation of network technology

Depei Qian is a professor at the Department of Computer Science and Engineering, Beihang University, China. He received his Master’s degree from the University of North Texas, USA, in 1984. He is currently serving as the chief scientist of the China National High Technology Program (863 Program) on high productivity computer and service environment. He is also a fellow of the China Computer Federation (CCF). His research interests include innovative technologies in distributed computing, high performance computing, and computer architecture

Acknowledgements

This work was supported by the National Key Research and Development Program of China (No. 2020YFB1506703), the National Natural Science Foundation of China (Grant Nos. 62072018 and 61732002), the State Key Laboratory of Software Development Environment (No. SKLSDE-2021ZX-06), and the Fundamental Research Funds for the Central Universities.

RIGHTS & PERMISSIONS

© 2024 Higher Education Press