AcOrch: accelerating sampling-based GNN training under CPU-NPU heterogeneous environments

Kefu CHEN, Xin AI, Qiange WANG, Yanfeng ZHANG, Ge YU

Front. Comput. Sci., 2027, 21(5): 2105103. DOI: 10.1007/s11704-025-50893-0

Architecture
RESEARCH ARTICLE

Abstract

Graph Neural Networks (GNNs) have achieved remarkable success in various applications. Sampling-based GNN training, which conducts mini-batch training on sampled subgraphs, has become a promising solution for large-scale graphs. Given the resource-intensive nature of sampling-based GNN training, Neural Processing Units (NPUs), such as the Ascend AI processor, offer a compelling platform thanks to their high throughput and energy efficiency, which make them well suited to GNN workloads. However, sampling-based training is a multi-stage process involving subgraph sampling, feature gathering, and model training, and each stage has different resource requirements and computation volumes. Fully utilizing the heterogeneous compute resources of CPUs and NPUs therefore requires careful coordination. In this work, we present AcOrch, a sampling-based GNN training system optimized for CPU-NPU heterogeneous platforms. AcOrch offers fine-grained task orchestration and adopts a two-level pipelined execution model to overlap sampling, gathering, and training. It analyzes the heterogeneous compute features of NPUs and maps tasks to AI Cube (AIC) units, AI Vector (AIV) units, and CPU cores accordingly. Moreover, the two-level pipeline enables overlapping execution not only between the CPU and NPU but also among different types of compute units within the NPU (e.g., AIC and AIV units), thereby maximizing the utilization of available resources. Experiments on an Ascend 910B AI processor show that AcOrch achieves an average speedup of 2.31× over the state-of-the-art NPU-native graph learning system, MindSporeGL.
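To make the execution model concrete, the sketch below shows how the three stages can be overlapped with bounded queues, in the spirit of the first pipeline level described above. This is a minimal illustration, not AcOrch's implementation: all names (sample_subgraph, gather_features, train_step, run_pipeline) are hypothetical placeholders rather than AcOrch or MindSpore APIs, and the placeholder bodies stand in for real graph, feature, and NPU work.

import threading
import queue

_SENTINEL = object()  # marks the end of the mini-batch stream

def sample_subgraph(seeds):
    # Placeholder CPU sampler; a real system walks the graph topology here.
    return {"seeds": seeds}

def gather_features(subgraph):
    # Placeholder feature gather; a real system slices the feature table here.
    return [0.0] * len(subgraph["seeds"])

def train_step(subgraph, feats):
    # Placeholder training step; a real system launches NPU kernels here.
    pass

def sampling_stage(batches, sample_q):
    for seeds in batches:
        sample_q.put(sample_subgraph(seeds))
    sample_q.put(_SENTINEL)

def gathering_stage(sample_q, feat_q):
    while (sg := sample_q.get()) is not _SENTINEL:
        feat_q.put((sg, gather_features(sg)))
    feat_q.put(_SENTINEL)

def run_pipeline(batches, depth=2):
    # Bounded queues fix the pipeline depth so sampling and gathering run
    # ahead of training without unbounded buffering.
    sample_q = queue.Queue(maxsize=depth)
    feat_q = queue.Queue(maxsize=depth)
    workers = [
        threading.Thread(target=sampling_stage, args=(batches, sample_q)),
        threading.Thread(target=gathering_stage, args=(sample_q, feat_q)),
    ]
    for w in workers:
        w.start()
    while (item := feat_q.get()) is not _SENTINEL:
        sg, feats = item
        train_step(sg, feats)  # overlaps with sampling/gathering of later batches
    for w in workers:
        w.join()

if __name__ == "__main__":
    run_pipeline([[0, 1], [2, 3], [4, 5]])

The second pipeline level, which overlaps AI Cube and AI Vector units inside the NPU, follows the same producer-consumer idea at kernel granularity; it is omitted here because it would depend on Ascend-specific stream and kernel-dispatch APIs.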

Keywords

GNN training / Ascend AI processor / task orchestration

Cite this article

Kefu CHEN, Xin AI, Qiange WANG, Yanfeng ZHANG, Ge YU. AcOrch: accelerating sampling-based GNN training under CPU-NPU heterogeneous environments. Front. Comput. Sci., 2027, 21(5): 2105103. DOI: 10.1007/s11704-025-50893-0



© Higher Education Press