Exploiting synchronization-aware transformer for aligning large-scale MPI traces

Zhibo XUAN, Xin YOU, Hailong YANG, Haoran KONG, Jingqi CHEN, Tianyu FENG, Zhongzhi LUAN, Yi LIU, Depei QIAN

Front. Comput. Sci. ›› 2027, Vol. 21 ›› Issue (5) : 2105104

DOI: 10.1007/s11704-026-51562-6
Architecture
RESEARCH ARTICLE

Abstract

In large-scale parallel programs, intricate software behavior and complex hardware architectures leave the clocks of different nodes unsynchronized, so the traces collected on each node are misaligned along the timeline, a problem known as time skew. This misalignment hampers analyses along the temporal dimension and makes it difficult to optimize the performance of parallel programs effectively. Furthermore, aligning timelines across a massive number of processes imposes a substantial computational burden, rendering existing solvers inadequate in massively parallel scenarios. In this paper, we propose TLBERT, a novel approach to timeline alignment that combines machine learning techniques with a well-designed training methodology. Based on TLBERT, we implement STAR, a Large-Scale Synced Trace Timeline Aligner, to tackle the time skew problem for large-scale parallel programs. Experimental results demonstrate that STAR aligns the timelines of traces from large-scale programs with minimal overhead, effectively mitigating the time skew problem.

Keywords

large-scale parallel programs / MPI traces / timeline alignment / time skew / transformer

Cite this article

Zhibo XUAN, Xin YOU, Hailong YANG, Haoran KONG, Jingqi CHEN, Tianyu FENG, Zhongzhi LUAN, Yi LIU, Depei QIAN. Exploiting synchronization-aware transformer for aligning large-scale MPI traces. Front. Comput. Sci., 2027, 21(5): 2105104. DOI: 10.1007/s11704-026-51562-6



RIGHTS & PERMISSIONS

Higher Education Press
