Exploiting synchronization-aware transformer for aligning large-scale MPI traces
Zhibo XUAN, Xin YOU, Hailong YANG, Haoran KONG, Jingqi CHEN, Tianyu FENG, Zhongzhi LUAN, Yi LIU, Depei QIAN
Front. Comput. Sci., 2027, Vol. 21, Issue (5): 2105104
For large-scale parallel programs, intricate software behavior and complex hardware architecture lead to unsynchronized clocks across nodes, so the traces collected on each node are misaligned along the timeline, a problem known as time skew. This misalignment hampers analyses along the temporal dimension and makes it difficult to optimize the performance of parallel programs effectively. Furthermore, aligning timelines across a massive number of processes imposes a substantial computational burden, rendering existing solvers inadequate in massively parallel scenarios. In this paper, we propose TLBERT, a novel approach for timeline alignment that incorporates machine learning techniques with a well-designed training methodology. Based on TLBERT, we implement STAR, a Large-Scale Synced Trace Timeline Aligner that tackles the time skew problem for large-scale parallel programs. Experimental results demonstrate that STAR achieves timeline alignment for traces of large-scale programs with minimal overhead, effectively mitigating the time skew problem.
large-scale parallel programs / MPI traces / timeline alignment / time skew / transformer
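To make the time skew problem concrete, the following minimal sketch shows how unsynchronized clocks produce traces in which a message appears to be received before it was sent, and how a per-rank clock offset can restore causal order. It is not TLBERT or STAR: it uses a plain constraint-relaxation pass as a stand-in, and the names `Message`, `causality_violations`, `estimate_offsets`, and `min_latency` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Message:
    src: int        # sending rank
    dst: int        # receiving rank
    send_ts: float  # timestamp taken on the sender's local clock
    recv_ts: float  # timestamp taken on the receiver's local clock


def causality_violations(messages, offsets=None, min_latency=0.0):
    """Return messages whose receive time precedes send time plus latency."""
    offsets = offsets or {}
    bad = []
    for m in messages:
        send = m.send_ts + offsets.get(m.src, 0.0)
        recv = m.recv_ts + offsets.get(m.dst, 0.0)
        if recv < send + min_latency:
            bad.append(m)
    return bad


def estimate_offsets(messages, n_ranks, min_latency=0.0):
    """Shift receiver clocks forward until every message respects causality.

    Illustrative relaxation over the constraints
    recv_ts + offset[dst] >= send_ts + offset[src] + min_latency.
    It assumes the constraints are satisfiable.
    """
    offsets = {r: 0.0 for r in range(n_ranks)}
    for _ in range(n_ranks):
        changed = False
        for m in messages:
            need = (m.send_ts + offsets[m.src] + min_latency) - (
                m.recv_ts + offsets[m.dst]
            )
            if need > 0:
                offsets[m.dst] += need
                changed = True
        if not changed:
            break
    return offsets


# Rank 1's clock runs behind rank 0's, so the first message looks
# as if it arrived (t=6) before it was sent (t=10).
trace = [
    Message(src=0, dst=1, send_ts=10.0, recv_ts=6.0),
    Message(src=1, dst=0, send_ts=7.0, recv_ts=13.0),
]
skewed = causality_violations(trace)
offsets = estimate_offsets(trace, n_ranks=2)
remaining = causality_violations(trace, offsets)
```

A relaxation like this revisits every message on every pass, which hints at why solver-style approaches become expensive as the number of processes and messages grows.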
Higher Education Press