Abstract
Developing parallel applications on heterogeneous processors faces the 'memory wall' challenge: limited local-store capacity, limited memory bandwidth, and long memory-access latency. To address this problem, a parallelization approach with six memory optimization schemes was proposed for CG, four of which target sparse matrix-vector multiplication (SPMV) operations of all kinds. On the IBM QS20, the approach achieves speedups of up to 21 and 133 times for problem sizes A and B, respectively, compared with a single power processor element. Finally, the conclusions are drawn that the peak memory-access bandwidth of the Cell BE can be reached in SPMV, that simple computation is more efficient on heterogeneous processors, and that loop unrolling can hide local-store access latency when executing scalar operations on SIMD cores.
Keywords
multi-core processor / NAS parallelization / CG / memory optimization
Cite this article
Lin Deng, Yong Dou. Multi-core optimization for conjugate gradient benchmark on heterogeneous processors. Journal of Central South University, 2011, 18(2): 490-498. DOI: 10.1007/s11771-011-0722-6