Programming for scientific computing on peta-scale heterogeneous parallel systems

Can-qun Yang, Qiang Wu, Tao Tang, Feng Wang, Jing-ling Xue

Journal of Central South University ›› 2013, Vol. 20 ›› Issue (5): 1189-1203. DOI: 10.1007/s11771-013-1602-z
Abstract

Peta-scale high-performance computing systems are increasingly built with heterogeneous CPU and GPU nodes to achieve higher power efficiency and computation throughput. While providing unprecedented capabilities to conduct computational experiments of historic significance, these systems are presently difficult to program. Their users, who are domain experts rather than computer experts, prefer programming models closer to their domains (e.g., physics and biology) over MPI and OpenMP. This has led to the development of domain-specific programming models that provide domain-oriented interfaces while abstracting away some performance-critical architecture details. Based on experience in designing large-scale computing systems, a hybrid programming framework for scientific computing on heterogeneous architectures is proposed in this work. Its design philosophy is to provide a collaborative mechanism for domain experts and computer experts so that both domain-specific knowledge and performance-critical architecture details can be adequately exploited. Two real-world scientific applications have been evaluated on TH-1A, a peta-scale CPU-GPU heterogeneous system that is currently the fifth fastest supercomputer in the world. The experimental results show that the proposed framework is well suited for developing large-scale scientific computing applications on peta-scale heterogeneous CPU/GPU systems.
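The division of labor the abstract describes can be illustrated with a small sketch. The CUDA code below is not the paper's framework interface; it is a minimal, hypothetical example in the molecular-dynamics setting named by the keywords. A computer expert owns a hand-tuned force kernel, while a domain expert calls a plain wrapper (compute_forces is an assumed name) that hides all GPU launch details.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hand-tuned layer (the "computer expert" side): a brute-force
// Lennard-Jones 12-6 force kernel with a radial cutoff. Production MD
// codes replace the O(n^2) inner loop with cell or neighbor lists.
__global__ void lj_force_kernel(const float4* pos, float4* force,
                                int n, float cutoff2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = pos[i];
    float fx = 0.f, fy = 0.f, fz = 0.f;
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float4 pj = pos[j];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r2 = dx * dx + dy * dy + dz * dz;
        if (r2 >= cutoff2) continue;
        float inv2 = 1.f / r2;
        float inv6 = inv2 * inv2 * inv2;                    // 1/r^6
        float s = 24.f * inv6 * (2.f * inv6 - 1.f) * inv2;  // LJ 12-6, sigma = eps = 1
        fx += s * dx; fy += s * dy; fz += s * dz;
    }
    force[i] = make_float4(fx, fy, fz, 0.f);
}

// Domain-facing layer (the "domain expert" side): a plain function call
// with no grid/block arithmetic or other CUDA details leaking out.
void compute_forces(const float4* d_pos, float4* d_force, int n, float cutoff) {
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    lj_force_kernel<<<blocks, threads>>>(d_pos, d_force, n, cutoff * cutoff);
}

int main() {
    const int n = 4096;
    std::vector<float4> h_pos(n);
    for (int i = 0; i < n; ++i)   // particles on a 16x16x16 cubic lattice
        h_pos[i] = make_float4(float(i % 16), float((i / 16) % 16),
                               float(i / 256), 0.f);

    float4 *d_pos = nullptr, *d_force = nullptr;
    cudaMalloc(&d_pos, n * sizeof(float4));
    cudaMalloc(&d_force, n * sizeof(float4));
    cudaMemcpy(d_pos, h_pos.data(), n * sizeof(float4), cudaMemcpyHostToDevice);

    compute_forces(d_pos, d_force, n, 2.5f);   // cutoff of 2.5 lattice units

    std::vector<float4> h_force(n);
    cudaMemcpy(h_force.data(), d_force, n * sizeof(float4), cudaMemcpyDeviceToHost);
    printf("force on particle 0: (%g, %g, %g)\n",
           h_force[0].x, h_force[0].y, h_force[0].z);

    cudaFree(d_pos);
    cudaFree(d_force);
    return 0;
}
```

In a framework of the kind the paper proposes, such a wrapper layer could additionally hide MPI communication and CPU/GPU work partitioning, so that domain code remains unchanged across machines like TH-1A.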

Keywords

heterogeneous parallel system / programming framework / scientific computing / GPU computing / molecular dynamics

Cite this article

Can-qun Yang, Qiang Wu, Tao Tang, Feng Wang, Jing-ling Xue. Programming for scientific computing on peta-scale heterogeneous parallel systems. Journal of Central South University, 2013, 20(5): 1189-1203. DOI: 10.1007/s11771-013-1602-z


