Improving vertex-frontier based GPU breadth-first search

Bo Yang; Kai Lu; Ying-hui Gao; Kai Xu; Xiao-ping Wang; Zhi-quan Cheng

doi:10.1007/s11771-014-2368-7

Journal of Central South University ›› 2014, Vol. 21 ›› Issue (10) :3828 -3836. DOI: 10.1007/s11771-014-2368-7

Article

Improving vertex-frontier based GPU breadth-first search

Bo Yang ¹^,²
, Kai Lu ¹^,²^,^b
, Ying-hui Gao ³
, Kai Xu ¹^,²
, Xiao-ping Wang ¹^,²
, Zhi-quan Cheng ⁴

Author information +

History +

PDF

Abstract

Breadth-first search (BFS) is an important kernel for graph traversal and has been used by many graph processing applications. Extensive studies have been devoted in boosting the performance of BFS. As the most effective solution, GPU-acceleration achieves the state-of-the-art result of 3.3×10⁹ traversed edges per second on a NVIDIA Tesla C2050 GPU. A novel vertex frontier based GPU BFS algorithm is proposed, and its main features are three-fold. Firstly, to obtain a better workload balance for irregular graphs, a virtual-queue task decomposition and mapping strategy is introduced for vertex frontier expanding. Secondly, a global deduplicate detection scheme is proposed to remove reduplicative vertices from vertex frontier effectively. Finally, a GPU-based bottom-up BFS approach is employed to process large frontier. The experimental results demonstrate that the algorithm can achieve 10% improvement over the state-of-the-art method on diverse graphs. Especially, it exhibits 2–3 times speedup on low-diameter and scale-free graphs over the state-of-the-art on a NVIDIA Tesla K20c GPU, reaching a peak traversal rate of 11.2×10⁹ edges/s.

Keywords

breadth-first search / GPU / graph traversal / vertex frontier

Cite this article

Download citation ▾

Bo Yang, Kai Lu, Ying-hui Gao, Kai Xu, Xiao-ping Wang, Zhi-quan Cheng. Improving vertex-frontier based GPU breadth-first search. Journal of Central South University, 2014, 21 (10) : 3828-3836 DOI:10.1007/s11771-014-2368-7

登录浏览全文

4963

注册一个新账户忘记密码

References

Publishing order | Descend order by publishing year | Descend order by cited within

[1]	ZerbinoD R, VelvetB E. Algorithms for de Novo short read assembly using de Bruijn graphs [J]. Genome Research, 2008, 18(5): 821-829

[2]	BakosJ D. High-performance heterogeneous computing with the convey HC-1 [J]. Computing in Science & Engineering, 2010, 12(6): 80-87

[3]	MalewiczG, AusternM H, BikA J C, DehnertJ C, HornI, LeiserN, PregelC G. A system for large-scale graph processing [C]. Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010, USA, ACM Press: 135-146

[4]	KwakH, LeeC, ParkH, MoonS. What is twitter, a social network or a news media [C]. Proceedings of the 19th International Conference on World Wide Web, 2010, USA, ACM Press: 591-600

[5]	StrattonJ A, RodriguesC, SungI J, ObeidN, ChangL W, AnssariN, LiuG D, HwuW M WParboil: A revised benchmark suite for scientific and commercial throughput computing [R], 2012, Illinois, Urbana, Center for Reliable and High-Performance Computing

[6]	Graph 500 Steering Committee.The Graph 500 List [EB/OL]. [2013-08-15], 2013

[7]	AgarwalV, PetriniF, PasettoD, BaderD A. Scalable graph exploration on multicore processors [C]. Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010, USA, IEEE Computer Society: 1-11

[8]	GaoT, LuY, ZhangB, SuoG. Using MIC to accelerate a typical data-intensive application: The breadth-first search [C]. Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International, 2013, USA, IEEE Computer Society: 1117-1125

[9]	HongS, KimS K, OguntebiT, OlukotunK. Accelerating CUDA graph algorithms at maximum warp [C]. Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, 2011, USA, ACM Press: 267-276

[10]	ZouD, DouY, GuoS, NiS. High performance sparse matrix-vector multiplication on FPGA [J]. IEICE Electronics Express, 2013, 10(17): 20130529

[11]	YangC-q, WuQ, HuH-l, ShiZ-c, ChenJ, TangTao. Fast weighting method for plasma PIC simulation on GPU-accelerated heterogeneous systems [J]. Journal of Central South University, 2013, 20(6): 1527-1535

[12]	TicknerJ. Monte Carlo simulation of X-ray and gamma-ray photon transport on a graphics-processing unit [J]. Computer Physics Communications, 2010, 181(11): 1821-1832

[13]	HarishP, NarayananP JAccelerating large graph algorithms on the GPU using CUDA [M], 2007, Berlin, Springer: 197-208

[14]	LuoL, WongM, HwuW. An effective GPU implementation of breadth-first search [C]. Proceedings of the 47th Design Automation Conference, 2010, USA, ACM Press: 52-55

[15]	MerrillD, GarlandM, GrimshawA. Scalable GPU graph traversal [C]. Proceedings of the 17th ACM Symposium on Principles and Practice of Parallel Programming, 2012, USA, ACM Press: 117-128

[16]	BaderD A, MadduriK. SNAP, small-world network analysis and partitioning: An open-source parallel graph framework for the exploration of large-scale networks [C]. Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, 2008, USA, IEEE Computer Society: 1-12

[17]	NvidiaCNVIDIA’s next generation CUDA compute architecture: kepler GK110 [EB/OL], 2013

[18]	LeisersonC E, RivestR L, SteinCIntroduction to algorithms [M], 2001, Massachusetts, The MIT Press: 534-535

[19]	NvidiaCCompute unified device architecture programming guide [M], 2010, Santa Clara, NVIDIA Corporation: 3-5

[20]	BlellochG EPrefix sums and their applications [R], 1990, Pittsburgh, Carnegie Mellon University

[21]	BeamerS, AsanovicK, PattersonD. Direction-optimizing breadth-first search [C]. High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for, 2012, USA, IEEE Computer Society: 1-10