iGraph: an incremental data processing system for dynamic graph
Wuyang JU, Jianxin LI, Weiren YU, Richong ZHANG
iGraph: an incremental data processing system for dynamic graph
With the popularity of social network, the demand for real-time processing of graph data is increasing. However, most of the existing graph systems adopt a batch processing mode, therefore the overhead of maintaining and processing of dynamic graph is significantly high. In this paper, we design iGraph, an incremental graph processing system for dynamic graph with its continuous updates. The contributions of iGraph include: 1) a hash-based graph partition strategy to enable fine-grained graph updates; 2) a vertexbased graph computing model to support incremental data processing; 3) detection and rebalance methods of hotspot to address the workload imbalance problem during incremental processing. Through the general-purpose API, iGraph can be used to implement various graph processing algorithms such as PageRank. We have implemented iGraph on Apache Spark, and experimental results show that for real life datasets, iGraph outperforms the original GraphX in respect of graph update and graph computation.
big data / distributed system / in-memory computing / graph processing / hotspot detection
[1] |
Shao Y, Cui B, Ma L. PAGE: a partition aware engine for parallel graph computation. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(2): 518–530
|
[2] |
Malewicz G, Austern M H, Bik A J C, Dehnert J C, Horn I, Leiser N, Czajkowski G. Pregel: a system for large-scale graph processing. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. 2010, 135–146
|
[3] |
Salihoglu S, Widom J. GPS: a graph processing system. In: Proceedings of the 25th International Conference on Scientific and Statistical Database Management. 2013, 22:1–22:12
|
[4] |
Power R, Li J Y. Piccolo: building fast, distributed programs with partitioned tables. In: Proceedings of USENIX Symposium on Operating Systems Design and Implementation. 2010, 293–306
|
[5] |
Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein J M. Graphlab: a new framework for parallel machine learning. In: Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence. 2010, 340–349
|
[6] |
Pearce R A, Gokhale M, Amato N M. Multithreaded asynchronous graph traversal for in-memory and semi-external memory. In: Proceedings of the ACM/IEEE International Conference for High Performance Computing Networking, Storage and Analysis. 2010, 1–11
|
[7] |
Kang U, Tsourakakis C E, Faloutsos C. PEGASUS: a peta-scale graph mining system. In: Proceedings of the 9th IEEE International Conference on Data Mining. 2009, 229–238
|
[8] |
Gonzalez J E, Low Y, Gu H, Bickson D, Guestrin C. Powergraph: distributed graph-parallel computation on natural graphs. In: Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation. 2012, 17–30
|
[9] |
Ching A, Edunov S, Kabiljo M, Logothetis D, Muthukrishnan S. One trillion edges: graph processing at Facebook-scale. Proceedings of the VLDB Endowment, 2015, 8(12): 1804–1815
|
[10] |
Yan D, Cheng J, Lu Y, Ng W. Blogel: a block-centric framework for distributed computation on real-world graphs. Proceedings of the VLDB Endowment, 2014, 7(14): 1981–1992
|
[11] |
Zhang Y, Liao X F, Jin H, Lin L, Lu F. An adaptive switching scheme for iterative computing in the cloud. Frontiers of Computer Science, 2014, 8(6): 872–884
|
[12] |
Zheng X L, Zhong Y G, Zeng D, Wang F Y. Social influence and spread dynamics in social networks. Frontiers of Computer Science, 2012, 6(5): 611–620
|
[13] |
Kumar R, Novak J, Tomkins A. Structure and evolution of online social networks. In: Philip S Y, Han J, Faloutsos C, eds. Link Mining: Models, Algorithms, and Applications. New York: Springer, 2010, 337–357
|
[14] |
Yan D, Cheng J, Lu Y, Ng W. Effective techniques for message reduction and load balancing in distributed graph computation. In: Proceedings of the 24th International Conference on World Wide Web. 2015, 1307–1317
|
[15] |
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauly M, Franklin M J, Shenker S, Stoica I. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation. 2012, 15–28
|
[16] |
Gonzalez J E, Xin R S, Dave A, Crankshaw D, Franklin M J, Stoica I. Graphx: graph processing in a distributed dataflow framework. In: Proceedings of USENIX Symposium on Operating Systems Design and Implementation. 2014, 599–613
|
[17] |
Ma S, Li J, Hu C M, Lin X L, Huai J P. Big graph search: challenges and techniques. Frontiers of Computer Science, 2016, 10(3): 387–398
|
[18] |
Cheatham T, Fahmy A F, Stefanescu D C, Valiant L G. Bulk synchronous parallel computing —a paradigm for transportable software. In: Proceedings of Annual Hawaii International Conference on System Sciences. 1995, 268–275
|
[19] |
Low Y, Gonzalez J, Kyrola A, Bickson D, Guestrin C, Hellerstein J M. Distributed graphlab: a framework for machine learning in the cloud. Proceedings of the VLDB Endowment, 2012, 5(8): 716–727
|
[20] |
Pujol J M, Erramilli V, Siganos G, Yang X, Laoutaris N, Chhabra P, Rodriguez P.The little engine(s) that could: scaling online social networks. ACM SIGCOMM Computer Communication Review, 2011, 41(4): 375–386
|
[21] |
Mondal J, Deshpande A. Managing large dynamic graphs efficiently. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. 2012, 145–156
|
[22] |
Yang S, Yan X, Zong B, Khan A. Towards effective partition management for large graphs. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. 2012, 517–528
|
[23] |
Bu Y, Howe B, Balazinska M, Ernst M D. Haloop: efficient iterative data processing on large clusters. Proceedings of the VLDB Endowment, 2010, 3(1): 285–296
|
[24] |
Logothetis D, Olston C, Reed B, Webb K C, Yocum K. Stateful bulk processing for incremental analytics. In: Proceedings of the 1st ACM Symposium on Cloud Computing. 2010, 51–62
|
[25] |
Popa L, Budiu M, Yu Y, Isard M. Dryadinc: reusing work in largescale computations. In: Proceedings of Workshop on Hot Topics in Cloud Computing. 2009
|
[26] |
Gunda P K, Ravindranath L, Thekkath C A, Yu Y, Zhuang L. Nectar: automatic management of data and computation in datacenters. In: Proceedings of USENIX Symposium on Operating Systems Design and Implementation. 2010, 75–88
|
[27] |
Bhatotia P, Wieder A, Rodrigues R, Acar U A, Pasquin R. Incoop: MapReduce for incremental computations. In: Proceedings of the 2nd ACM Symposium on Cloud Computing. 2011
|
[28] |
Peng D, Dabek F. Large-scale incremental processing using distributed transactions and notifications. In: Proceedings of USENIX Symposium on Operating Systems Design and Implementation. 2010, 251–264
|
[29] |
Murray D G, McSherry F, Isaacs R, Isard M, Barham P, Abadi M. Naiad: a timely dataflow system. In: Proceedings of the 24th ACM Symposium on Operating Systems Principles. 2013, 439–455
|
[30] |
Roy A, Mihailovic I, Zwaenepoel W. X-stream: edge-centric graph processing using streaming partitions. In: Proceedings of the 24th ACM Symposium on Operating Systems Principles. 2013, 472–488
|
[31] |
Kyrola A, Blelloch G E, Guestrin C. Graphchi: large-scale graph computation on just a PC. In: Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation. 2012, 31–46
|
[32] |
Cheng R, Hong J, Kyrola A, Miao Y, Weng X, Wu M, Yang F, Zhou L, Zhao F, Chen E. Kineograph: taking the pulse of a fast-changing and connected world. In: Proceedings of the 7th ACM European Conference on Computer Systems. 2012, 85–98
|
[33] |
Zaharia M, Das T, Li H, Hunter T, Shenker S, Stoica I. Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of the 24th ACM Symposium on Operating Systems Principles. 2013, 423–438
|
[34] |
Çatalyürek Ü V, Aykanat C, Uçar B. On two-dimensional sparse matrix partitioning: models, methods, and a recipe. SIAM Journal on Scientific Computing, 2010, 32(2): 656–683
|
[35] |
Page L, Brin S, Motwani R, Winograd T. The PageRank citation ranking: bringing order to the web. Technical Report. 1999
|
/
〈 | 〉 |