Guaranteeing the response deadline for general aggregation trees
Jiangfan LI, Chendie YAO, Junxu XIA, Deke GUO
It is essential to answer queries within their time deadlines, even if the responses are not exact or complete. To reduce query latency, systems usually partition a large-scale data computation into a series of tasks spread over many processes and combine the task outputs through an aggregation tree. An obstacle is that the processes involved in a query usually differ in speed, so not all of them can complete their tasks in time. This directly degrades the response quality, defined as the number of outputs received by the root of the aggregation tree. In this paper, we propose a general aggregation tree model, Tarot, that maximizes the response quality by systematically addressing the following challenges: (1) fine-grained partitioning of the query deadline along the multi-level aggregation tree; (2) learning the distribution of durations at each level of the aggregation tree to optimize the wait durations at aggregators; (3) adaptively reassigning tasks across processes according to their status; and (4) periodically aggregating the outputs received from the lower level to avoid missing the deadline. Prior models do not consider these four aspects simultaneously. Extensive evaluations indicate that Tarot adapts to multi-level trees and considerably improves the response quality compared with prior work while guaranteeing the query deadline.
aggregation query / performance variations / tasks reassignment
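To make the trade-off behind challenges (1) and (2) concrete, the following is a minimal sketch and not the Tarot algorithm itself. It assumes a two-level tree, independent task and transfer durations, and empirical distributions built from observed samples. All function names and parameters (`empirical_cdf`, `expected_quality`, `best_wait`, `fanout`, `num_aggregators`, `steps`) are hypothetical. The aggregator's wait duration splits the query deadline: a longer wait collects more process outputs but leaves less time for the aggregated output to reach the root.

```python
from bisect import bisect_right


def empirical_cdf(samples):
    """Return F(t) = fraction of observed durations that are <= t."""
    if not samples:
        raise ValueError("at least one duration sample is required")
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(t):
        return bisect_right(ordered, t) / n

    return cdf


def expected_quality(wait, deadline, lower_cdf, upper_cdf, fanout, num_aggregators):
    """Expected number of process outputs that reach the root by the deadline.

    Assumption: each aggregator waits `wait` for its `fanout` processes, then
    forwards whatever it has; the forwarded output counts only if the upper
    level finishes within the remaining budget `deadline - wait`.
    """
    outputs_collected = fanout * lower_cdf(wait)
    arrives_in_time = upper_cdf(deadline - wait)
    return num_aggregators * arrives_in_time * outputs_collected


def best_wait(deadline, lower_samples, upper_samples, fanout, num_aggregators, steps=100):
    """Grid-search the wait duration that maximizes the expected quality."""
    lower_cdf = empirical_cdf(lower_samples)
    upper_cdf = empirical_cdf(upper_samples)
    candidates = [deadline * i / steps for i in range(steps + 1)]
    return max(
        candidates,
        key=lambda w: expected_quality(
            w, deadline, lower_cdf, upper_cdf, fanout, num_aggregators
        ),
    )
```

A grid search is used here only for simplicity. With more than two levels, the same budget split would have to be applied at each level of the tree, which is where a fine-grained partition of the deadline becomes necessary.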