A parallel computing framework for big data
Guoliang CHEN, Rui MAO, Kezhong LU
A parallel computing framework for big data
Big data has received great attention in research and application. However, most of the current efforts focus on system and application to handle the challenges of “volume” and “velocity”, and not much has been done on the theoretical foundation and to handle the challenge of “variety”. Based on metric-space indexing and computationalcomplexity theory, we propose a parallel computing framework for big data. This framework consists of three components, i.e., universal representation of big data by abstracting various data types into metric space, partitioning of big data based on pair-wise distances in metric space, and parallel computing of big data with the NC-class computing theory.
NC-computing / metric space / data partitioning / parallel computing
[1] |
ZhouB, Liu W, FanC . Big Data: Strategy, Technology and Practice. Beijing: Publishing House of Electronics Industry, 2013
|
[2] |
TuZ. Big Data. Guilin: Guangxi Normal University Press, 2012
|
[3] |
FanW, GeertsF, NevenF. Making queries tractable on big data with preprocessing: through the eyes of complexity theory. Proceedings of the VLDB Endowment, 2013, 6(9): 685–696
CrossRef
Google scholar
|
[4] |
ChenG. Design and Analysis of Parallel Algorithms. Beijing: Higher Education Press, 2011
|
[5] |
CookS A. Deterministic CFL’s are accepted simultaneously in polynomial time and log squared space. In: Proceedings of the 11th Annual ACM Symposium on Theory of Computing. 1979, 338–345
CrossRef
Google scholar
|
[6] |
MaoR, XuH L, WuW B, Li J Q, LiY , LuM H. Overcoming the challenge of variety: big data abstraction, the next evolution of data management for AAL communication system. IEEE Communications Magazine, 2015, 53(1): 42–47
CrossRef
Google scholar
|
[7] |
XiongJ, LuJ, TanF. Topology. Beijing:China Machine Press, 2013
|
[8] |
ChávezE, Navarro G, Baeza-YatesR , MarroquínJ L. Searching in metric spaces. ACM Computing Surveys, 2001, 33(3): 273–321
CrossRef
Google scholar
|
[9] |
ZezulaP, AmatoG, DohnalV, Batko M. Similarity Search: the Metric Space Approach. Springer Science & Business Media, 2006
|
[10] |
MaoR, ZhangP H, LiX L, Liu X, LuM H . Pivot selection for metricspace indexing. International Journal of Machine Learning and Cybernetics, 2016, 7(2): 311–323
CrossRef
Google scholar
|
[11] |
MaoR, Miranker W, MirankerD P . Pivot selection: dimension reduction for distance-based indexing. Journal of Discrete Algorithms, 2012, 13: 32–46
CrossRef
Google scholar
|
[12] |
ShiH M, Schaeffer J. Parallel sorting by regular sampling. Journal of Parallel and Distributed Computing, 1992, 14(4): 361–372
CrossRef
Google scholar
|
[13] |
ValiantL G. Parallelism in comparison problems. SIAM Journal on Computing, 1975, 4(3): 348–355
CrossRef
Google scholar
|
[14] |
ShiloachY, Vishkin U. Finding the maximum. Merging and Sorting in a Parallel Computational Model, 1981, 2(1): 88–102
|
[15] |
ChenG. Balanced (n, m)-selection networks. Computer Research and Development, 1984, 21(11): 9–22
|
[16] |
Samet, H. Foundations of Multidimensional and Metric Data Structures. San Francisco: Morgan-Kaufmann, 2006
|
[17] |
HjaltasonG R, SametH. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems, 2003. 28(4): 517–580
CrossRef
Google scholar
|
[18] |
UhlmannJ K. Satisfying general proximity/similarity queries withmetric trees. Information Processing Letter, 1991, 40(4): 175–179
CrossRef
Google scholar
|
[19] |
YianilosP N. Data structures and algorithms for nearest neighbor search in general metric spaces. In: Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms. 1993
|
[20] |
MaoR, LiuS, XuH L, Zhang D, MirankerD P . On data partitioning in tree structure metric-space indexes. In: Proceedings of the 19th International Conference on Database Systems for Advanced Applications. 2014, 141–155
CrossRef
Google scholar
|
[21] |
BozkayaT, Ozsoyoglu M. Indexing large metric spaces for similarity search queries. ACM Transactions on Database Systems, 1999, 24(3): 361–404
CrossRef
Google scholar
|
[22] |
SergeyB. Near neighbor search in large metric spaces. In: Proceedings of the 21st International Conference on Very Large Data Bases. 1995
|
[23] |
CiacciaP, Patella M, ZezulaP . M-tree: an efficient access method for similarity search in metric spaces. In: proceedings of the 23rd International Conference on Very Large Data Bases. 1997
|
[24] |
NavarroG. Searching in metric spaces by spatial approximation. The VLDB Journal, 2002, 11(1): 28–46
CrossRef
Google scholar
|
[25] |
JagadishH V, OoiB C, TanK L, Yu C, ZhangR . iDistance: an adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems (TODS), 2005, 30(2): 364–397
CrossRef
Google scholar
|
[26] |
ChavezE, Navarro G. Unbalancing: the key to index high dimensional metric spaces. Technical Report. 1999
|
[27] |
MaoR, XuW, RamakrishnanS . NuckollsG, Miranker D P. On optimizing distance-based similarity search for biological databases. In: Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference. 2005, 351–361
CrossRef
Google scholar
|
[28] |
MacQueenJ B. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967, 281–297
|
[29] |
EsterM, Kriegel H P, SanderJ , XuX. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd ACM SIGKDD Conference. 1996. 226–231
|
[30] |
CullerD, KarpR, PattersonD, Sahay A, SchauserK E , SantosE, Subramonian R, von EickenT . LogP: towards a realistic model of parallel computation. In: Proceedings of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 1993, 1–12
CrossRef
Google scholar
|
[31] |
TuB B, ZouM, ZhanJ F, Zhao X F, FanJ P . Research on parallel computation model with memory hierarchy on multi-core clusters. Chinese Journal of Computers, 2008, 31(11): 1948–1955
CrossRef
Google scholar
|
[32] |
LadnerR E. The circuit value problem is log space complete for P. ACM SIGACT News, 1975, 7(1): 18–20
CrossRef
Google scholar
|
[33] |
PippengerN J. On simultaneous resource bounds. In: Proceedings of the 20th IEEE Annual Symposium on Foundations of Computer Science. 1979, 307–311
CrossRef
Google scholar
|
[34] |
ChandraA K, Stockmeyer L J. Alternation. In: Proceedings of the 17th IEEE Annual Symposium on Foundations of Computer Science. 1976, 98–108
CrossRef
Google scholar
|
[35] |
GoldschlagerL M. Synchronous parallel computation. Dissertation for the Doctoral Degree. Toronto: University of Toronto, 1977
|
/
〈 | 〉 |