Efficient query processing framework for big data warehouse: an almost join-free approach
Huiju WANG, Xiongpai QIN, Xuan ZHOU, Furong LI, Zuoyan QIN, Qing ZHU, Shan WANG
Efficient query processing framework for big data warehouse: an almost join-free approach
The rapidly increasing scale of data warehouses is challenging today’s data analytical technologies. A conventional data analytical platform processes data warehouse queries using a star schema — it normalizes the data into a fact table and a number of dimension tables, and during query processing it selectively joins the tables according to users’ demands. This model is space economical. However, it faces two problems when applied to big data. First, join is an expensive operation, which prohibits a parallel database or a MapReduce-based system from achieving efficiency and scalability simultaneously. Second, join operations have to be executed repeatedly, while numerous join results can actually be reused by different queries.
In this paper, we propose a new query processing framework for data warehouses. It pushes the join operations partially to the pre-processing phase and partially to the postprocessing phase, so that data warehouse queries can be transformed into massive parallelized filter-aggregation operations on the fact table. In contrast to the conventional query processing models, our approach is efficient, scalable and stable despite of the large number of tables involved in the join. It is especially suitable for a large-scale parallel data warehouse. Our empirical evaluation on Hadoop shows that our framework exhibits linear scalability and outperforms some existing approaches by an order of magnitude.
data warehouse / large scale / TAMP / join-free / multi-version schema
[1] |
Chaudhuri S, Dayal U. An overview of data warehousing and olap technology. SIGMOD Record, 1997, 26(1): 65-74
CrossRef
Google scholar
|
[2] |
Dean J, Ghemawat S. Mapreduce: Simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating Systems Design and Implementation. 2004, 137-150
|
[3] |
Apache hadoop. http://hadoop.apache.org
|
[4] |
Pavlo A, Paulson E, Rasin A, Abadi D J, DeWitt D J, Madden S, Stonebraker M. A comparison of approaches to large-scale data analysis. In: Proceedings of the 35th SIGMOD International Conference on Management of Data. 2009, 165-178
CrossRef
Google scholar
|
[5] |
Afrati F N, Ullman J D. Optimizing joins in a map-reduce environment. In: Proceedings of the 2010 International Conference on Extending Databas Technology. 2010, 99-110
CrossRef
Google scholar
|
[6] |
Dawei Jiang G CA. K. H. Map-join-reduce: Towards scalable and efficient data analysis on large clusters. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(9): 1299-1311
CrossRef
Google scholar
|
[7] |
Olston C, Reed B, Srivastava U, Kumar R, Tomkins A. Pig latin: a notso-foreign language for data processing. In: Proceedings of the 2008 SIGMOD International Conference on Management of Data. 2008, 1099-1110
CrossRef
Google scholar
|
[8] |
Dittrich J, Quiané-Ruiz J A, Jindal A, Kargin Y, Setty V, Schad J. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). Proceedings of the VLDB Endowment, 2010, 3(1): 518-529
CrossRef
Google scholar
|
[9] |
Floratou A, Patel J M, Shekita E J, Tata S. Column-oriented storage techniques for mapreduce. Proceedings of the VLDB Endowent, 2011, 4(7): 419-429
CrossRef
Google scholar
|
[10] |
Lin Y, Agrawal D, Chen C, Ooi B C, Wu S. LLAMA: leveraging columnar storage for scalable join processing in the mapreduce framework. In: Proceedings of the 2011 SIGMOD International Conference on Management of Data. 2011, 961-972
CrossRef
Google scholar
|
[11] |
Xu Y, Kostamaa P, Gao L. Integrating hadoop and parallel DBMS. In: Proceedings of the 2010 SIGMOD Conference on Management of Data. 2010, 969-974
CrossRef
Google scholar
|
[12] |
Abouzeid A, Bajda-Pawlikowski K, Abadi D J, Rasin A, Silberschatz A. Hadoopdb: An architectural hybrid of mapreduce and DBMS technologies for analytical workloads. Proceedings of the VLDB Endowment, 2009, 2(1): 922-933
CrossRef
Google scholar
|
[13] |
Swami A, Gupta A. Optimization of large join queries. SIGMOD Record, 1988, 17(3): 8-17
CrossRef
Google scholar
|
[14] |
Raman V, Swart G, Qiao L, Reiss F, Dialani V, Kossmann D, Narang I, Sidle R. Constant-time query processing. In: Proceedings of the 2008 International Conference of Data Engineering. 2008, 60-69
CrossRef
Google scholar
|
[15] |
Valduriez P. Join indices. ACM Transactions on Database Systems, 1987, 12: 218-246
CrossRef
Google scholar
|
[16] |
Markl V, Ramsak F, Bayer R. Improving OLAP performance by multidimensional hierarchical clustering. In: Proceedings of the 1999 International Symposium on Database Engineering and Applications. 1999, 165-177
CrossRef
Google scholar
|
[17] |
Karayannidis N, Tsois A, Sellis T K, Pieringer R, Markl V, Ramsak F, Fenk R, Elhardt K, Bayer R. Processing star queries on hierarchically-clustered fact tables. In: Proceedings of the 28th VLDB Conference. 2002, 730-741
CrossRef
Google scholar
|
[18] |
Bayer R. The universal b-tree for multidimensional indexing: general concepts. In: Proceedings of the 1997 International Conference on Worldwide Computing and Its Applications. 1997, 198-209
CrossRef
Google scholar
|
[19] |
Theodoratos D, Tsois A. Heuristic optimization of olap queries in multidimensionally hierarchically clustered databases. In: Proceedings of ACM 4th International Workshop on Data Warehousing and OLAP. 2001, 48-55
CrossRef
Google scholar
|
[20] |
Korth H F, Kuper G M, Feigenbaum J, Gelder A V, Ullman J D. System/u: A database system based on the universal relation assumption. ACM Transactions on Database Systems, 1984, 9(3): 331-347
CrossRef
Google scholar
|
[21] |
Floratou A, Patel J M, Shekita E J, Tata S. Column-oriented storage techn<?Pub Caret?>iques for mapreduce. Proceedings of the VLDB Endowent, 2011, 4(7): 419-429
CrossRef
Google scholar
|
/
〈 | 〉 |