Efficient query processing framework for big data warehouse: an almost join-free approach

Huiju WANG, Xiongpai QIN, Xuan ZHOU, Furong LI, Zuoyan QIN, Qing ZHU, Shan WANG

PDF(527 KB)
PDF(527 KB)
Front. Comput. Sci. ›› 2015, Vol. 9 ›› Issue (2) : 224-236. DOI: 10.1007/s11704-014-4025-6
RESEARCH ARTICLE

Efficient query processing framework for big data warehouse: an almost join-free approach

Author information +
History +

Abstract

The rapidly increasing scale of data warehouses is challenging today’s data analytical technologies. A conventional data analytical platform processes data warehouse queries using a star schema — it normalizes the data into a fact table and a number of dimension tables, and during query processing it selectively joins the tables according to users’ demands. This model is space economical. However, it faces two problems when applied to big data. First, join is an expensive operation, which prohibits a parallel database or a MapReduce-based system from achieving efficiency and scalability simultaneously. Second, join operations have to be executed repeatedly, while numerous join results can actually be reused by different queries.

In this paper, we propose a new query processing framework for data warehouses. It pushes the join operations partially to the pre-processing phase and partially to the postprocessing phase, so that data warehouse queries can be transformed into massive parallelized filter-aggregation operations on the fact table. In contrast to the conventional query processing models, our approach is efficient, scalable and stable despite of the large number of tables involved in the join. It is especially suitable for a large-scale parallel data warehouse. Our empirical evaluation on Hadoop shows that our framework exhibits linear scalability and outperforms some existing approaches by an order of magnitude.

Keywords

data warehouse / large scale / TAMP / join-free / multi-version schema

Cite this article

Download citation ▾
Huiju WANG, Xiongpai QIN, Xuan ZHOU, Furong LI, Zuoyan QIN, Qing ZHU, Shan WANG. Efficient query processing framework for big data warehouse: an almost join-free approach. Front. Comput. Sci., 2015, 9(2): 224‒236 https://doi.org/10.1007/s11704-014-4025-6

References

[1]
Chaudhuri S, Dayal U. An overview of data warehousing and olap technology. SIGMOD Record, 1997, 26(1): 65-74
CrossRef Google scholar
[2]
Dean J, Ghemawat S. Mapreduce: Simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating Systems Design and Implementation. 2004, 137-150
[3]
Apache hadoop. http://hadoop.apache.org
[4]
Pavlo A, Paulson E, Rasin A, Abadi D J, DeWitt D J, Madden S, Stonebraker M. A comparison of approaches to large-scale data analysis. In: Proceedings of the 35th SIGMOD International Conference on Management of Data. 2009, 165-178
CrossRef Google scholar
[5]
Afrati F N, Ullman J D. Optimizing joins in a map-reduce environment. In: Proceedings of the 2010 International Conference on Extending Databas Technology. 2010, 99-110
CrossRef Google scholar
[6]
Dawei Jiang G CA. K. H. Map-join-reduce: Towards scalable and efficient data analysis on large clusters. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(9): 1299-1311
CrossRef Google scholar
[7]
Olston C, Reed B, Srivastava U, Kumar R, Tomkins A. Pig latin: a notso-foreign language for data processing. In: Proceedings of the 2008 SIGMOD International Conference on Management of Data. 2008, 1099-1110
CrossRef Google scholar
[8]
Dittrich J, Quiané-Ruiz J A, Jindal A, Kargin Y, Setty V, Schad J. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). Proceedings of the VLDB Endowment, 2010, 3(1): 518-529
CrossRef Google scholar
[9]
Floratou A, Patel J M, Shekita E J, Tata S. Column-oriented storage techniques for mapreduce. Proceedings of the VLDB Endowent, 2011, 4(7): 419-429
CrossRef Google scholar
[10]
Lin Y, Agrawal D, Chen C, Ooi B C, Wu S. LLAMA: leveraging columnar storage for scalable join processing in the mapreduce framework. In: Proceedings of the 2011 SIGMOD International Conference on Management of Data. 2011, 961-972
CrossRef Google scholar
[11]
Xu Y, Kostamaa P, Gao L. Integrating hadoop and parallel DBMS. In: Proceedings of the 2010 SIGMOD Conference on Management of Data. 2010, 969-974
CrossRef Google scholar
[12]
Abouzeid A, Bajda-Pawlikowski K, Abadi D J, Rasin A, Silberschatz A. Hadoopdb: An architectural hybrid of mapreduce and DBMS technologies for analytical workloads. Proceedings of the VLDB Endowment, 2009, 2(1): 922-933
CrossRef Google scholar
[13]
Swami A, Gupta A. Optimization of large join queries. SIGMOD Record, 1988, 17(3): 8-17
CrossRef Google scholar
[14]
Raman V, Swart G, Qiao L, Reiss F, Dialani V, Kossmann D, Narang I, Sidle R. Constant-time query processing. In: Proceedings of the 2008 International Conference of Data Engineering. 2008, 60-69
CrossRef Google scholar
[15]
Valduriez P. Join indices. ACM Transactions on Database Systems, 1987, 12: 218-246
CrossRef Google scholar
[16]
Markl V, Ramsak F, Bayer R. Improving OLAP performance by multidimensional hierarchical clustering. In: Proceedings of the 1999 International Symposium on Database Engineering and Applications. 1999, 165-177
CrossRef Google scholar
[17]
Karayannidis N, Tsois A, Sellis T K, Pieringer R, Markl V, Ramsak F, Fenk R, Elhardt K, Bayer R. Processing star queries on hierarchically-clustered fact tables. In: Proceedings of the 28th VLDB Conference. 2002, 730-741
CrossRef Google scholar
[18]
Bayer R. The universal b-tree for multidimensional indexing: general concepts. In: Proceedings of the 1997 International Conference on Worldwide Computing and Its Applications. 1997, 198-209
CrossRef Google scholar
[19]
Theodoratos D, Tsois A. Heuristic optimization of olap queries in multidimensionally hierarchically clustered databases. In: Proceedings of ACM 4th International Workshop on Data Warehousing and OLAP. 2001, 48-55
CrossRef Google scholar
[20]
Korth H F, Kuper G M, Feigenbaum J, Gelder A V, Ullman J D. System/u: A database system based on the universal relation assumption. ACM Transactions on Database Systems, 1984, 9(3): 331-347
CrossRef Google scholar
[21]
Floratou A, Patel J M, Shekita E J, Tata S. Column-oriented storage techn<?Pub Caret?>iques for mapreduce. Proceedings of the VLDB Endowent, 2011, 4(7): 419-429
CrossRef Google scholar

RIGHTS & PERMISSIONS

2014 Higher Education Press and Springer-Verlag Berlin Heidelberg
AI Summary AI Mindmap
PDF(527 KB)

Accesses

Citations

Detail

Sections
Recommended

/