Fast correlation coefficient estimation algorithm for HBase-based massive time series data

Wen LIU; Tuqian ZHANG; Yanming SHEN; Peng WANG

doi:10.1007/s11704-018-6308-9

Front. Comput. Sci., 2019, Vol. 13, Issue 4: 864–878. DOI: 10.1007/s11704-018-6308-9

RESEARCH ARTICLE

Abstract

In recent years, the rapid development of the Internet of Things and sensor networks has led to explosive growth in time series data. OpenTSDB and other emerging systems have begun to use Hadoop and HBase to store massive time series data, and how to query and mine time series data on these platforms has become an active research topic. As a typical distance measure for time series, the correlation coefficient is widely used in various applications. However, computing the correlation coefficient of long time series on HBase in real time requires a large amount of I/O and network transmission, so it cannot be applied to interactive queries. To address this problem, we present two methods for estimating the correlation coefficient of two sequences on HBase. We first propose DCE, a fast algorithm for estimating the upper and lower bounds of the correlation coefficient. To further reduce the I/O cost, we extend DCE into the ADCE algorithm, which estimates the correlation coefficient quickly in an iterative manner. Experiments show that the proposed algorithms can quickly calculate the correlation coefficient of long time series.
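For context, the exact quantity being estimated can be sketched as follows. This is a minimal illustration of the standard Pearson correlation coefficient computed in one pass over two equal-length sequences; it is not the DCE or ADCE algorithm, and the function name `pearson` is a hypothetical choice made here. It shows why the exact computation must read every point of both series, which is the I/O cost the abstract refers to.

```python
import math


def pearson(xs, ys):
    """Exact Pearson correlation of two equal-length numeric sequences.

    Illustrative baseline only (assumed name and interface); every point
    of both series is read once, so cost grows linearly with length.
    """
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points")

    # Running sums gathered in a single pass over both series.
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for x, y in zip(xs, ys):
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y

    # n^2-scaled covariance and variances.
    cov = n * sum_xy - sum_x * sum_y
    var_x = n * sum_xx - sum_x * sum_x
    var_y = n * sum_yy - sum_y * sum_y
    if var_x <= 0 or var_y <= 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
    print(pearson([1, 2, 3], [3, 2, 1]))
```

Because the result depends only on five running sums, the computation itself is cheap; the expense on a store such as HBase comes from fetching the raw points, which is what an estimation approach tries to avoid.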

Keywords

time series / HBase / correlation coefficient / fast estimation

Cite this article

Wen LIU, Tuqian ZHANG, Yanming SHEN, Peng WANG. Fast correlation coefficient estimation algorithm for HBase-based massive time series data. Front. Comput. Sci., 2019, 13(4): 864–878. DOI: 10.1007/s11704-018-6308-9
