Fast correlation coefficient estimation algorithm for HBase-based massive time series data
Wen LIU, Tuqian ZHANG, Yanming SHEN, Peng WANG
In recent years, the rapid development of the Internet of Things and sensor networks has led to explosive growth in time series data. OpenTSDB and other emerging systems have begun to use Hadoop and HBase to store massive time series data, and how to query and mine time series data on these platforms has become a research hotspot. As a typical distance measure for time series, the correlation coefficient is widely used in various applications. However, computing the correlation coefficient of long time series on HBase in real time requires a large amount of I/O and network transmission, so it cannot be applied to interactive queries. To address this problem, we present two methods to estimate the correlation coefficient of two sequences stored on HBase. We first propose a fast estimation algorithm for the upper and lower bounds of the correlation coefficient, named DCE. To further reduce the I/O cost, we extend DCE and propose the ADCE algorithm, which estimates the correlation coefficient quickly in an iterative manner. Experiments show that the proposed algorithms can quickly calculate the correlation coefficient of long time series.
time series / HBase / correlation coefficient / fast estimation
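To make the cost described in the abstract concrete, the following is a minimal sketch of the exact baseline, not of DCE or ADCE, whose details are not given here. It computes the Pearson correlation coefficient from running sums over aligned chunks of two series. The function name `pearson_from_chunks` and the chunked input format are assumptions made for illustration; in an HBase setting each chunk would presumably correspond to a block of rows fetched by a scan, and every point of both series must be read, which is the I/O cost the estimation algorithms aim to avoid.

```python
import math
from typing import Iterable, Sequence, Tuple


def pearson_from_chunks(
    chunks: Iterable[Tuple[Sequence[float], Sequence[float]]]
) -> float:
    """Exact Pearson correlation of two series delivered as aligned chunks.

    Hypothetical helper for illustration only: it accumulates the five
    running sums that the correlation coefficient depends on, so it has
    to touch every point of both series.
    """
    n = 0
    sx = sy = sxx = syy = sxy = 0.0
    for xs, ys in chunks:
        if len(xs) != len(ys):
            raise ValueError("chunks must be aligned and of equal length")
        for x, y in zip(xs, ys):
            n += 1
            sx += x
            sy += y
            sxx += x * x
            syy += y * y
            sxy += x * y

    # r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    if var_x <= 0 or var_y <= 0:
        raise ValueError("correlation is undefined for a constant or empty series")
    return cov / math.sqrt(var_x * var_y)


# Two series split into two aligned chunks each.
chunks = [([1.0, 2.0], [2.0, 4.0]), ([3.0, 4.0], [6.0, 8.0])]
r = pearson_from_chunks(chunks)
```

Because the running sums are additive, partial sums computed per chunk can be combined in any order. The sketch still reads every value once, so its cost grows linearly with the series length.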