Adaptive watermark generation mechanism based on time series prediction for stream processing
Yang SONG, Yunchun LI, Hailong YANG, Jun XU, Zerong LUAN, Wei LI
Adaptive watermark generation mechanism based on time series prediction for stream processing
The data stream processing framework processes the stream data based on event-time to ensure that the request can be responded to in real-time. In reality, streaming data usually arrives out-of-order due to factors such as network delay. The data stream processing framework commonly adopts the watermark mechanism to address the data disorderedness. Watermark is a special kind of data inserted into the data stream with a timestamp, which helps the framework to decide whether the data received is late and thus be discarded. Traditional watermark generation strategies are periodic; they cannot dynamically adjust the watermark distribution to balance the responsiveness and accuracy. This paper proposes an adaptive watermark generation mechanism based on the time series prediction model to address the above limitation. This mechanism dynamically adjusts the frequency and timing of watermark distribution using the disordered data ratio and other lateness properties of the data stream to improve the system responsiveness while ensuring acceptable result accuracy. We implement the proposed mechanism on top of Flink and evaluate it with realworld datasets. The experiment results show that our mechanism is superior to the existing watermark distribution strategies in terms of both system responsiveness and result accuracy.
data stream processing / watermark / time series based prediction / dynamic adjustment
[1] |
Iqbal M H, Soomro T R. Big data analysis: apache storm perspective. International Journal of Computer Trends and Technology, 2015, 19(1): 9–14
CrossRef
Google scholar
|
[2] |
Armbrust M, Das T, Torres J, Yavuz B, Zaharia M. Structured streaming: a declarative api for real-time applications in apache spark. In: Proceedings of the 2018 International Conference on Management of Data. 2018, 601–613
CrossRef
Google scholar
|
[3] |
Carbone P, Katsifodimos A, Sweden S, Tzoumas K. Apache flink: stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015, 36(4): 28–37
|
[4] |
Akidau T, Schmidt E, Whittle S, Bradshaw RPerry F. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB Endowment, 2015, 8(12): 1792–1803
CrossRef
Google scholar
|
[5] |
Akidau T, Balikov A, Bekiroˇglu K, Chernyak S, Haberman L. Mill-Wheel: fault-tolerant stream processing at internet scale. Proceedings of the VLDB Endowment, 2013, 6(11): 1033–1044
CrossRef
Google scholar
|
[6] |
Awad A, Traub J, Sakr S. Adaptive watermarks: a concept drift-based approach for predicting event-time progress in data streams. In: Proceedings of the 22nd International Conference on Extending Database Technology. 2019
|
[7] |
Barlow H B. Unsupervised learning. Neural Computation, 1989, 1(3): 295–311
CrossRef
Google scholar
|
[8] |
Gers F A, Eck D, Schmidhuber J. Applying LSTM to time series predictable through time-window approaches. In: Proceedings of International Conference on Artificial Neural Networks. 2001, 669–676
CrossRef
Google scholar
|
[9] |
Chen T, Guestrin C. Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd ACM Sigkdd International Conference on Knowledge Discovery and Data Mining. 2016, 785–794
CrossRef
Google scholar
|
[10] |
Tucker P A, Maier D, Sheard T, Fegaras L. Exploiting punctuation semantics in continuous data streams. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(3): 555–568
CrossRef
Google scholar
|
[11] |
Sun D, Hwang S. DSSP: stream split processing model for high correctness of out-of-order data processing. In: Proceedings of the 1st IEEE International Conference on Artificial Intelligence and Knowledge Engineering. 2018, 193–197
CrossRef
Google scholar
|
[12] |
Mutschler C, Philippsen M. Distributed low-latency out-of-order event processing for high data rate sensor streams. In: Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing. 2013, 1133–1144
CrossRef
Google scholar
|
[13] |
Babu S, Srivastava U, Widom J. Exploiting k-constraints to reduce memory overhead in continuous queries over data streams. ACM Transactions on Database Systems, 2004, 29(3): 545–580
CrossRef
Google scholar
|
[14] |
Kuralenok I E, Marshalkin N, Trofimov A, Novikov B. An optimistic approach to handle out-of-order events within analytical stream processing. In: Proceedings of CEUR Workshop Proceedings. 2018, 22–29
|
[15] |
Dries A, Röckert U. Adaptive concept drift detection. Statistical Analysis and DataMining: The ASA Data Science Journal, 2009, 2(5–6): 311–327
CrossRef
Google scholar
|
[16] |
Bifet A, Gavalda R. Learning from time-changing data with adaptive windowing. In: Proceedings of the 2007 SIAM International Conference on Data Mining. 2007, 443–448
CrossRef
Google scholar
|
[17] |
Thein K M M. Apache kafka: next generation distributed messaging system. International Journal of Scientific Engineering and Technology Research, 2014, 3(47): 9478–9483
|
[18] |
Das S. Time Series Analysis. Princeton University Press, Princeton, NJ, 1994
|
[19] |
Gal Y, Ghahramani Z. A theoretically grounded application of dropout in recurrent neural networks. In: Proceedings of the 30th International Con ference on Neural Information Processing Systems. 2016, 1019–1027
|
[20] |
Sanjappa S, Ahmed M. Analysis of logs by using logstash. In: Proceedings of the 5th International Conference on Frontiers in Intelligent Computing: Theory and Applications. 2017, 579–585
CrossRef
Google scholar
|
/
〈 | 〉 |