Adaptive watermark generation mechanism based on time series prediction for stream processing

Yang SONG; Yunchun LI; Hailong YANG; Jun XU; Zerong LUAN; Wei LI

doi:10.1007/s11704-020-0206-7


Front. Comput. Sci., 2021, Vol. 15, Issue 6: 156213. DOI: 10.1007/s11704-020-0206-7

RESEARCH ARTICLE


Yang SONG¹,², Yunchun LI¹,², Hailong YANG¹,², Jun XU³, Zerong LUAN⁴, Wei LI¹,²


Abstract

Data stream processing frameworks process stream data based on event time so that requests can be answered in real time. In practice, streaming data often arrives out of order because of factors such as network delay. To handle this disorder, frameworks commonly adopt a watermark mechanism. A watermark is a special record carrying a timestamp that is inserted into the data stream; it helps the framework decide whether a received record is late and should therefore be discarded. Traditional watermark generation strategies are periodic and cannot dynamically adjust the watermark distribution to balance responsiveness and accuracy. To address this limitation, this paper proposes an adaptive watermark generation mechanism based on a time series prediction model. The mechanism dynamically adjusts the frequency and timing of watermark distribution using the disordered data ratio and other lateness properties of the data stream, improving system responsiveness while ensuring acceptable result accuracy. We implement the proposed mechanism on top of Flink and evaluate it with real-world datasets. The experimental results show that our mechanism outperforms existing watermark distribution strategies in terms of both system responsiveness and result accuracy.
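To make the trade-off described in the abstract concrete, the following minimal Python sketch models a periodic watermark generator of the traditional kind. It is an illustration only and not the mechanism proposed in the paper or Flink's actual API: the class name `PeriodicWatermarkGenerator`, the parameters `max_delay` and `period`, and the sample stream are all assumptions made for this example. A record is treated as late when its event time is not greater than the current watermark.

```python
from dataclasses import dataclass


@dataclass
class PeriodicWatermarkGenerator:
    """Hypothetical baseline: emits a watermark every `period` records.

    The watermark trails the largest event time seen so far by `max_delay`
    time units (an assumed fixed bound on out-of-orderness).
    """
    max_delay: int
    period: int
    max_event_time: float = float("-inf")
    watermark: float = float("-inf")
    seen: int = 0
    late: int = 0

    def on_event(self, event_time: int) -> bool:
        """Process one record; return False if it is late, True otherwise."""
        self.seen += 1
        is_late = event_time <= self.watermark
        if is_late:
            self.late += 1
        else:
            self.max_event_time = max(self.max_event_time, event_time)
        # Periodic emission: advance the watermark every `period` records.
        if self.seen % self.period == 0:
            candidate = self.max_event_time - self.max_delay
            self.watermark = max(self.watermark, candidate)
        return not is_late

    def late_ratio(self) -> float:
        """Fraction of records that arrived behind the watermark."""
        return self.late / self.seen if self.seen else 0.0


# Assumed sample stream of event times arriving out of order.
stream = [1, 2, 5, 3, 6, 9, 4, 10]

tight = PeriodicWatermarkGenerator(max_delay=2, period=3)
tight_results = [tight.on_event(t) for t in stream]

loose = PeriodicWatermarkGenerator(max_delay=6, period=3)
loose_results = [loose.on_event(t) for t in stream]

print(tight.watermark, tight.late_ratio())
print(loose.watermark, loose.late_ratio())
```

With the small delay the watermark advances quickly but two of the eight records are dropped as late; with the large delay nothing is dropped, yet the watermark lags further behind. A fixed setting cannot adapt between these two regimes, which is the limitation the adaptive mechanism in the paper targets.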

Keywords

data stream processing / watermark / time series based prediction / dynamic adjustment

Cite this article


Yang SONG, Yunchun LI, Hailong YANG, Jun XU, Zerong LUAN, Wei LI. Adaptive watermark generation mechanism based on time series prediction for stream processing. Front. Comput. Sci., 2021, 15(6): 156213. https://doi.org/10.1007/s11704-020-0206-7

