A comprehensive study on fault tolerance in stream processing systems

Xiaotong WANG, Chunxi ZHANG, Junhua FANG, Rong ZHANG, Weining QIAN, Aoying ZHOU

PDF(9155 KB)
PDF(9155 KB)
Front. Comput. Sci. ›› 2022, Vol. 16 ›› Issue (2) : 162603. DOI: 10.1007/s11704-020-0248-x
Information Systems
REVIEW ARTICLE

A comprehensive study on fault tolerance in stream processing systems

Author information +
History +

Abstract

Stream processing has emerged as a useful technology for applications which require continuous and low latency computation on infinite streaming data. Since stream processing systems (SPSs) usually require distributed deployment on clusters of servers in face of large-scale of data, it is especially common to meet with failures of processing nodes or communication networks, but should be handled seriously considering service quality. A failed system may produce wrong results or become unavailable, resulting in a decline in user experience or even significant financial loss. Hence, a large amount of fault tolerance approaches have been proposed for SPSs. These approaches often have their own priorities on specific performance concerns, e.g., runtime overhead and recovery efficiency. Nevertheless, there is a lack of a systematic overview and classification of the state-of-the-art fault tolerance approaches in SPSs, which will become an obstacle for the development of SPSs. Therefore, we investigate the existing achievements and develop a taxonomy of the fault tolerance in SPSs. Furthermore, we propose an evaluation framework tailored for fault tolerance, demonstrate the experimental results on two representative open-sourced SPSs and exposit the possible disadvantages in current designs. Finally, we specify future research directions in this domain.

Graphical abstract

Keywords

fault tolerance / performance evaluation / stream processing

Cite this article

Download citation ▾
Xiaotong WANG, Chunxi ZHANG, Junhua FANG, Rong ZHANG, Weining QIAN, Aoying ZHOU. A comprehensive study on fault tolerance in stream processing systems. Front. Comput. Sci., 2022, 16(2): 162603 https://doi.org/10.1007/s11704-020-0248-x

References

[1]
Naughton J F , DeWitt D J , Maier D , Aboulnaga A , Chen J J , Galanis L , Kang J , Krishnamurthy R , Luo Q , Prakash N , Ramamurthy R , Shanmugasundaram J , Tian F , Tufte K , Viglas S , Wang Y , Zhang C , Jackson B , Chen R . The niagara internet query system. IEEE Data Engineering Bulletin, 2001, 24( 2): 27– 33
[2]
Abadi D J , Carney D , Çetintemel U , Cherniack M , Convey C , Lee S , Stonebraker M , Tatbul N , B Zdonik S . Aurora: a new model and architecture for data stream management. The International Journal on Very Large Data Bases, 2003, 12( 2): 120– 139
[3]
Motwani R, Arasu J, Widomand A, Babcock B, Babu S, Datar M, Olston G S, Mankuand C, Varma J, Rosensteinand R. Query processing, approximation, and resource management in a data stream management system. In: Proceedings of International Conference on Innovative Data Systems Research. 2003
[4]
Chandrasekaran S, Cooper O, Deshpande A, J Franklin M, M Hellerstein J, Hong W, Krishnamurthy S, Madden S, Raman V, Reiss F, A Shah M. Telegraphcq: continuous dataflow processing for an uncertain world. In: Proceedings of Conference on Innovative Data Systems Research. 2003
[5]
Heinze T, Aniello L, Querzoni L, Jerzak Z. Cloud-based data stream processing. In: Proceedings of ACM International Conference on Distributed Event-Based Systems. 2014. 238–245
[6]
Cherniack M, Balakrishnan H, Balazinska M, Carney D, Çetintemel U, Xing Y, Zdonik S B. Scalable distributed stream processing. In: Proceedings of International Conference on Innovative Data Systems Research. 2003
[7]
Abadi D J, Ahmad Y, Balazinska M, Çetintemel U, Cherniack M, Hwang J H, Lindner W, Maskey A, Rasin A, Ryvkina E, Tatbul N, Xing Y, B Zdonik S. The design of the borealis stream processing engine. In: Proceedings of Conference on Innovative Data Systems Research. 2005. 277–289
[8]
Dean J Ghemawat S. Mapreduce: simplified data processing on large clusters. In: Proceedings of USENIX Symposium on Operating System Design and Implementation. 2004, 137–150
[9]
Toshniwal A, Taneja S, Shukla A, Ramasamy K, M Patel J, Kulkarni S, Jackson J, Gade K, Fu M S, Donham J, Bhagat N, Mittal S, V Ryaboy D. Storm@twitter. In: Proceedings of ACM International Conference on Management of Data. 2014, 147–156
[10]
Zaharia M, Das T, Li H Y, Hunter T, Shenker S, Stoica I. Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of ACM Symposium on Operating Systems Principles. 2013, 423–438
[11]
Carbone P , Katsifodimos A , Ewen S , Markl V , Haridi S , Tzoumas K . Apache flink TM: stream and batch processing in a single engine. IEEE Data Engineering Bulletin, 2015, 38( 4): 28– 38
[12]
Noghabi S A, Paramasivam K, Pan Y, Ramesh N, Bringhurst J, Gupta I, H Campbell R. Stateful scalable stream processing at linkedin. Proceedings of the VLDB Endowment, 2017, 10(12): 1634–1645
[13]
Gill P, Jain N, Nagappan N. Understanding network failures in data centers: measurement, analysis, and implications. In: Proceedings of ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. 2011, 350–361
[14]
Dean J. Handling large datasets at google: current systems and future directions. In Data-Intensive Computing Symposium, 2008
[15]
V Vishwanath K Nagappan N. Characterizing cloud computing hardware reliability. In: Proceedings of Symposium on Cloud Computing. 2010, 193–204
[16]
Shankland S. Google spotlights data center inner workings. CNET News, 2008
[17]
Raphael J R. In pictures: the worst cloud outages of 2013. PCWorld, 2013
[18]
Shah M A, Hellerstein J M, Brewer E A. Highly-available, faulttolerant, parallel dataflows. In: Proceedings of ACM International Conference on Management of Data. 2004, 827–838
[19]
Hwang J H, Balazinska M, Rasin A, Çetintemel U, Stonebraker M, Zdonik S B. High-availability algorithms for distributed stream processing. In: Proceedings of IEEE International Conference on Data Engineering. 2005, 779–790
[20]
Balazinska M, Balakrishnan H, Madden S, Stonebraker M. Faulttolerance in the borealis distributed stream processing system. In: Proceedings of ACM International Conference on Management of Data. 2005, 13–24
[21]
Hwang J H, Çetintemel U, B Zdonik S. Fast and highly-available stream processing over wide area networks. In: Proceedings of IEEE International Conference on Data Engineering. 2008, 804–813
[22]
Hwang J H, Xing Y, Çetintemel U, B Zdonik S. A cooperative, self-configuring high-availability solution for stream processing. In: Proceedings of IEEE International Conference on Data Engineering. 2007, 176–185
[23]
Kwon Y C , Balazinska M , G Greenberg A . Fault-tolerant stream processing using a distributed, replicated file system. Proceedings of the VLDB Endowment, 2008, 1( 1): 574– 585
[24]
Gu Y, Zhang Z, Ye F, Yang H, Kim M, Lei H, Liu Z. An empirical study of high availability in stream processing systems. In: Proceedings of ACM/IFIP/USENIX International Middleware Conference. 2009
[25]
Sebepou Z Magoutis K. Cec: continuous eventual checkpointing for data stream processing operators. In: Proceedings of International Conference on Dependable Systems and Networks. 2011, 145–156
[26]
Fernandez R C, Migliavacca M, Kalyvianaki E, R Pietzuch P. Integrating scale out and fault tolerance in stream processing using operator state management. In: Proceedings of ACM International Conference on Management of Data. 2013, 725–736
[27]
Koldehofe B, Mayer R, Ramachandran U, Rothermel K, Völz M. Rollback-recovery without checkpoints in distributed event processing systems. In: Proceedings of ACM International Conference on Distributed Event-Based Systems. 2013, 27–38
[28]
Huang Q Lee P . Toward high-performance distributed stream processing via approximate fault tolerance. Proceedings of the VLDB Endowment, 2016, 10( 3): 73– 84
[29]
Zhang Z, Gu Y, Ye F, Yang H, Kim M, Lei H, Liu Z. A hybrid approach to high availability in stream processing systems. In: Proceedings of IEEE International Conference on Distributed Computing Systems. 2010, 138–148
[30]
Heinze T, Zia M, Krahn R, Jerzak Z, Fetzer C. An adaptive replication scheme for elastic data stream processing systems. In: Proceedings of ACM International Conference on Distributed Event-Based Systems. 2015, 150–161
[31]
Su L Zhou Y L. Tolerating correlated failures in massively parallel stream processing engines. In: Proceedings of IEEE International Conference on Data Engineering. 2016, 517–528
[32]
Su L Zhou Y L . Passive and partially active fault tolerance for massively parallel stream processing engines. IEEE Transactions on Knowledge and Data Engineering, 2019, 31( 1): 32– 45
[33]
Martin A, Smaneoto T, Dietze T, Brito A, Fetzer C. User-constraint and self-adaptive fault tolerance for event stream processing systems. In: Proceedings og IEEE/IFIP International Conference on Dependable Systems and Networks. 2015
[34]
Schneider F B . Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys, 1990, 22( 4): 299– 319
[35]
Chandra T D, Griesemer R, Redstone J. Paxos made live: an engineering perspective. In: Proceedings of ACM Symposium on Principles of Distributed Computing. 2007, 398–407
[36]
Elnozahy E N , Alvisi L , Wang Y M , Johnson D B . A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 2002, 34( 3): 375– 408
[37]
Balazinska M, Hwang J H, Shah M A. Fault-tolerance and high availability in data stream management systems. In Encyclopedia of Database Systems. 2009, 1109–1115
[38]
Gradvohl A L S , Senger H , Arantes L , Sens P . Comparing distributed online stream processing systems considering fault tolerance issues. Journal of Emerging Technologies in Web Intelligence, 2014, 6( 2): 174– 179
[39]
Nasir M A U. Fault tolerance for stream processing engines. arXiv preprint, abs/1605.00928, 2016
[40]
Arasu A, Cherniack M, Galvez E F, Maier D, Maskey A, Ryvkina E, Stonebraker M, Tibbetts R. Linear road: a stream data management benchmark. In: Proceedings of ACM International Conference on Very Large Data Bases. 2004, 480–491
[41]
Lu R R, Wu G, Xie B, Hu J T. Streambench: towards benchmarking modern distributed stream computing frameworks. In: Proceedings of IEEE/ACM International Conference on Utility and Cloud Computing. 2014, 69–78
[42]
Chintapalli S, Dagit D, Evans B, Farivar R, Graves T, Holderbaugh M, Liu Z, Nusbaum K, Patil K, Peng B Y, Poulosky P. Benchmarking streaming computation engines: storm, flink and spark streaming. In: Proceedings of IEEE International Parallel and Distributed Processing Symposium Workshops. 2016, 1789–1792
[43]
Grier J. Extending the yahoo! streaming benchmark. Ververica, 2016
[44]
Wang Y J. Stream processing systems benchmark: streambench. Master’s thesis, Aalto University, 2016
[45]
Shukla A, Chaturvedi S, Simmhan Y. Riotbench: a real-time iot benchmark for distributed stream processing platforms. arXiv preprint, abs/1701.08530, 2017
[46]
Bordin M V. A benchmark suite for distributed stream processing systems. PhD thesis, Universidade Federal do Rio Grande Do Su, 2017
[47]
Karimov J, Rabl T, Katsifodimos A, Samarev R, Heiskanen H, Markl V. Benchmarking distributed stream data processing systems. In: Proceedings of IEEE International Conference on Data Engineering. 2018, 1507–1518
[48]
Venkataraman S, Panda A, Ousterhout K, Armbrust M, Ghodsi A, Franklin M J, Recht B, Stoica I. Drizzle: fast and adaptable stream processing at scale. In: Proceedings of ACM Symposium on Operating Systems Principles. 2017, 374–389
[49]
Akidau T , Bradshaw R , Chambers C , Chernyak S , FernándezMoctezuma R , Lax R , McVeety S , Mills D , Perry F , Schmidt E , Whittle S . The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-oforder data processing. Proceedings of the VLDB Endowment, 2015, 8( 12): 1792– 1803
[50]
Tucker P A , Maier D , Sheard T , Fegaras L . Exploiting punctuation semantics in continuous data streams. IEEE Transactions on Knowledge and Data Engineering, 2003, 15( 3): 555– 568
[51]
Chandy K M Lamport L . Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems, 1985, 3( 1): 63– 75
[52]
Wang Y M , Chung P Y , Lin I J , Fuchs W K . Checkpoint space reclamation for uncoordinated checkpointing in message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 1995, 6( 5): 546– 554
[53]
Neumeyer L, Robbins B, Nair A, Kesari A. S4: distributed stream computing platform. In: Proceedings of IEEE International Conference on Data Mining Workshops. 2010, 170–177
[54]
Randell B . System structure for software fault tolerance. IEEE Transactions on Software Engineering, 1975, 1( 2): 221– 232
[55]
Murray D G, McSherry F, Isaacs R, Isard M, Barham P, Abadi M. Naiad: a timely dataflow system. In: Proceedings of ACM Symposium on Operating Systems Principles. 2013, 439–455
[56]
Akidau T , Balikov A , Bekiroglu K , Chernyak S , Haberman J , Lax R , McVeety S , Mills D , Nordstrom P , Whittle S . Millwheel: faulttolerant stream processing at internet scale. Proceedings of the VLDB Endowment, 2013, 6( 11): 1033– 1044
[57]
Qian Z P, He Y, Su C Z, Wu Z J, Zhu H Y, Zhang T Z, Zhou L D, Yu Y, Zhang Z. Timestream: reliable stream computation in the cloud. In: Proceedings of Eurosys Conference. 2013, 1–14
[58]
Bhargava B K S Lian. Independent checkpointing and concurrent rollback for recovery in distributed systems - an optimistic approach. In: Proceedings of Symposium on Reliable Distributed Systems. 1988, 3–12
[59]
Feldman S I Brown C B. Igor: a system for program debugging via reversible execution. In: Proceedings of ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging. 1988, 112–123
[60]
Ghemawat S, Gobioff H, Leung S. The google file system. In: Proceedings of ACM Symposium on Operating Systems Principles. 2003, 29–43
[61]
Shvachko K, Kuang H, Radia S, Chansler R. The hadoop distributed file system. In: Proceedings of IEEE Symposium on Mass Storage Systems and Technologies. 2010, 1–10
[62]
Pohl C, Götze P, Sattler K. A cost model for data stream processing on modern hardware. In: Proceedings of International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures. 2017
[63]
Zeuch S , Breß S , Rabl T , Monte B D , Karimov J , Lutz C , Renz M , Traub J , Markl V . Analyzing efficient stream processing on modern hardware. Proceedings of the VLDB Endowment, 2019, 12( 5): 516– 530

Acknowledgements

The work was supported by the National Key Research and Development Plan Project (2018YFB1003404).

RIGHTS & PERMISSIONS

2022 Higher Education Press
AI Summary AI Mindmap
PDF(9155 KB)

Accesses

Citations

Detail

Sections
Recommended

/