Pluggable AI-based real-time stragglers detection framework in Hadoop

Xinyuan Liu; Yinhao Li; Rajiv Ranjan; Devki Nandan Jha

doi:10.1016/j.hcc.2025.100341

High-Confidence Computing, 2026, Vol. 6, Issue 1: 100341

Research Articles


Abstract

The growing reliance on big data frameworks such as Hadoop has revolutionized data processing across various domains, enabling large-scale storage and distributed computation. Hadoop is widely employed in real-world applications such as high-performance computing, e-commerce and healthcare data analysis. However, the efficiency of Hadoop systems is often hampered by faults and anomalies, with stragglers emerging as one of the most prevalent issues. Stragglers disrupt workflows, waste resources and degrade system performance. Existing anomaly detection models rely on methods such as median analysis or static thresholds, and they often suffer from high false-positive rates, a lack of adaptability and poor handling of complex heterogeneous environments. To address these challenges, this paper presents Plabs, a flexible straggler detection framework for Hadoop. The framework comprises two core components: (1) a Monitoring Module that provides real-time tracking of cluster resources and task progress and (2) a Pluggable AI-based Straggler Detection Module designed for precise identification of straggler tasks. By combining monitoring with AI-driven analysis, Plabs offers an automated, flexible and scalable solution for detecting stragglers at run-time in Hadoop clusters. We evaluated Plabs with three Machine Learning (ML) models, two Deep Learning (DL) models and two Large Language Models (LLMs) on five different applications in a real testbed environment. Our experimental evaluation shows that the DL models outperform the others in identifying Hadoop stragglers, achieving higher accuracy and reliability across all the applications.

Keywords

Anomaly detection / Big data / Hadoop / Pluggable AI models

Cite this article


Xinyuan Liu, Yinhao Li, Rajiv Ranjan, Devki Nandan Jha. Pluggable AI-based real-time stragglers detection framework in Hadoop. High-Confidence Computing, 2026, 6(1): 100341 DOI:10.1016/j.hcc.2025.100341


CRediT authorship contribution statement

Xinyuan Liu: Software, Validation, Writing - original draft, Formal analysis, Methodology, Investigation. Yinhao Li: Writing - review & editing, Supervision. Rajiv Ranjan: Supervision, Funding acquisition, Conceptualization, Writing - review & editing. Devki Nandan Jha: Writing - review & editing, Formal analysis, Conceptualization, Project administration, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This study was partly supported by the National Edge AI Hub for Real Data: Edge Intelligence for Cyberdisturbances and Data Quality (UKRI EPSRC EP/Y028813/1).
