From chaos to clarity: log-based kernel panic root cause analysis for large-scale cloud services

Tianyu CUI, Yang ZHANG, Shenglin ZHANG, Xin WU, Yicheng SUI, Liangyan PENG, Yuhe JI, Feng WANG, Changchang LIU, Zeyu CHE, Xiaozhou LIU, Yongqian SUN, Yu ZHANG

Front. Comput. Sci., 2027, Vol. 21, Issue 6: 2106206

DOI: 10.1007/s11704-025-50788-0
Software
RESEARCH ARTICLE
Abstract

Operating system (OS) kernel panics, which are triggered by unrecoverable fatal errors, pose serious threats to the stability and reliability of ByteDance’s large-scale cloud services. Diagnosing such failures through log analysis is essential for identifying root causes and preventing recurrence. However, root cause analysis (RCA) for kernel panics faces two key challenges. First, only a small portion of logs explicitly indicate the kernel panic, making relevant signals difficult to extract. Second, complex, long-range dependencies exist across logs, making root causes difficult to pinpoint. To address these challenges, we propose LogSage, a novel log-based framework for kernel panic RCA in large-scale cloud environments. LogSage combines unsupervised clustering techniques with large language models (LLMs) to extract fault-indicating log snippets. It further employs a graph-based RCA module that integrates Graph Neural Networks (GNNs) for structured log representation and active learning for efficient label utilization. We evaluate LogSage on three real-world datasets. Experimental results show that LogSage achieves F1-scores of 92.2%, 95.3%, and 96.3%, respectively, outperforming the strongest baseline methods by 15.5%, 20.3%, and 20.1%, respectively. In addition, LogSage has been deployed in ByteDance’s cloud infrastructure for over six months, where it has successfully assisted engineers in real-world RCA tasks.
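
The first challenge named in the abstract (only a small share of log lines point to the panic) can be illustrated with a minimal sketch. This is not LogSage's implementation: simple template masking and frequency counting stand in for the unsupervised clustering step, and no LLM or GNN is involved. The function names `to_template` and `rare_snippets`, the masking rules, and the sample log lines are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical masking rules: replace variable fields so that lines differing
# only in addresses or counters collapse into the same template.
# The hex rule must run first, otherwise "0x1f" would become "<NUM>x<NUM>f".
_MASKS = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Return the line with hex addresses and decimal numbers masked."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line.strip()


def rare_snippets(lines, max_count=1):
    """Return the lines whose template occurs at most max_count times.

    Rare templates are treated as candidate fault-indicating snippets;
    this frequency heuristic is an assumption, not the paper's method.
    """
    templates = [to_template(line) for line in lines]
    counts = Counter(templates)
    return [line for line, tpl in zip(lines, templates) if counts[tpl] <= max_count]


# Made-up log lines for illustration only (not taken from the paper's datasets).
SAMPLE_LOGS = [
    "eth0: link up, speed 1000",
    "eth0: link up, speed 10000",
    "systemd[1]: Started session 41",
    "systemd[1]: Started session 42",
    "BUG: unable to handle kernel NULL pointer dereference at 0x0000000000000008",
    "systemd[1]: Started session 43",
    "Kernel panic - not syncing: Fatal exception",
]

if __name__ == "__main__":
    for line in rare_snippets(SAMPLE_LOGS):
        print(line)
```

On the sample lines, the routine link and session messages collapse into two frequent templates, so only the `BUG:` line and the `Kernel panic` line are returned. A frequency heuristic like this says nothing about the long-range dependencies between log lines, which is the second challenge the abstract raises.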

Keywords

kernel panic / root cause analysis / log analysis / large language model

Cite this article

Tianyu CUI, Yang ZHANG, Shenglin ZHANG, Xin WU, Yicheng SUI, Liangyan PENG, Yuhe JI, Feng WANG, Changchang LIU, Zeyu CHE, Xiaozhou LIU, Yongqian SUN, Yu ZHANG. From chaos to clarity: log-based kernel panic root cause analysis for large-scale cloud services. Front. Comput. Sci., 2027, 21(6): 2106206. DOI: 10.1007/s11704-025-50788-0

RIGHTS & PERMISSIONS

Higher Education Press
