From chaos to clarity: log-based kernel panic root cause analysis for large-scale cloud services
Tianyu CUI, Yang ZHANG, Shenglin ZHANG, Xin WU, Yicheng SUI, Liangyan PENG, Yuhe JI, Feng WANG, Changchang LIU, Zeyu CHE, Xiaozhou LIU, Yongqian SUN, Yu ZHANG
Front. Comput. Sci., 2027, Vol. 21, Issue (6): 2106206
Operating system (OS) kernel panics, which are triggered by unrecoverable fatal errors, pose serious threats to the stability and reliability of ByteDance’s large-scale cloud services. Diagnosing such failures through log analysis is essential for identifying root causes and preventing recurrence. However, root cause analysis (RCA) for kernel panics faces two key challenges. First, only a small portion of logs explicitly indicate the kernel panic, which makes relevant signals difficult to extract. Second, logs exhibit complex, long-range dependencies, which makes it difficult to pinpoint root causes. To address these challenges, we propose LogSage, a novel log-based framework for kernel panic RCA in large-scale cloud environments. LogSage combines unsupervised clustering techniques with large language models (LLMs) to extract fault-indicating log snippets. It further employs a graph-based RCA module that integrates Graph Neural Networks (GNNs) for structured log representation and active learning for efficient label utilization. We evaluate LogSage on three real-world datasets. Experimental results show that LogSage achieves F1-scores of 92.2%, 95.3%, and 96.3%, respectively, outperforming the strongest baseline methods by 15.5%, 20.3%, and 20.1%. In addition, LogSage has been deployed in ByteDance’s cloud infrastructure for over six months, where it has assisted engineers in real-world RCA tasks.
kernel panic / root cause analysis / log analysis / large language model
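The abstract notes that only a small portion of logs explicitly indicate the panic, and that LogSage uses unsupervised clustering to help surface fault-indicating snippets. The sketch below illustrates that general idea only; it is not the authors' method. It assumes a simple template-based grouping: variable fields are masked, lines sharing a template form a cluster, and lines from rare clusters are kept as candidates. The masking rules, the function names (`to_template`, `cluster_by_template`, `rare_snippets`), the `max_count` threshold, and the sample log lines are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical masking rules (an assumption, not taken from the paper):
# replace variable fields so that lines differing only in values share a template.
_MASKS = [
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),  # hexadecimal addresses
    (re.compile(r"\d+"), "<NUM>"),             # decimal numbers
]


def to_template(line):
    """Mask variable fields and normalize whitespace."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return " ".join(line.split())


def cluster_by_template(lines):
    """Group line indices by their masked template."""
    clusters = defaultdict(list)
    for idx, line in enumerate(lines):
        clusters[to_template(line)].append(idx)
    return dict(clusters)


def rare_snippets(lines, max_count=1):
    """Return lines whose template occurs at most max_count times, in log order."""
    clusters = cluster_by_template(lines)
    picked = sorted(
        i for idxs in clusters.values() if len(idxs) <= max_count for i in idxs
    )
    return [lines[i] for i in picked]


# Illustrative, made-up log lines (not from the paper's datasets).
sample_logs = [
    "eth0: link up at 1000 Mbps",
    "eth1: link up at 10000 Mbps",
    "usb 1-1: new device number 4",
    "usb 1-2: new device number 7",
    "BUG: unable to handle kernel NULL pointer dereference at 0x0000000000000010",
    "Kernel panic - not syncing: Fatal exception",
]

candidates = rare_snippets(sample_logs)
for line in candidates:
    print(line)
```

In this toy input, the routine network and USB messages collapse into two repeated templates, so only the two rare lines remain as candidates. A production system would likely need a more robust clustering method and a further filtering stage, such as the LLM step the abstract describes.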
Higher Education Press