TestBench: evaluating class-level test case generation capability of large language models
Quanjun ZHANG , Ye SHANG , Chunrong FANG , Siqi GU , Shengcheng YU , Jianyi ZHOU , Zhenyu CHEN
Front. Comput. Sci., 2027, Vol. 21, Issue (6): 2106202
Software testing is a crucial phase in the software life cycle, helping to identify potential risks and reduce maintenance costs. With the advancement of Large Language Models (LLMs), researchers have proposed an increasing number of LLM-based software testing techniques, particularly for test case generation. Despite this growing interest, few efforts have thoroughly evaluated the actual capabilities of LLMs on this task. In this paper, we introduce TestBench, a benchmark for class-level LLM-based test case generation. We construct a dataset of 108 Java programs from nine real-world, large-scale GitHub projects, each representing a different thematic domain. We then design three types of prompts based on context descriptions: self-contained context, full context, and simple context. In addition, we propose a fine-grained evaluation framework that considers five aspects of test cases: syntactic correctness, compilation correctness, test correctness, code coverage rate, and defect detection rate. Furthermore, we propose a heuristic algorithm to repair erroneous test cases generated by LLMs. We evaluate CodeLlama-13b, GPT-3.5, and GPT-4 on TestBench. Our experimental results indicate that larger models are better able to exploit contextual information and therefore generate higher-quality test cases. Smaller models may struggle with the noise introduced by the extensive information in the full context. However, with the simple context, a simplified version derived from the full context via abstract syntax tree analysis, the performance of these models improves significantly. Our analysis highlights current progress and pinpoints future directions for making models more effective at handling contextual information in test case generation.
test case generation / large language models / benchmarks / LLM4SE
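As a rough illustration of the five evaluation aspects named in the abstract, the sketch below aggregates per-test results into five summary rates. The record layout, the function name `summarize`, and the choice to measure defect detection as the fraction of seeded mutants killed are all assumptions made for this example; the abstract does not specify how TestBench computes each metric.

```python
from dataclasses import dataclass


@dataclass
class GeneratedTestResult:
    """Outcome of checking one generated test class (hypothetical layout)."""
    parses: bool           # syntactic correctness
    compiles: bool         # compilation correctness
    passes: bool           # test correctness: runs green on the focal class
    line_coverage: float   # fraction of focal-class lines covered, 0.0 to 1.0
    mutants_killed: int    # seeded defects detected (assumed mutation-based)
    mutants_total: int     # seeded defects available for the focal class


def summarize(results: list[GeneratedTestResult]) -> dict[str, float]:
    """Aggregate the five aspects over all generated tests.

    Assumption: coverage and defect detection are only meaningful for
    tests that parse, compile, and pass, so they are averaged over that
    subset rather than over every generated test.
    """
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    passing = [r for r in results if r.parses and r.compiles and r.passes]
    total_mutants = sum(r.mutants_total for r in passing)
    return {
        "syntax_rate": sum(r.parses for r in results) / n,
        "compile_rate": sum(r.parses and r.compiles for r in results) / n,
        "pass_rate": len(passing) / n,
        "mean_coverage": (
            sum(r.line_coverage for r in passing) / len(passing)
            if passing else 0.0
        ),
        "defect_detection_rate": (
            sum(r.mutants_killed for r in passing) / total_mutants
            if total_mutants else 0.0
        ),
    }
```

Reporting the stages as a funnel (parse, then compile, then pass) makes it visible where a model's tests fail, which is the kind of fine-grained view the abstract describes.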
Higher Education Press