TestBench: evaluating class-level test case generation capability of large language models

Quanjun ZHANG; Ye SHANG; Chunrong FANG; Siqi GU; Shengcheng YU; Jianyi ZHOU; Zhenyu CHEN

doi:10.1007/s11704-025-50078-9

Front. Comput. Sci., 2027, Vol. 21, Issue 6: 2106202. DOI: 10.1007/s11704-025-50078-9

Software

RESEARCH ARTICLE

Abstract

Software testing is a crucial phase in the software life cycle, helping to identify potential risks and reduce maintenance costs. With the advancement of Large Language Models (LLMs), researchers have proposed an increasing number of LLM-based software testing techniques, particularly for test case generation. Despite this growing interest, little effort has been made to thoroughly evaluate the actual capabilities of LLMs on this task. In this paper, we introduce TestBench, a benchmark for class-level LLM-based test case generation. We construct a dataset of 108 Java programs from nine real-world, large-scale GitHub projects, each representing a different thematic domain. We then design three types of prompts that differ in the context they describe: self-contained context, full context, and simple context. We also propose a fine-grained evaluation framework that considers five aspects of test cases: syntactic correctness, compilation correctness, test correctness, code coverage rate, and defect detection rate. Furthermore, we propose a heuristic algorithm to repair erroneous test cases generated by LLMs. We evaluate CodeLlama-13b, GPT-3.5, and GPT-4 on TestBench. Our experimental results indicate that larger models are better able to exploit contextual information and therefore generate higher-quality test cases. Smaller models may struggle with the noise introduced by the extensive information in the full context. However, when given the simple context, a simplified version derived from the full context via abstract syntax tree analysis, the performance of these models improves significantly. Our analysis highlights current progress and pinpoints future directions for further improving model effectiveness through better handling of contextual information in test case generation.

Keywords

test case generation / large language models / benchmarks / LLM4SE

Cite this article

Quanjun ZHANG, Ye SHANG, Chunrong FANG, Siqi GU, Shengcheng YU, Jianyi ZHOU, Zhenyu CHEN. TestBench: evaluating class-level test case generation capability of large language models. Front. Comput. Sci., 2027, 21(6): 2106202. DOI: 10.1007/s11704-025-50078-9
