Data preparation and quality for code-centric generative software engineering tasks: a systematic literature review
Shihao WENG, Yang FENG, Yining YIN, Zhenlun ZHANG, Baowen XU
Front. Comput. Sci., 2026, Vol. 20, Issue 9: 2009203
The rapid advancements in Deep Neural Networks (DNNs) have revolutionized generative software engineering tasks, including code summarization, program repair, code generation, and code translation. However, the performance of DNN models in these tasks heavily depends on the quality of their training and evaluation datasets. This systematic literature review examines 70 primary studies to comprehensively analyze dataset construction methodologies, prevalent data quality challenges, and solutions proposed to address these challenges. Our findings reveal that dataset construction processes significantly influence quality, with common issues such as noise, redundancy, imbalance, and insufficient granularity undermining model effectiveness. We identify key strategies to mitigate these problems, including data augmentation, automated cleaning techniques, and standardized validation frameworks. Furthermore, we highlight the critical role of dataset diversity and timeliness in improving model generalization. This study provides actionable insights for researchers and practitioners in the era of generative AI, where high-quality datasets are essential for developing reliable language models as software engineering tools. By emphasizing rigorous dataset curation and innovative quality assurance methods, our work bridges the gap between theoretical advancements and practical applications, enabling the creation of robust, generalizable models for real-world code-related tasks. The synthesized recommendations aim to guide future research in optimizing dataset design, fostering reproducibility, and addressing evolving challenges in data-driven software engineering.
systematic literature review / data quality / deep learning for software engineering
The Author(s) 2025. This article is published with open access at link.springer.com and journal.hep.com.cn