Data preparation and quality for code-centric generative software engineering tasks: a systematic literature review

Shihao WENG , Yang FENG , Yining YIN , Zhenlun ZHANG , Baowen XU

Front. Comput. Sci. ›› 2026, Vol. 20 ›› Issue (9) : 2009203

REVIEW ARTICLE

Abstract

The rapid advancements in Deep Neural Networks (DNNs) have revolutionized generative software engineering tasks, including code summarization, program repair, code generation, and code translation. However, the performance of DNN models in these tasks heavily depends on the quality of their training and evaluation datasets. This systematic literature review examines 70 primary studies to comprehensively analyze dataset construction methodologies, prevalent data quality challenges, and solutions proposed to address these challenges. Our findings reveal that dataset construction processes significantly influence quality, with common issues such as noise, redundancy, imbalance, and insufficient granularity undermining model effectiveness. We identify key strategies to mitigate these problems, including data augmentation, automated cleaning techniques, and standardized validation frameworks. Furthermore, we highlight the critical role of dataset diversity and timeliness in improving model generalization. This study provides actionable insights for researchers and practitioners in the era of generative AI, where high-quality datasets are essential for developing reliable language models as software engineering tools. By emphasizing rigorous dataset curation and innovative quality assurance methods, our work bridges the gap between theoretical advancements and practical applications, enabling the creation of robust, generalizable models for real-world code-related tasks. The synthesized recommendations aim to guide future research in optimizing dataset design, fostering reproducibility, and addressing evolving challenges in data-driven software engineering.
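The automated cleaning techniques summarized above can be illustrated with a minimal sketch. This is not code from any surveyed study; the function names and filtering heuristics (whitespace-normalized deduplication, empty or placeholder comments, trivially short snippets) are hypothetical examples of the noise, redundancy, and granularity filters commonly applied to code–comment pair datasets:

```python
import hashlib

def normalize(code: str) -> str:
    """Collapse whitespace so formatting-only variants hash identically."""
    return " ".join(code.split())

def clean_pairs(pairs):
    """Illustrative dataset cleaning over (code, comment) pairs.

    Drops exact duplicates (after whitespace normalization), empty or
    placeholder comments, and very short code snippets. The thresholds
    are arbitrary examples, not values from the surveyed literature.
    """
    seen = set()
    cleaned = []
    for code, comment in pairs:
        key = hashlib.sha256(normalize(code).encode()).hexdigest()
        if key in seen:                                  # redundancy: exact duplicate
            continue
        if not comment.strip():                          # noise: missing documentation
            continue
        if comment.strip().lower().startswith("todo"):   # noise: placeholder comment
            continue
        if len(code.split()) < 4:                        # granularity: trivial snippet
            continue
        seen.add(key)
        cleaned.append((code, comment))
    return cleaned

pairs = [
    ("def add(a, b):\n    return a + b", "Add two numbers."),
    ("def add(a,  b):   return a + b", "Add two numbers."),  # whitespace-only duplicate
    ("def sub(a, b):\n    return a - b", ""),                # empty comment
    ("x = a + b + c", "TODO: document"),                     # placeholder comment
]
print(len(clean_pairs(pairs)))  # only the first pair survives
```

Real pipelines in the surveyed studies layer further steps on top of such filters (near-duplicate detection, auto-generated-comment detection, language identification), but the structure is the same: a sequence of cheap, auditable predicates applied per example.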


Keywords

systematic literature review / data quality / deep learning for software engineering

Cite this article

Shihao WENG, Yang FENG, Yining YIN, Zhenlun ZHANG, Baowen XU. Data preparation and quality for code-centric generative software engineering tasks: a systematic literature review. Front. Comput. Sci., 2026, 20(9): 2009203 DOI:10.1007/s11704-025-41376-3



RIGHTS & PERMISSIONS

© The Author(s) 2025. This article is published with open access at link.springer.com and journal.hep.com.cn
