RAGA: a reference-assisted genome assembly tool for efficient population-scale assembly

Ru-Peng Zhao, Yu-Hong Luo, Wen-Zhao Xie, Zu-Wen Zhou, Yong-Qing Qian, Si-Long Yuan, Dong-Ao Li, Jiana Li, Kun Lu, Xingtan Zhang, Jia-Ming Song, Ling-Ling Chen

Horticulture Research ›› 2025, Vol. 12 ›› Issue (11) : 207

DOI: 10.1093/hr/uhaf207

Abstract

High-quality reference genomes at the population scale are fundamental for advancing pan-genomic research. However, high-quality genome assembly at the population scale is costly and time-consuming. To overcome these limitations, we developed Reference-Assisted Genome Assembly (RAGA), a hybrid computational tool that combines de novo and reference-based assembly approaches. RAGA efficiently employs existing reference genomes from the same or closely related species in combination with PacBio HiFi reads to produce high-quality alternative long sequences. These sequences can be integrated with de novo assemblies to improve assembly quality across population-scale datasets. The performance of RAGA across various plant genomes demonstrated its ability to reduce the number of contigs, decrease gaps, and correct genome assembly errors. The implementation of RAGA (available at https://github.com/wzxie/RAGA) significantly streamlines population-scale genome assembly workflows, providing a robust foundation for comprehensive pan-genomic investigations. This tool represents a substantial advancement in making large-scale genomic studies more accessible and efficient.

Cite this article

Ru-Peng Zhao, Yu-Hong Luo, Wen-Zhao Xie, Zu-Wen Zhou, Yong-Qing Qian, Si-Long Yuan, Dong-Ao Li, Jiana Li, Kun Lu, Xingtan Zhang, Jia-Ming Song, Ling-Ling Chen. RAGA: a reference-assisted genome assembly tool for efficient population-scale assembly. Horticulture Research, 2025, 12(11): 207. DOI: 10.1093/hr/uhaf207

Acknowledgments

We thank all labmates in the Chen lab for their generous help. This work was supported by the Guangxi Science and Technology Major Program (guikeAA23062085), the National Natural Science Foundation of China (NSFC) (U24A20369, 32270712, and 32100526), the Guangxi Natural Science Foundation (2024GXNSFGA010003), the Fundamental Research Funds for the Central Universities (SWU-KR24030), the State Key Laboratory for Conservation and Utilization of Subtropical Agro-Bioresources (SKLCUSA-a202306), the Young Elite Scientists Sponsorship Program by CAST (2022QNRC001), and the talent and technology plan project of Bama County (No. 20220008).

Author contributions

L.-L.C. and J.-M.S. conceived and designed the study. R.-P.Z., Y.-H.L., W.-Z.X., Z.-W.Z., Y.-Q.Q., S.-L.Y., D.-A.L., J.L., K.L., and X.Z. participated in data analysis and substantively revised the manuscript.

Data availability

The entire code and usage instructions for RAGA can be obtained at (https://github.com/wzxie/RAGA). All assemblies and raw reads used in this study are available in public databases. The 32 A. thaliana genome assemblies can be accessed at (https://figshare.com/articles/dataset/32_ecotypes_Arabidopsis_thaliana_genomes_gene_annotation_pan-TE_library_graph_pan-genome_gene_family_and_gene_presence_absence_matrices_files_/21673895), and the corresponding raw reads can be found at PRJCA012695 in the National Genomics Data Center (NGDC) [17, 68]. The PacBio HiFi reads for the additional 48 A. thaliana samples can be accessed in EMBL-ENA under the accession number PRJEB62038 [18]. The genome assemblies for these samples can be accessed in the National Center for Biotechnology Information (NCBI) under the accession number PRJNA1033522 [18]. The published T2T A. thaliana genome is available at GWHBDNP00000000.1 in NGDC, at (https://github.com/schatzlab/Col-CEN), and at PRJCA007112 in NGDC [20-22].

The assembly and raw reads for rice MH63 can be accessed at CP054676-CP054688, SRX6957825, SRX6908794, SRX6716809, and SRR13285939 in NCBI [24]. The published T2T genome for rice is available at CP056052-CP056064 and CP054676-CP054688 in NCBI, and at PRJCA018610 and PRJCA008812 in NGDC [24,25]. The PacBio HiFi reads for Xiangling628S (XL628S), Longke638S (LK638S), and Jing4155S (J4155S) can be obtained from PRJCA008812 in NGDC [26,33].

The raw reads and assembly for soybean ZH13 can be found at PRJCA015269 and GWHBWDJ00000000.1 in NGDC [33]. The assemblies for soybean Williams 82 and Jack are available at PRJNA975879 and PRJNA701655 in NCBI [36,37].

The T2T genome for S. rufipilum can be found at PRJCA014818 in NGDC [56]. The raw reads for the S. spontaneum genome are available at PRJNA721787 in NCBI [53]. The raw reads for the E. colona genome are available at PRJCA003883 in NGDC [52].

The assembly and raw reads for P. communis can be accessed at PRJNA992953 in NCBI [28]. The assembly and raw reads for P. vulgaris cv are available at PRJNA931244 in NCBI [29]. The raw reads for C. australis can be found at PRJNA910964 in NCBI, and the assembly is available at PRJCA013889 [32]. The raw reads and assembly for E. peplus [30] are available at PRJNA837952 in NCBI. The assembly and raw reads for M. acuminata [31] can be accessed at PRJNA1017453 in NCBI. The T2T genome for YunhongNO.1 can be found at (http://pyrusgdb.sdau.edu.cn/) [57]. The T2T lemon genome is available at GWHCBFQ00000000.1 in the CNCB Genome Warehouse [58]. The cassava XX048 T2T genome can be accessed at PRJCA016162 in NGDC [59], and the banana Cavendish T2T genome is available at PRJNA957115 in NCBI [60].

The raw reads and assembly for kiwifruit ‘Hongyang’ [34] can be obtained from the NCBI Sequence Read Archive (SRA) under the accession number PRJNA869178. The genome of kiwifruit ‘Donghong’ is available at NGDC under the BioProject ID PRJCA014123. The genome and raw reads for human ‘HG002’ can be accessed via https://github.com/marbl/hg002, and the genome of human ‘CHM13’ is available at https://github.com/marbl/CHM13.

Conflict of interest statement

The authors declare that they have no competing interests.

Supplementary Data

Supplementary data is available at Horticulture Research online.

References

[1]

Shi J, Tian Z, Lai J. et al. Plant pan-genomics and its applications. Mol Plant. 2023; 16:168-86

[2]

Wang S, Qian Y-Q, Zhao R-P. et al. Graph-based pan-genomes: increased opportunities in plant genomics. J Exp Bot. 2022; 74:24-39

[3]

Liao W-W, Asri M, Ebler J. et al. A draft human pangenome reference. Nature. 2023; 617:312-24

[4]

Zhang Y, Zhao M, Tan J. et al. Telomere-to-telomere Citrullus super-pangenome provides direction for watermelon breeding. Nat Genet. 2024; 56:1750-61

[5]

Liu Z, Zhang C, He J. et al. plantGIR: a genomic database of plants. Hortic Res. 2024;11:uhae342

[6]

Liu J, Huang C, Xing D. et al. The genomic database of fruits: a comprehensive fruit information database for comparative and functional genomic studies. Agric Commun. 2024; 2:100041

[7]

Liu Z, Shen S, Li C. et al. SoIR: a comprehensive Solanaceae information resource for comparative and functional genomic study. Nucleic Acids Res. 2025;53:D1623-32

[8]

Rautiainen M, Nurk S, Walenz BP. et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat Biotechnol. 2023; 41:1474-82

[9]

Cheng H, Asri M, Lucas J. et al. Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph. Nat Methods. 2024; 21:967-70

[10]

Espinosa E, Bautista R, Larrosa R. et al. Advancements in long-read genome sequencing technologies and algorithms. Genomics. 2024; 116:110842

[11]

Huang Y, Wang Z, Schmidt MA. et al. DEGAP: dynamic elongation of a genome assembly path. Brief Bioinform. 2024;25:bbae194

[12]

Xu M, Guo L, Gu S. et al. TGS-GapCloser: a fast and accurate gap closer for large genomes with low coverage of error-prone long reads. GigaScience. 2020;9:giaa094

[13]

Hu J, Wang Z, Liang F. et al. NextPolish2: a repeat-aware polishing tool for genomes assembled using HiFi long reads. Genom Proteom Bioinform. 2024;22:qzad009

[14]

Alonge M, Soyk S, Ramakrishnan S. et al. RaGOO: fast and accurate reference-guided scaffolding of draft genomes. Genome Biol. 2019; 20:224

[15]

Alonge M, Lebeigle L, Kirsche M. et al. Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome Biol. 2022; 23:258

[16]

Lin Y, Ye C, Li X. et al. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Hortic Res. 2023;10:uhad127

[17]

Kang M, Wu H, Liu H. et al. The pan-genome and local adaptation of Arabidopsis thaliana. Nat Commun. 2023; 14:6259

[18]

Lian Q, Huettel B, Walkemeier B. et al. A pan-genome of 69 Arabidopsis thaliana accessions reveals a conserved genome structure throughout the global species range. Nat Genet. 2024; 56:982-91

[19]

Qin P, Lu H, Du H. et al. Pan-genome analysis of 33 genetically diverse rice accessions reveals hidden genomic variations. Cell. 2021; 184:3542-3558.e16

[20]

Wang B, Yang X, Jia Y. et al. High-quality Arabidopsis thaliana genome assembly with Nanopore and HiFi long reads. Genom Proteom Bioinform. 2022; 20:4-13

[21]

Naish M, Alonge M, Wlodzimierz P. et al. The genetic and epigenetic landscape of the Arabidopsis centromeres. Science. 2021;374:eabi7489

[22]

Hou X, Wang D, Cheng Z. et al. A near-complete assembly of an Arabidopsis thaliana genome. Mol Plant. 2022; 15:1247-50

[23]

Manni M, Berkeley MR, Seppey M. et al. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol Biol Evol. 2021; 38:4647-54

[24]

Song J-M, Xie W-Z, Wang S. et al. Two gap-free reference genomes and a global view of the centromere architecture in rice. Mol Plant. 2021; 14:1757-67

[25]

Shang L, He W, Wang T. et al. A complete assembly of the rice Nipponbare reference genome. Mol Plant. 2023; 16:1232-6

[26]

Zhang Y, Fu J, Wang K. et al. The telomere-to-telomere gap-free genome of four rice parents reveals SV and PAV patterns in hybrid rice breeding. Plant Biotechnol J. 2022; 20:1642-4

[27]

Ono Y, Hamada M, Asai K. PBSIM3: a simulator for all types of PacBio and ONT long reads. NAR Genom Bioinform. 2022;4:lqac092

[28]

Yocca A, Akinyuwa M, Bailey N. et al. A chromosome-scale assembly for ‘d’Anjou’ pear. G3. 2024;14:jkae003

[29]

Carrère S, Mayjonade B, Lalanne D. et al. First whole genome assembly and annotation of a European common bean cultivar using PacBio HiFi and Iso-Seq data. Data in Brief. 2023; 48:109182

[30]

Johnson AR, Yue Y, Carey SB. et al. Chromosome-level genome assembly of Euphorbia peplus, a model system for plant latex, reveals that relative lack of Ty3 transposons contributed to its small genome size. Genome Biol and Evol. 2023;15:evad018

[31]

Li X, Yu S, Cheng Z. et al. Origin and evolution of the triploid cultivated banana genome. Nat Genet. 2024; 56:136-42

[32]

Nakandala U, Masouleh AK, Smith MW. et al. Haplotype resolved chromosome level genome assembly of Citrus australis reveals disease resistance and other citrus specific genes. Hortic Res. 2023;10:uhad058

[33]

Zhang C, Xie L, Yu H. et al. The T2T genome assembly of soybean cultivar ZH13 and its epigenetic landscapes. Mol Plant. 2023; 16:1715-8

[34]

Yue J, Chen Q, Wang Y. et al. Telomere-to-telomere and gap-free reference genome assembly of the kiwifruit Actinidia chinensis. Hortic Res. 2022;10:uhac264

[35]

Rhie A, Nurk S, Cechova M. et al. The complete sequence of a human Y chromosome. Nature. 2023; 621:344-54

[36]

Wang L, Zhang M, Li M. et al. A telomere-to-telomere gap-free assembly of soybean genome. Mol Plant. 2023; 16:1711-4

[37]

Huang Y, Koo DH, Mao Y. et al. A complete reference genome for the soybean cv. Jack. Plant Commun. 2024; 5:100765

[38]

Nurk S, Koren S, Rhie A. et al. The complete sequence of a human genome. Science. 2022; 376:44-53

[39]

Jarvis ED, Formenti G, Rhie A. et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature. 2022; 611:519-31

[40]

Benoit M, Jenike KM, Satterlee JW. et al. Solanum pan-genetics reveals paralogues as contingencies in crop engineering. Nature. 2025; 640:135-45

[41]

Cheng H, Jarvis ED, Fedrigo O. et al. Haplotype-resolved assembly of diploid genomes without parental data. Nat Biotechnol. 2022; 40:1332-5

[42]

Luan T, Cepeda V, Liu B. et al. MetaCompass: reference-guided assembly of metagenomes. ArXiv. 2024;Mar 3:arXiv:2403.01578v1

[43]

Liu S, Li K, Dai X. et al. A telomere-to-telomere genome assembly coupled with multi-omic data provides insights into the evolution of hexaploid bread wheat. Nat Genet. 2025; 57:1008-20

[44]

Song Y, Peng Y, Liu L. et al. Phased gap-free genome assembly of octoploid cultivated strawberry illustrates the genetic and epigenetic divergence among subgenomes. Hortic Res. 2023;11:uhad252

[45]

Li H. New strategies to improve minimap2 alignment accuracy. Bioinformatics. 2021; 37:4572-4

[46]

Vaser R, Sović I, Nagarajan N. et al. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 2017; 27:737-46

[47]

Marçais G, Delcher AL, Phillippy AM. et al. MUMmer4: a fast and versatile genome alignment system. PLoS Comput Biol. 2018; 14:e1005944

[48]

Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010; 26:841-2

[49]

Danecek P, Bonfield JK, Liddle J. et al. Twelve years of SAMtools and BCFtools. GigaScience. 2021;10:giab008

[50]

Chen S, Zhou Y, Chen Y. et al. Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34:i884-90

[51]

De Coster W, D’Hert S, Schultz DT. et al. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics. 2018; 34:2666-9

[52]

Wu D, Shen E, Jiang B. et al. Genomic insights into the evolution of Echinochloa species as weed and orphan crop. Nat Commun. 2022; 13:689

[53]

Zhang Q, Qi Y, Pan H. et al. Genomic insights into the recent chromosome reduction of autopolyploid sugarcane Saccharum spontaneum. Nat Genet. 2022; 54:885-96

[54]

Shen W, Le S, Li Y. et al. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q file manipulation. PLoS One. 2016; 11:e0163962

[55]

Han X, Zhang Y, Zhang Q. et al. Two haplotype-resolved, gap-free genome assemblies for Actinidia latifolia and Actinidia chinensis shed light on the regulatory mechanisms of vitamin C and sucrose metabolism in kiwifruit. Mol Plant. 2023; 16:452-70

[56]

Wang T, Wang B, Hua X. et al. A complete gap-free diploid genome in Saccharum complex and the genomic footprints of evolution in the highly polyploid Saccharum genus. Nat Plants. 2023; 9:554-71

[57]

Sun M, Yao C, Shu Q. et al. Telomere-to-telomere pear (Pyrus pyrifolia) reference genome reveals segmental and whole genome duplication driving genome evolution. Hortic Res. 2023;10:uhad201

[58]

Bao Y, Zeng Z, Yao W. et al. A gap-free and haplotype-resolved lemon genome provides insights into flavor synthesis and Huanglongbing (HLB) tolerance. Hortic Res. 2023;10:uhad020

[59]

Xu X-D, Zhao R-P, Xiao L. et al. Telomere-to-telomere assembly of cassava genome reveals the evolution of cassava and divergence of allelic expression. Hortic Res. 2023;10:uhad200

[60]

Huang H-R, Liu X, Arshad R. et al. Telomere-to-telomere haplotype-resolved reference genome reveals subgenome divergence and disease resistance in triploid Cavendish banana. Hortic Res. 2023;10:uhad153

[61]

Rhie A, Walenz BP, Koren S. et al. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 2020; 21:245

[62]

Li K, Xu P, Wang J. et al. Identification of errors in draft genome assemblies at single-nucleotide resolution for quality assessment and improvement. Nat Commun. 2023; 14:6556

[63]

Zhou ZW, Yu ZG, Huang XM. et al. GenomeSyn: a bioinformatics tool for visualizing genome synteny and structural variations. J Genet Genomics. 2022; 49:1174-6

[64]

Robinson JT, Thorvaldsdottir H, Turner D. et al. igv.js: an embeddable JavaScript implementation of the integrative genomics viewer (IGV). Bioinformatics. 2022;39:btac830

[65]

Yu W, Luo H, Yang J. et al. Comprehensive assessment of 11 de novo HiFi assemblers on complex eukaryotic genomes and metagenomes. Genome Res. 2024; 34:326-40

[66]

Camacho C, Coulouris G, Avagyan V. et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009; 10:421

[67]

Goel M, Sun H, Jiao W-B. et al. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 2019; 20:277

[68]

CNCB-NGDC Members and Partners. Database resources of the National Genomics Data Center, China National Center for Bioinformation in 2023. Nucleic Acids Res. 2022;51:D18-28
