Mango pangenome reveals dramatic impacts of reference bias on population genomic analyses

Bilal Ahmad, Ying Su, Yani Hao, Tayyaba Razzaq, Rida Arshad, Yi Zhang, Yingchun Zhang, Xingyi Wang, Guizhou Huang, Xiangnian Su, Ting Hou, Chaochao Li, Xuanwen Yang, Chuanning Li, Zhenzhou Chu, Qiuyan Wang, Yu Zhang, Zhongxin Jin, Qi Xu, Xiaodong Xu, Yanling Peng, Guiqi Bi, Chengjie Chen, Yeyuan Chen, Hua Xiao, Jianfeng Huang, Yongfeng Zhou, Xinmin Tian

Horticulture Research, 2025, Vol. 12, Issue 9: 166. DOI: 10.1093/hr/uhaf166

Abstract

Most genomic studies start by mapping sequencing data to a reference genome. The quality of the reference assembly, its genetic relatedness to the studied population, and the mapping method employed directly affect variant-calling accuracy and downstream genomic analyses; shortcomings in any of these can introduce reference bias and lead to erroneous conclusions. However, the impacts of reference bias have received limited attention. This study compared population genomic analyses using four different reference genomes of mango (Mangifera indica): the two haploid assemblies of a haplotype-resolved telomere-to-telomere (T2T) genome assembly, a pangenome, and an older version of the reference genome available on NCBI. The choice of reference genome dramatically affected mapping efficiency and resulted in notable differences in the genetic variants called, particularly structural variations (SVs). Phylogenetic analysis was more sensitive to the reference genome than analysis of genetic differentiation. Results of population genomic analyses of artificial selection during domestication and of SV hotspot regions varied across reference genomes. Notably, gene enrichment analyses showed significant differences in the top enriched biological processes depending on the reference genome used. Overall, the mango pangenome outperformed the other reference genomes across various metrics, followed by the T2T reference genomes, as they captured greater diversity and effectively reduced reference bias. Our findings highlight the role of the mango pangenome in reducing reference bias and underscore the critical role of reference genome selection, suggesting that it is one of the most important factors in population genomic studies.

Cite this article

Bilal Ahmad, Ying Su, Yani Hao, Tayyaba Razzaq, Rida Arshad, Yi Zhang, Yingchun Zhang, Xingyi Wang, Guizhou Huang, Xiangnian Su, Ting Hou, Chaochao Li, Xuanwen Yang, Chuanning Li, Zhenzhou Chu, Qiuyan Wang, Yu Zhang, Zhongxin Jin, Qi Xu, Xiaodong Xu, Yanling Peng, Guiqi Bi, Chengjie Chen, Yeyuan Chen, Hua Xiao, Jianfeng Huang, Yongfeng Zhou, Xinmin Tian. Mango pangenome reveals dramatic impacts of reference bias on population genomic analyses. Horticulture Research, 2025, 12(9): 166. DOI: 10.1093/hr/uhaf166

Acknowledgments

This work was supported by the National Natural Science Foundation of China (no. 32372662), the Science Fund Program for Distinguished Young Scholars of the National Natural Science Foundation of China (Overseas) to Y.Z., and the National Key Research and Development Program of China (nos. 2023YFF1000100 and 2023YFD2200700). It was also supported by the National Natural Science Foundation of China (no. 32360058), the Central Government Guides Local Science and Technology Development Projects, China (no. 2023ZYZX1224), and the earmarked fund for CARS (CARS-31).

Author Contributions

Y.Z., X.T., and J.H. designed research; B.A., Y.S., Y.H., T.R., R.A., Y.Z., and Y.Z. performed research; B.A., Y.S., X.W., G.H., X.S., T.H., C.L., X.Y., C.L., Z.C., Q.W., Y.Z., Y.C., and Z.J. analyzed data; and B.A., Y.S., Q.X., X.X., Y.P., H.X., G.B., C.C., and Y.Z. wrote the paper.

Data availability

Data have been deposited in NCBI and NGDC under the following BioProject accession numbers: PRJNA1218505, PRJNA1218506, PRJNA1218522, PRJCA035721, and PRJCA035722. Genome assemblies and annotations are also available on Zenodo (https://zenodo.org) under record number 15798809. All other data are included in the article and/or supporting information.

Conflict of interest statement

The authors declare no competing interest.

Supplementary data

Supplementary data is available at Horticulture Research online.
