
Identification of genomic regions distorting population structure inference in diverse continental groups
Qiuxuan Liu, Degang Wu, Chaolong Wang
Quant. Biol., 2022, Vol. 10, Issue 3: 287-298.
Background: Inference of population structure is crucial for studies of human evolutionary history and for genome-wide association studies. While several genomic regions have been reported to distort population structure analysis of European populations, no systematic analysis has been performed for non-European continental groups or with the latest human genome assembly.
Methods: Using the 1000 Genomes Project high-coverage whole-genome sequencing data from four major continental groups (Europe, East Asia, South Asia, and Africa), we developed a statistical framework and systematically detected genomic regions with unusual contributions to the inference of population structure in each continental group.
Results: We identified and characterized 27 unusual genomic regions mapped to GRCh38, including 13 regions around centromeres, 2 with chromosomal inversions, 8 under natural selection, and 4 with unknown causes. Excluding these regions results in a more interpretable population structure inferred by principal component analysis and ADMIXTURE analysis.
Conclusions: Unusual genomic patterns in certain regions can distort the inference of population structure. Our compiled list of these unusual regions will be useful for many population-genetic studies, including those of non-European populations.
Availability: The code to reproduce our results is available on GitHub (/dwuab/UnRegFinder).
We propose a systematic analysis framework based on principal component analysis (PCA) to identify such genomic regions. Based on whole-genome sequencing data from four major continental groups with no recent admixture from the 1000 Genomes Project, we compile a list of 27 unusual genomic regions and demonstrate that excluding these regions can lead to more interpretable population structure results. We recommend routinely removing these regions in the analysis of population structure to avoid artifactual results.
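To make the idea of a PCA-based scan concrete, the following is a minimal sketch, not the statistical framework used in the paper. It assumes a complete genotype matrix of 0/1/2 allele counts, computes SNP loadings on the top principal components, averages the squared loadings in fixed-size windows, and flags windows that are high outliers by a median/MAD rule. The function names (`snp_loadings`, `window_scores`, `flag_windows`), the window statistic, and the cutoff are illustrative assumptions only.

```python
import numpy as np


def snp_loadings(genotypes, n_pcs=2):
    """Return per-SNP loadings on the top principal components.

    genotypes: (n_samples, n_snps) array of 0/1/2 allele counts, no missing data.
    """
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0
    polymorphic = (freq > 0) & (freq < 1)
    z = np.zeros_like(g)
    # Standardize each polymorphic SNP by its binomial mean and standard deviation;
    # monomorphic SNPs stay at zero and therefore receive zero loading.
    p = freq[polymorphic]
    z[:, polymorphic] = (g[:, polymorphic] - 2 * p) / np.sqrt(2 * p * (1 - p))
    # Rows of vt are unit-length SNP loading vectors, ordered by variance explained.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return vt[:n_pcs].T


def window_scores(loadings, positions, window_size):
    """Mean squared loading (summed over PCs) in fixed-size positional windows."""
    window_ids = np.asarray(positions) // window_size
    per_snp = (np.asarray(loadings) ** 2).sum(axis=1)
    ids = np.unique(window_ids)
    scores = np.array([per_snp[window_ids == w].mean() for w in ids])
    return ids, scores


def flag_windows(ids, scores, z_cutoff=5.0):
    """Flag windows whose score is a high outlier by a median/MAD z-score."""
    median = np.median(scores)
    # 1.4826 rescales the MAD to match a standard deviation under normality.
    mad = np.median(np.abs(scores - median)) * 1.4826
    robust_z = (scores - median) / max(mad, 1e-12)
    return ids[robust_z > z_cutoff]
```

In practice the flagged windows would be merged into regions and the SNPs inside them dropped before rerunning PCA or ADMIXTURE. The window size and cutoff here are arbitrary choices and would need tuning on real data.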
Keywords: population genetics / population structure / linkage disequilibrium / principal component analysis / natural selection