Identification of genomic regions distorting population structure inference in diverse continental groups
Qiuxuan Liu, Degang Wu, Chaolong Wang
Background: Inference of population structure is crucial for studies of human evolutionary history and for genome-wide association studies. While several genomic regions have been reported to distort population structure analysis of European populations, no systematic analysis has been performed on non-European continental groups or with the latest human genome assembly.
Methods: Using high-coverage whole-genome sequencing data from the 1000 Genomes Project for four major continental groups (Europe, East Asia, South Asia, and Africa), we developed a statistical framework and systematically detected genomic regions with unusual contributions to the inference of population structure in each continental group.
Results: We identified and characterized 27 unusual genomic regions mapped to GRCh38, comprising 13 regions around centromeres, 2 with chromosomal inversions, 8 under natural selection, and 4 with unknown causes. Excluding these regions yields a more interpretable population structure in both principal component analysis and ADMIXTURE analysis.
Conclusions: Unusual genomic patterns in certain regions can distort the inference of population structure. Our compiled list of these unusual regions will be useful for many population-genetic studies, including those of non-European populations.
Availability: The code to reproduce our results is available on GitHub (/dwuab/UnRegFinder).
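As a rough illustration of the kind of scan summarized in Methods, the Python sketch below computes how much each window of consecutive SNPs contributes to the top principal components and flags windows whose contribution is a robust outlier. It is not the authors' statistical framework, which is implemented in the repository named above. The windowing scheme, the MAD-based threshold, the simulated genotypes, and all function names are assumptions made for illustration only.

```python
import numpy as np


def window_pc_contributions(genotypes, n_pcs=2, window=50):
    """Sum squared SNP loadings of the top PCs within consecutive SNP windows."""
    # Center and scale each SNP column; guard against monomorphic SNPs.
    g = genotypes - genotypes.mean(axis=0)
    sd = g.std(axis=0)
    sd[sd == 0] = 1.0
    g = g / sd
    # Rows of vt are unit-length SNP loading vectors, ordered by variance explained.
    _, _, vt = np.linalg.svd(g, full_matrices=False)
    sq = vt[:n_pcs] ** 2
    # Trailing SNPs that do not fill a whole window are dropped.
    n_windows = genotypes.shape[1] // window
    sq = sq[:, : n_windows * window]
    return sq.reshape(n_pcs, n_windows, window).sum(axis=2)


def flag_windows(contrib, n_mads=10.0):
    """Return indices of windows whose contribution to any PC is a robust outlier."""
    med = np.median(contrib, axis=1, keepdims=True)
    mad = np.median(np.abs(contrib - med), axis=1, keepdims=True)
    mad[mad == 0] = np.finfo(float).eps
    return np.where(((contrib - med) / mad > n_mads).any(axis=0))[0]


def simulate(n=200, m=2000, block=(1000, 1050), seed=0):
    """Toy data: independent SNPs plus one block of tightly linked SNPs."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.1, 0.9, size=m)
    geno = rng.binomial(2, freqs, size=(n, m))
    # SNPs in the block copy one shared genotype (e.g. inversion status),
    # with 5% of entries replaced by independent draws.
    shared = rng.binomial(2, 0.3, size=n)
    width = block[1] - block[0]
    noisy = rng.random((n, width)) < 0.05
    random_fill = rng.binomial(2, 0.3, size=(n, width))
    geno[:, block[0]:block[1]] = np.where(noisy, random_fill, shared[:, None])
    return geno


if __name__ == "__main__":
    geno = simulate()
    contrib = window_pc_contributions(geno)
    print(flag_windows(contrib))  # the simulated block sits in window 20
```

The sketch shows why such regions matter: in the toy data, a block of tightly linked SNPs is built to dominate a top principal component even though it carries no information about ancestry.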
We propose a systematic analysis framework based on principal component analysis (PCA) to identify genomic regions that distort the inference of population structure. Using whole-genome sequencing data from the 1000 Genomes Project for four major continental groups with no recent admixture, we compile a list of 27 unusual genomic regions and demonstrate that excluding them leads to more interpretable population structure results. We recommend routinely removing these regions in analyses of population structure to avoid artifactual results.
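The recommended exclusion step can be sketched as a simple variant filter applied before PCA or ADMIXTURE. The function name and the region coordinates below are hypothetical placeholders, not entries from the compiled list of 27 regions; a real analysis would load the published GRCh38 coordinates instead.

```python
import numpy as np


def keep_mask(chroms, positions, regions):
    """Return a boolean mask that is True for variants outside every region.

    regions is an iterable of (chromosome, start, end) with inclusive bounds.
    """
    chroms = np.asarray(chroms)
    positions = np.asarray(positions)
    keep = np.ones(positions.shape[0], dtype=bool)
    for chrom, start, end in regions:
        inside = (chroms == chrom) & (positions >= start) & (positions <= end)
        keep &= ~inside
    return keep


if __name__ == "__main__":
    # Placeholder coordinates for illustration only (assumption, not from the paper).
    regions = [("chr1", 1000, 2000), ("chr2", 200, 400)]
    chroms = ["chr1", "chr1", "chr1", "chr2", "chr2"]
    positions = [500, 1500, 2500, 1500, 300]
    print(keep_mask(chroms, positions, regions))
```

The resulting mask can be used to subset the columns of a genotype matrix before any population structure analysis is run.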
Keywords: population genetics / population structure / linkage disequilibrium / principal component analysis / natural selection