Applications of species accumulation curves in large-scale biological data analysis
Chao Deng, Timothy Daley, Andrew Smith
The species accumulation curve, or collector’s curve, of a population gives the expected number of observed species or distinct classes as a function of sampling effort. Species accumulation curves allow researchers to assess and compare diversity across populations or to evaluate the benefits of additional sampling. Traditional applications have focused on ecological populations, but emerging large-scale applications, for example in DNA sequencing, are orders of magnitude larger and present new challenges. We developed a method to estimate accumulation curves for predicting the complexity of DNA sequencing libraries. This method uses rational function approximations to a classical non-parametric empirical Bayes estimator due to Good and Toulmin [Biometrika, 1956, 43, 45–63]. Here we demonstrate that the same approach can be highly effective in other large-scale applications involving biological data sets, including the estimation of microbial species richness, immune repertoire size, and k-mer diversity for genome assembly. We show how the method can be modified to address populations containing an effectively infinite number of species, where saturation cannot practically be attained. We also introduce a flexible suite of tools, implemented as an R package, that makes these methods broadly accessible.
species accumulation curve / accumulation region / rational function approximation / immune repertoire / microbiome diversity / species richness
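As a rough illustration of the Good and Toulmin estimator mentioned in the abstract, the sketch below computes the counts histogram n_j (the number of species observed exactly j times) and the classical alternating series for the expected number of new species when the sample grows to t times its original size. The function names and the toy sample are hypothetical and are not taken from the authors' R package. The series is only dependable for 1 ≤ t ≤ 2, which is the limitation that motivates rational function approximations for longer extrapolations.

```python
from collections import Counter


def counts_histogram(labels):
    """Return {j: n_j}, where n_j is the number of species seen exactly j times."""
    per_species = Counter(labels)                # species -> observed count
    return dict(Counter(per_species.values()))   # count j -> number of species


def good_toulmin_new_species(hist, t):
    """Good-Toulmin series for the expected number of new species.

    hist maps j to n_j; t is the relative size of the enlarged sample
    (t = 2 means doubling the sampling effort). The alternating series
    sum_j (-1)^(j+1) * (t - 1)^j * n_j is restricted here to 1 <= t <= 2
    (an assumption of this sketch), since it becomes unstable beyond that.
    """
    if not 1.0 <= t <= 2.0:
        raise ValueError("this sketch only supports 1 <= t <= 2")
    return sum((-1) ** (j + 1) * (t - 1) ** j * n_j for j, n_j in hist.items())


# Toy sample (hypothetical): five species, eight observations.
sample = ["a", "a", "a", "b", "b", "c", "d", "e"]
hist = counts_histogram(sample)
observed = sum(hist.values())
predicted_new = good_toulmin_new_species(hist, 2.0)
print(f"observed species: {observed}, expected new species at t=2: {predicted_new}")
```

Evaluating `observed + good_toulmin_new_species(hist, t)` over a grid of t values between 1 and 2 traces a short extrapolated segment of the accumulation curve.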