Applications of species accumulation curves in large-scale biological data analysis

Chao Deng, Timothy Daley, Andrew Smith

PDF(667 KB)
PDF(667 KB)
Quant. Biol. ›› 2015, Vol. 3 ›› Issue (3) : 135-144. DOI: 10.1007/s40484-015-0049-7
RESEARCH ARTICLE
RESEARCH ARTICLE

Applications of species accumulation curves in large-scale biological data analysis

Author information +
History +

Abstract

The species accumulation curve, or collector’s curve, of a population gives the expected number of observed species or distinct classes as a function of sampling effort. Species accumulation curves allow researchers to assess and compare diversity across populations or to evaluate the benefits of additional sampling. Traditional applications have focused on ecological populations but emerging large-scale applications, for example in DNA sequencing, are orders of magnitude larger and present new challenges. We developed a method to estimate accumulation curves for predicting the complexity of DNA sequencing libraries. This method uses rational function approximations to a classical non-parametric empirical Bayes estimator due to Good and Toulmin [Biometrika, 1956, 43, 45–63]. Here we demonstrate how the same approach can be highly effective in other large-scale applications involving biological data sets. These include estimating microbial species richness, immune repertoire size, and k-mer diversity for genome assembly applications. We show how the method can be modified to address populations containing an effectively infinite number of species where saturation cannot practically be attained. We also introduce a flexible suite of tools implemented as an R package that make these methods broadly accessible.

Graphical abstract

Keywords

species accumulation curve / accumulation region / rational function approximation / immune repertoire / microbiome diversity / species richness

Cite this article

Download citation ▾
Chao Deng, Timothy Daley, Andrew Smith. Applications of species accumulation curves in large-scale biological data analysis. Quant. Biol., 2015, 3(3): 135‒144 https://doi.org/10.1007/s40484-015-0049-7

References

[1]
Magurran, A. E. (1988). Ecological Diversity and Its Measurement, 168, Princeton: Princeton University Press
[2]
Bunge, J. and Fitzpatrick, M. (1993) Estimating the number of species: A review. J. Am. Stat. Assoc., 88, 364–373
[3]
Colwell, R. K., Mao, C. X. and Chang, J. (2004) Interpolating, extrapolating, and comparing incidence-based species accumulation curves. Ecology, 85, 2717–2727
CrossRef Google scholar
[4]
Efron, B. and Thisted, R. (1976) Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63, 435–447
[5]
Ionita-Laza, I., Lange, C. and Laird, N. M. (2009) Estimating the number of unseen variants in the human genome. Proc. Natl. Acad. Sci. USA, 106, 5008–5013
CrossRef Google scholar
[6]
Hughes, J. B., Hellmann, J. J., Ricketts, T. H. and Bohannan, B. J. (2001) Counting the uncountable: Statistical approaches to estimating microbial diversity. Appl. Environ. Microbiol., 67, 4399–4406
CrossRef Google scholar
[7]
Laydon, D. J., Melamed, A., Sim, A., Gillet, N. A., Sim, K., Darko, S., Kroll, J. S., Douek, D. C., Price, D. A., Bangham, C. R., et al. (2014) Quantification of HTLV-1 clonality and TCR diversity. PLoS Comput. Biol., 10, e1003646
CrossRef Google scholar
[8]
Gotelli, N. J. and Colwell, R. K. (2001) Quantifying biodiversity: Procedures and pitfalls in the measurement and comparison of species richness. Ecol. Lett., 4, 379–391
CrossRef Google scholar
[9]
Colwell, R. K., Chao, A., Gotelli, N. J., Lin, S.-Y., Mao, C. X., Chazdon, R. L. and Longino, J. T. (2012) Models and estimators linking individual-based and sample-based rarefaction, extrapolation and comparison of assemblages. J. Plant Ecol., 5, 3–21
CrossRef Google scholar
[10]
Fisher, R. A., Corbet, A. S. and Williams, C. B. (1943) The relation between the number of species and the number of individuals in a random sample of an animal population. J. Anim. Ecol., 12, 42–58
CrossRef Google scholar
[11]
Bulmer, M. (1974) On fitting the Poisson lognormal distribution to species-abundance data. Biometrics, 30, 101–110
CrossRef Google scholar
[12]
Burrell, Q. L. and Fenton, M. R. (1993) Yes, the GIGP really does work – and is workable! J. Am. Soc. Inf. Sci., 44, 61–69
CrossRef Google scholar
[13]
Engen, S., (1978). Stochastic Abundance Models. London: Chapman and Hall
[14]
Norris, J. L. and Pollock, K. H. (1998) Non-parametric MLE for Poisson species abundance models allowing for heterogeneity between species. Environ. Ecol. Stat., 5, 391–402
CrossRef Google scholar
[15]
Wang, J.-P. Z. and Lindsay, B. G. (2005) A penalized nonparametric maximum likelihood approach to species richness estimation. J. Am. Stat. Assoc., 100, 942–959
CrossRef Google scholar
[16]
Mao, C. X., Colwell, R. K. and Chang, J. (2005) Estimating the species accumulation curve using mixtures. Biometrics, 61, 433–441
CrossRef Google scholar
[17]
Lindsay, B. G. (1983) The geometry of mixture likelihoods: A general theory. Ann. Stat., 11, 86–94
CrossRef Google scholar
[18]
Wang, J.-P. (2010) Estimating species richness by a Poisson-compound Gamma model. Biometrika, 97, 727–740
CrossRef Google scholar
[19]
Good, I. and Toulmin, G. (1956) The number of new species, and the increase in population coverage, when a sample is increased. Biometrika, 43, 45–63
CrossRef Google scholar
[20]
Keating, K. A., Quinn, J. F., Ivie, M. A. and Ivie, L. L. (1998) Estimating the effectiveness of further sampling in species inventories. Ecol. Appl., 8, 1239–1249
[21]
Daley, T. and Smith, A. D. (2013) Predicting the molecular complexity of sequencing libraries. Nat. Methods, 10, 325–327
CrossRef Google scholar
[22]
Daley, T. and Smith, A. D. (2014) Modeling genome coverage in single-cell sequencing. Bioinformatics, 30, 3159–3165
CrossRef Google scholar
[23]
Wang, J.-P. (2011) SPECIES: An R package for species richness estimation. J. Stat. Softw., 40, 1–15
[24]
Mao, C. X. and Lindsay, B. G. (2007) Estimating the number of classes. Ann. Stat., 35, 917–930
CrossRef Google scholar
[25]
Baker, G. and Graves-Morris, P. (1996). Padé Approximants (Encyclopedia of Mathematics and its Applications)2nd ed., London: Cambridge University Press
[26]
Baker, G. A. Jr. (2000) Defects and the convergence of Padé approximants. Acta Appl. Math., 61, 37–52
CrossRef Google scholar
[27]
Daley, T. P. (2014). Non-Parametric Models for Large Capture-Recapture Experiments with Applications to DNA Sequencing. Ph.D. thesis, University of Southern California
[28]
Heck, K. L. Jr, van Belle, G. and Simberloff, D. (1975) Explicit calculation of the rarefaction diversity measurement and the determination of sufficient sample size. Ecology, 56, 1459–1461
CrossRef Google scholar
[29]
Hsieh, T. C., Ma, K. H. and Chao, A. (2013). iNEXT online: interpolation and extrapola-tion [software].
[30]
Bunge, J., Willis, A. and Walsh, F. (2014) Estimating the number of species in microbial diversity studies. Annu. Rev. Stat. Appl., 1, 427–445
CrossRef Google scholar
[31]
Yatsunenko, T., Rey, F. E., Manary, M. J., Trehan, I., Dominguez-Bello, M. G., Contreras, M., Magris, M., Hidalgo, G., Baldassano, R. N., Anokhin, A. P., et al. (2012) Human gut microbiome viewed across age and geography. Nature, 486, 222–227
[32]
Meyer, F., Paarmann, D., D’Souza, M., Olson, R., Glass, E. M., Kubal, M., Paczian, T., Rodriguez, A., Stevens, R., Wilke, A., et al. (2008) The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics, 9, 386
CrossRef Google scholar
[33]
Britanova, O. V., Putintseva, E. V., Shugay, M., Merzlyak, E. M., Turchaninova, M. A., Staroverov, D. B., Bolotin, D. A., Lukyanov, S., Bogdanova, E. A., Mamedov, I. Z., et al. (2014) Age-related decrease in TCR repertoire diversity measured with deep and normalized sequence profiling. J. Immunol., 192, 2689–2698
CrossRef Google scholar
[34]
Wedderburn, L., Patel, A., Varsani, H. and Woo, P. (2001) The developing human immune system: T-cell receptor repertoire of children and young adults shows a wide discrepancy in the frequency of persistent oligoclonal T-cell expansions. Immunology, 102, 301–309
CrossRef Google scholar
[35]
Pevzner, P. A., Tang, H. and Waterman, M. S. (2001) An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA., 98, 9748–9753
CrossRef Google scholar
[36]
Compeau, P. E., Pevzner, P. A. and Tesler, G. (2011) How to apply de Bruijn graphs to genome assembly. Nat. Biotechnol., 29, 987–991
CrossRef Google scholar
[37]
Bradnam, K. R., Fass, J. N., Alexandrov, A., Baranay, P., Bechner, M., Birol, I., Boisvert, S., Chapman, J. A., Chapuis, G., Chikhi, R., et al. (2013) Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience, 2, 1–31
CrossRef Google scholar
[38]
Zerbino, D. R. and Birney, E. (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res., 18, 821–829
CrossRef Google scholar
[39]
Marçais, G. and Kingsford, C. (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27, 764–770
CrossRef Google scholar
[40]
Ren, J., Song, K., Deng, M., Reinert, G., Cannon, C. H. and Sun, F. (2015) Inference of markovian properties of molecular sequences from NGS data and applications to comparative genomics. Bioinformatics, doi: 10.1093/bioinformatics/btv395
[41]
Kroes, I., Lepp, P. W. and Relman, D. A. (1999) Bacterial diversity within the human subgingival crevice. Proc. Natl. Acad. Sci. USA., 96, 14547–14552
CrossRef Google scholar
[42]
Robins, H. S., Campregher, P. V., Srivastava, S. K., Wacher, A., Turtle, C. J., Kahsai, O., Riddell, S. R., Warren, E. H. and Carlson, C. S. (2009) Comprehensive assessment of T-cell receptor β-chain diversity in αβ T cells. Blood, 114, 4099–4107
CrossRef Google scholar
[43]
Colwell, R. K. and Coddington, J. A. (1994) Estimating terrestrial biodiversity through extrapolation. Philos. Trans. R. Soc. Lond. B Biol. Sci., 345, 101–118
CrossRef Google scholar

ACKNOWLEDGEMENTS

This work was funded by NIH grant R01 HG 007650. We would like to thank the Smith’s lab members, Micharl Waterman, Simon Tavaré, and Peter Calabrese for their helpful comments.
COMPLIANCE WITH ETHICS GUIDELINES
Chao Deng, Timothy Daley and Andrew Smith declare they have no conflict of interest.ƒThis article does not contain any studies with human or animal subjects performed by any of the authors.

RIGHTS & PERMISSIONS

2015 Higher Education Press and Springer-Verlag Berlin Heidelberg
AI Summary AI Mindmap
PDF(667 KB)

Accesses

Citations

Detail

Sections
Recommended

/