Comprehensive simulation of metagenomic sequencing data with non-uniform sampling distribution

Shansong Liu; Kui Hua; Sijie Chen; Xuegong Zhang

doi:10.1007/s40484-018-0142-9


Quant. Biol. ›› 2018, Vol. 6 ›› Issue (2) : 175-185. DOI: 10.1007/s40484-018-0142-9

METHODOLOGY ARTICLE


Abstract

Background: Metagenomic sequencing is a complex sampling procedure from unknown mixtures of many genomes. Metagenome data with known genome compositions are essential both for benchmarking bioinformatics software and for investigating how various factors influence the data. Compared to data from real microbiome samples or from defined microbial mock communities, simulated data from proper computational models are better suited to this purpose because they offer more flexibility for controlling multiple factors.

Methods: We developed a non-uniform metagenomic sequencing simulation system (nuMetaSim) that is capable of mimicking various factors in real metagenomic sequencing to reflect multiple properties of real data with customizable parameter settings.

Results: We generated 9 comprehensive metagenomic datasets with different composition complexity from 203 bacterial genomes and 2 archaeal genomes related to the human intestinal system.

Conclusion: The data can serve as benchmarks for comparing the performance of different methods in different situations, and the software package allows users to generate simulation data that better reflect the specific properties of their own scenarios.
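As a rough illustration of what non-uniform sampling means in this setting, the sketch below draws reads from a small set of genomes so that both the choice of genome and the choice of start position are weighted rather than uniform. It is not the nuMetaSim implementation, whose models are not described on this page. The function names, the abundance-times-length genome weighting, and the GC-based position weight are all assumptions made for the example, and every genome is assumed to be at least as long as the read length.

```python
import random


def gc_weight(fragment):
    """Hypothetical bias: favour GC-rich fragments (illustrative only)."""
    gc = sum(base in "GC" for base in fragment) / len(fragment)
    return 0.1 + gc  # the 0.1 floor keeps every position reachable


def simulate_reads(genomes, abundances, n_reads, read_len,
                   position_weight=None, seed=0):
    """Return (genome_name, start, read) tuples sampled with replacement."""
    rng = random.Random(seed)
    names = list(genomes)
    # Assumption: chance of picking a genome ~ relative abundance x length.
    genome_w = [abundances[n] * len(genomes[n]) for n in names]
    # Precompute per-genome start positions and their sampling weights.
    starts, start_w = {}, {}
    for n in names:
        seq = genomes[n]
        starts[n] = list(range(len(seq) - read_len + 1))
        if position_weight is None:
            start_w[n] = None  # None means uniform sampling
        else:
            start_w[n] = [position_weight(seq[s:s + read_len])
                          for s in starts[n]]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=genome_w, k=1)[0]
        start = rng.choices(starts[name], weights=start_w[name], k=1)[0]
        reads.append((name, start, genomes[name][start:start + read_len]))
    return reads


if __name__ == "__main__":
    # Toy sequences, not real genomes.
    toy = {"genome_A": "ACGTACGTAC" * 5, "genome_B": "GGGGCCCCAT" * 5}
    toy_abundance = {"genome_A": 1.0, "genome_B": 3.0}
    for read in simulate_reads(toy, toy_abundance, 3, 8, gc_weight, seed=1):
        print(read)
```

Passing `position_weight=None` gives the uniform baseline, so the same function can be used to compare uniform and non-uniform sampling on identical inputs. Sequencing errors and quality scores are left out of this sketch.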


Keywords

simulation / metagenomic sequencing data / non-uniform sampling / nuMetaSim

Cite this article


Shansong Liu, Kui Hua, Sijie Chen, Xuegong Zhang. Comprehensive simulation of metagenomic sequencing data with non-uniform sampling distribution. Quant. Biol., 2018, 6(2): 175‒185 https://doi.org/10.1007/s40484-018-0142-9


ACKNOWLEDGEMENTS

We thank Dr. Hongfei Cui for her comments on the simulation design. This work is partially supported by the National Natural Science Foundation of China (Nos. 61673231 and 61721003).

COMPLIANCE WITH ETHICS GUIDELINES

The authors Shansong Liu, Kui Hua, Sijie Chen and Xuegong Zhang declare that they have no conflict of interests. This article does not contain any studies with human or animal subjects performed by any of the authors.

RIGHTS & PERMISSIONS

© 2018 Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature


Received: 08 Nov 2017
Revised: 07 Feb 2018
Accepted: 27 Feb 2018
Online First Date: 29 May 2018
Published: 11 Jun 2018
Issue Date: 11 Jun 2018
