Abstract
Background: Metagenomic sequencing is a complex sampling procedure from unknown mixtures of many genomes. Metagenome data with known genome compositions are essential both for benchmarking bioinformatics software and for investigating the influence of various factors on the data. Compared with data from real microbiome samples or from defined microbial mock communities, data simulated with proper computational models are better suited to this purpose because they offer more flexibility in controlling multiple factors.
Methods: We developed a non-uniform metagenomic sequencing simulation system (nuMetaSim) that can mimic various factors in real metagenomic sequencing and, through customizable parameter settings, reflect multiple properties of real data.
Results: We generated 9 comprehensive metagenomic datasets of differing compositional complexity from 203 bacterial genomes and 2 archaeal genomes associated with the human intestinal system.
Conclusion: The data can serve as benchmarks for comparing the performance of different methods in different situations, and the software package allows users to generate simulated data that better reflect the specific properties of their own scenarios.
Keywords
simulation / metagenomic sequencing data / non-uniform sampling / nuMetaSim
Cite this article
Shansong Liu, Kui Hua, Sijie Chen, Xuegong Zhang.
Comprehensive simulation of metagenomic sequencing data with non-uniform sampling distribution.
Quant. Biol., 2018, 6(2): 175-185 DOI:10.1007/s40484-018-0142-9
RIGHTS & PERMISSIONS
Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature