Comprehensive simulation of metagenomic sequencing data with non-uniform sampling distribution
Shansong Liu, Kui Hua, Sijie Chen, Xuegong Zhang
Background: Metagenomic sequencing is a complex sampling procedure from unknown mixtures of many genomes. Metagenome data with known genome compositions are essential both for benchmarking bioinformatics software and for investigating how various factors influence the data. Compared with data from real microbiome samples or from a defined microbial mock community, data simulated with proper computational models serve this purpose better because they give more flexibility in controlling multiple factors.
Methods: We developed a non-uniform metagenomic sequencing simulation system (nuMetaSim) that mimics various factors in real metagenomic sequencing, so that the simulated data reflect multiple properties of real data under customizable parameter settings.
Results: We generated 9 comprehensive metagenomic datasets of different compositional complexity from 203 bacterial genomes and 2 archaeal genomes related to the human intestinal system.
Conclusion: The data can serve as benchmarks for comparing the performance of different methods in different situations, and the software package allows users to generate simulated data that better reflect the specific properties of their own scenarios.
Keywords: simulation / metagenomic sequencing data / non-uniform sampling / nuMetaSim
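The idea of non-uniform sampling named in the Methods can be illustrated with a minimal Python sketch. This is not the nuMetaSim implementation: the function names, the linear GC-content bias model, and the `gc_bias` parameter are all assumptions made here for illustration. The sketch picks a source genome in proportion to abundance times genome length, then picks a read start position with a weight that depends on the local sequence, so coverage is uneven along each genome.

```python
import random


def gc_fraction(seq):
    """Fraction of G or C bases in an uppercase sequence (0.0 if empty)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def start_weights(genome, read_len, gc_bias=2.0):
    """Hypothetical bias model: the weight of each read start position
    grows linearly with the GC fraction of the read window.
    With gc_bias=0.0 every position gets weight 1.0 (uniform sampling).
    Assumes len(genome) >= read_len."""
    n_starts = len(genome) - read_len + 1
    return [
        1.0 + gc_bias * gc_fraction(genome[i:i + read_len])
        for i in range(n_starts)
    ]


def simulate_reads(genomes, abundances, n_reads, read_len,
                   gc_bias=2.0, seed=0):
    """Draw error-free reads as (genome name, start, sequence) tuples.

    genomes:    dict mapping name -> uppercase DNA string
    abundances: dict mapping name -> relative abundance (any scale)
    """
    rng = random.Random(seed)
    names = list(genomes)
    # A longer or more abundant genome contributes more reads.
    genome_w = [abundances[n] * len(genomes[n]) for n in names]
    # Position weights are computed once per genome.
    pos_w = {n: start_weights(genomes[n], read_len, gc_bias) for n in names}
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=genome_w, k=1)[0]
        start = rng.choices(range(len(pos_w[name])),
                            weights=pos_w[name], k=1)[0]
        reads.append((name, start, genomes[name][start:start + read_len]))
    return reads


if __name__ == "__main__":
    toy = {"A": "ATATATATATGCGCGCGCGC", "B": "ATATATATAT"}
    for read in simulate_reads(toy, {"A": 0.7, "B": 0.3},
                               n_reads=5, read_len=5):
        print(read)
```

Setting `gc_bias=0.0` recovers uniform sampling within each genome, which gives a simple baseline for comparison. A realistic simulator would also model sequencing errors and quality scores, which this sketch leaves out.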