Identifying viruses from metagenomic data using deep learning
Jie Ren, Kai Song, Chao Deng, Nathan A. Ahlgren, Jed A. Fuhrman, Yi Li, Xiaohui Xie, Ryan Poplin, Fengzhu Sun
Background: The recent development of metagenomic sequencing makes it possible to sequence microbial genomes, including viral genomes, at massive scale without the need for laboratory culture. Existing reference-based and gene homology-based methods are inefficient at identifying unknown viruses or short viral sequences in metagenomic data.
Methods: Here we developed a reference-free and alignment-free machine learning method, DeepVirFinder, for identifying viral sequences in metagenomic data using deep learning.
Results: Trained on viral RefSeq sequences discovered before May 2015 and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC values of 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences, respectively. Enlarging the training data with millions of additional purified viral sequences from metavirome samples further improved the accuracy for identifying under-represented virus groups. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found to be associated with cancer status, suggesting that viruses may play important roles in CRC.
Conclusions: Powered by deep learning and high-throughput metagenomic sequencing data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
metagenome / deep learning / virus identification / machine learning
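The Results report accuracy as AUROC, the probability that a randomly chosen viral sequence receives a higher score than a randomly chosen non-viral one. The following is a minimal sketch of how such a value could be computed from per-sequence scores, using the rank-sum (Mann-Whitney U) formulation. The function name `auroc` and the toy labels and scores are hypothetical illustrations; they are not DeepVirFinder output or part of its code.

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney U statistic; tied scores share an average rank.

    labels: 1 for viral (positive), 0 for non-viral (negative).
    scores: classifier scores, higher meaning "more likely viral".
    """
    pairs = sorted(zip(scores, labels))  # ascending by score
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative")

    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Hypothetical toy data: two non-viral and two viral sequences.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))  # 0.75
```

In this toy case three of the four viral/non-viral pairs are ranked correctly, which gives 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.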