WEDeepT3: predicting type III secreted effectors based on word embedding and deep learning
Xiaofeng Fu, Yang Yang
Background: Type III secreted effectors (T3SEs) are proteins that are indispensable to the growth and reproduction of Gram-negative bacteria. In particular, the pathogenesis of Gram-negative bacteria depends on T3SEs: by injecting them into a host cell, the bacterium can destroy the host cell’s immunity. The high diversity of T3SE sequences and the lack of defined secretion signals make T3SEs difficult to identify and predict, and the study of the pathological systems associated with T3SEs remains a hot topic in bioinformatics. Several computational tools have been developed to meet the growing demand for T3SE recognition and for studies of type III secretion systems (T3SSs). Although these tools can assist certain steps of biological experiments, there is still room for improvement, even for the current best model, because existing methods rely on hand-designed features and traditional machine learning methods.
Methods: In this study, we propose WEDeepT3, a predictor based on deep learning. Our work consists of three key steps. First, we train word embedding vectors for protein sequences on a large-scale amino acid sequence database. Second, we combine the word vectors with traditional features extracted from protein sequences, such as the position-specific scoring matrix (PSSM), to construct a more comprehensive feature representation. Finally, we construct a deep neural network model to predict type III secreted effectors.
Results: The feature representation of WEDeepT3 consists of both word embeddings and position-specific features. Combined with convolutional neural networks, the new model outperforms state-of-the-art methods, demonstrating the effectiveness of the new feature representation and the learning ability of deep models.
Conclusion: WEDeepT3 exploits both the semantic information of k-mer fragments and the evolutionary information of protein sequences to accurately differentiate between T3SEs and non-T3SEs. WEDeepT3 is available at bcmi.sjtu.edu.cn/~yangyang/WEDeepT3.html.
Keywords: type III secreted effectors / word2vector / PSSM / feature representation
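The first two steps described in Methods (treating k-mer fragments as words, then joining their embedding vectors with position-specific features) can be illustrated with a minimal sketch. It is not the authors' implementation. The choice of k = 3, the embedding dimension, the randomly generated vectors standing in for trained word embeddings, the all-zero placeholder PSSM, and the alignment of each k-mer with the PSSM row of its first residue are all assumptions made for illustration, and the function names are hypothetical.

```python
import random


def kmer_tokens(sequence, k=3):
    """Split a protein sequence into overlapping k-mers ("words")."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_embedding_table(tokens, dim=4, seed=0):
    """Stand-in for trained word vectors: one random vector per distinct k-mer.

    Assumption: a real table would be learned from a large sequence database.
    """
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1, 1) for _ in range(dim)]
            for tok in sorted(set(tokens))}


def combine_features(sequence, embeddings, pssm, k=3):
    """Concatenate each k-mer embedding with the PSSM row of its first residue.

    Assumption: this alignment rule is illustrative only.
    """
    tokens = kmer_tokens(sequence, k)
    return [embeddings[tok] + pssm[i] for i, tok in enumerate(tokens)]


seq = "MKTLLVAGS"  # hypothetical example sequence
tokens = kmer_tokens(seq)
emb = toy_embedding_table(tokens)
# Placeholder PSSM: 20 scores per residue, all zeros here.
pssm = [[0.0] * 20 for _ in seq]
features = combine_features(seq, emb, pssm)
print(len(features), len(features[0]))  # prints "7 24"
```

Each row of `features` is one position of the combined representation (4 embedding values followed by 20 PSSM scores in this toy setting). A matrix of this shape is the kind of input a convolutional network could consume in the third step.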