Analysis of protein features and machine learning algorithms for prediction of druggable proteins
Tanlin Sun, Luhua Lai, Jianfeng Pei
Background: Computational tools are widely used in the drug discovery process because they reduce time and cost. Predicting whether a protein is druggable is a fundamental step in the drug research pipeline. Sequence-based protein function prediction plays a vital role in many research areas. Training data, protein feature selection, and the machine learning algorithm are three indispensable elements that determine how well a model performs.
Methods: In this study, we tested the performance of different combinations of protein features and machine learning algorithms for druggable protein prediction, using the targets of FDA-approved small molecules as training data. We also enlarged the dataset to include the targets of small molecules under experimental or clinical investigation.
Results: Although the 146-d vector used by Li et al. combined with a neural network achieved the best training accuracy (91.10%), overlapped 3-gram word2vec with logistic regression achieved the best prediction accuracy on the independent test set (89.55%) and on newly approved targets. We also trained models on the enlarged dataset, which includes the targets of small molecules under experimental and clinical investigation; however, the best training accuracy was only 75.48%. In addition, we applied our models to predict potential targets as a reference for future studies.
Conclusions: Our study indicates the potential of word2vec for the prediction of druggable proteins. It also suggests that the training dataset of druggable proteins should not be extended to targets that lack verification. The target prediction package is available at https://github.com/pkumdl/target_prediction.
Keywords: druggable protein / drug target / word2vec / deep learning
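The "overlapped 3-gram" representation named in the abstract can be illustrated with a minimal sketch. It assumes that "overlapped" means a sliding window with stride 1 over the amino acid sequence, and that a protein-level feature vector is obtained by averaging the vectors of its 3-grams; the averaging step, the function names `overlapping_kmers` and `mean_embedding`, and the toy embedding values are illustrative assumptions, not details taken from the study. In practice the embedding table would come from a trained word2vec model.

```python
from typing import Dict, List, Sequence


def overlapping_kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-grams (sliding window, stride 1)."""
    sequence = sequence.strip().upper()
    if len(sequence) < k:
        return []
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def mean_embedding(words: Sequence[str],
                   table: Dict[str, List[float]],
                   dim: int) -> List[float]:
    """Average the vectors of the words found in `table`.

    Words missing from the table are skipped; if none are found,
    a zero vector of length `dim` is returned.
    Mean pooling is an assumption made for this sketch.
    """
    total = [0.0] * dim
    hits = 0
    for word in words:
        vec = table.get(word)
        if vec is None:
            continue
        hits += 1
        for j in range(dim):
            total[j] += vec[j]
    if hits == 0:
        return total
    return [x / hits for x in total]


# Toy 2-d embedding table standing in for trained word2vec vectors (made-up values).
toy_table = {"MKT": [1.0, 0.0], "KTA": [0.0, 1.0], "TAY": [1.0, 1.0]}

words = overlapping_kmers("MKTAY")
features = mean_embedding(words, toy_table, dim=2)
print(words)
print(features)
```

A fixed-length vector of this kind could then be passed to any standard classifier, such as the logistic regression mentioned in the Results.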
[1] The UniProt Consortium. (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 45, D158–D169
[2] Wishart, D. S., Feunang, Y. D., Guo, A. C., Lo, E. J., Marcu, A., Grant, J. R., Sajed, T., Johnson, D., Li, C., Sayeeda, Z.,
[3] Butcher, S. P. (2003) Target discovery and validation in the post-genomic era. Neurochem. Res., 28, 367–371
[4] Dundas, J., Ouyang, Z., Tseng, J., Binkowski, A., Turpaz, Y. and Liang, J. (2006) CASTp: computed atlas of surface topography of proteins with structural and topographical mapping of functionally annotated residues. Nucleic Acids Res., 34, W116–W118
[5] Schmidtke, P., Le Guilloux, V., Maupetit, J. and Tufféry, P. (2010) fpocket: online tools for protein ensemble pocket detection and tracking. Nucleic Acids Res., 38, W582–W589
[6] Hussein, H. A., Borrel, A., Geneix, C., Petitjean, M., Regad, L. and Camproux, A.-C. (2015) PockDrug-Server: a new web server for predicting pocket druggability on holo and apo proteins. Nucleic Acids Res., 43, W436–W442
[7] Yuan, Y., Pei, J. and Lai, L. (2013) Binding site detection and druggability prediction of protein targets for structure-based drug design. Curr. Pharm. Des., 19, 2326–2333
[8] Hajduk, P. J., Huth, J. R. and Fesik, S. W. (2005) Druggability indices for protein targets derived from NMR-based screening data. J. Med. Chem., 48, 2518–2525
[9] Rose, P. W., Prlić, A., Altunkaya, A., Bi, C., Bradley, A. R., Christie, C. H., Costanzo, L. D., Duarte, J. M., Dutta, S. and Feng, Z. (2016) The RCSB protein data bank: integrative view of protein, gene and 3D structural information. Nucleic Acids Res., 45, D271–D281
[10] Mitsopoulos, C., Schierz, A. C., Workman, P. and Al-Lazikani, B. (2015) Distinctive behaviors of druggable proteins in cellular networks. PLoS Comput. Biol., 11, e1004597
[11] Lipinski, C. A. (2004) Lead- and drug-like compounds: the rule-of-five revolution. Drug Discov. Today Technol., 1, 337–341
[12] Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. and Hopkins, A. L. (2012) Quantifying the chemical beauty of drugs. Nat. Chem., 4, 90–98
[13] Li, Q. and Lai, L. (2007) Prediction of potential drug targets based on simple sequence properties. BMC Bioinformatics, 8, 353
[14] Jamali, A. A., Ferdousi, R., Razzaghi, S., Li, J., Safdari, R. and Ebrahimie, E. (2016) DrugMiner: comparative analysis of machine learning algorithms for prediction of potential druggable proteins. Drug Discov. Today, 21, 718–724
[15] Guo, Y., Yu, L., Wen, Z. and Li, M. (2008) Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res., 36, 3025–3030
[16] Shen, J., Zhang, J., Luo, X., Zhu, W., Yu, K., Chen, K., Li, Y. and Jiang, H. (2007) Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. USA, 104, 4337–4341
[17] Asgari, E. and Mofrad, M. R. (2015) Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One, 10, e0141287
[18] Wallach, H. M. (2006) Topic modeling: beyond bag-of-words. In ICML '06 Proceedings of the 23rd International Conference on Machine learning. pp. 977–984, Pittsburgh
[19] Xue, B., Fu, C. and Shaobin, Z. (2014) A study on sentiment computing and classification of sina weibo with word2vec. In 2014 IEEE International Congress on Big Data. pp. 358–363. Anchorage
[20] Chung, Y.-A., Wu, C.-C., Shen, C.-H., Lee, H.-Y. and Lee, L.-S. (2016) Audio word2vec: unsupervised learning of audio segment representations using sequence-to-sequence autoencoder. arXiv, 1603.00982
[21] Ngo, D. L., Yamamoto, N., Tran, V. A., Nguyen, N. G., Phan, D., Lumbanraja, F. R., Kubo, M. and Satou, K. (2016) Application of word embedding to drug repositioning. J. Biomed. Sci. Eng., 9, 7–16
[22] Kimothi, D., Soni, A., Biyani, P. and Hogan, J. M. (2016) Distributed representations for biological sequence analysis. arXiv, 1608.05949
[23] Vang, Y. S. and Xie, X. (2017) HLA class I binding prediction via convolutional neural networks. Bioinformatics, 33, 2658–2665
[24] Kanehisa, M. and Goto, S. (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res., 28, 27–30
[25] Zeng, Y. H., Guo, Y. Z., Xiao, R. Q., Yang, L., Yu, L. Z. and Li, M. L. (2009) Using the augmented Chou’s pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach. J. Theor. Biol., 259, 366–372
[26] Liu, T., Geng, X., Zheng, X., Li, R. and Wang, J. (2012) Accurate prediction of protein structural class using auto covariance transformation of PSI-BLAST profiles. Amino Acids, 42, 2243–2249
[27] Wang, Y.-C., Wang, X.-B., Yang, Z.-X. and Deng, N.-Y. (2010) Prediction of enzyme subfamily class via pseudo amino acid composition by incorporating the conjoint triad feature. Protein Pept. Lett., 17, 1441–1449
[28] Ottis, P., Toure, M., Cromm, P. M., Ko, E., Gustafson, J. L. and Crews, C. M. (2017) Assessing different E3 ligases for small molecule induced protein ubiquitination and degradation. ACS Chem. Biol., 12, 2570–2578