Analysis of protein features and machine learning algorithms for prediction of druggable proteins

Tanlin Sun, Luhua Lai, Jianfeng Pei

Quant. Biol., 2018, Vol. 6, Issue 4: 334-343.

DOI: 10.1007/s40484-018-0157-2
RESEARCH ARTICLE

Abstract

Background: Computational tools are widely used in the drug discovery process because they reduce time and cost. Predicting whether a protein is druggable is fundamental to the drug research pipeline. Sequence-based protein function prediction plays a vital role in many research areas. Training data, protein feature selection, and machine learning algorithms are the three indispensable elements that determine the success of such models.

Methods: In this study, we tested the performance of different combinations of protein features and machine learning algorithms for predicting druggable proteins, using the targets of FDA-approved small molecules. We also enlarged the dataset to include the targets of small molecules under experimental or clinical investigation.

Results: Although the 146-dimensional vector used by Li et al., combined with a neural network, achieved the best training accuracy (91.10%), overlapped 3-gram word2vec with logistic regression achieved the best prediction accuracy on the independent test set (89.55%) and on newly approved targets. We also trained models on the enlarged dataset, which included the targets of small molecules under experimental and clinical investigation; there, the best training accuracy was only 75.48%. In addition, we applied our models to predict potential targets as a reference for future studies.

Conclusions: Our study indicates the potential of word2vec for predicting druggable proteins. It also suggests that the training dataset for druggable proteins should not be extended to targets that lack verification. The target prediction package is available at https://github.com/pkumdl/target_prediction.
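
The "overlapped 3-gram" representation mentioned in the Results can be illustrated with a minimal sketch. It assumes the common convention of sliding a window of three residues along the sequence with a stride of one, then averaging the per-word vectors into one fixed-length feature vector. The function names, the toy peptide, and the two-dimensional vectors below are hypothetical and are not taken from the authors' package; a real pipeline would use embeddings trained with word2vec.

```python
from typing import Dict, List


def overlapping_kmers(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers (stride 1)."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def mean_embedding(words: List[str], table: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of the words found in `table`; zeros if none are found."""
    vectors = [table[w] for w in words if w in table]
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


# Toy example: a made-up peptide and made-up 2-d vectors (not trained embeddings).
words = overlapping_kmers("MKTAY")
toy_table = {"MKT": [1.0, 0.0], "KTA": [0.0, 1.0], "TAY": [1.0, 1.0]}
features = mean_embedding(words, toy_table, dim=2)
```

A feature vector built this way could then be passed to any standard classifier, such as logistic regression; whether the authors pooled the word vectors by averaging is an assumption here.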

Keywords

druggable protein / drug target / word2vec / deep learning

Cite this article

Tanlin Sun, Luhua Lai, Jianfeng Pei. Analysis of protein features and machine learning algorithms for prediction of druggable proteins. Quant. Biol., 2018, 6(4): 334-343. DOI: 10.1007/s40484-018-0157-2

References

[1] The UniProt Consortium. (2017) UniProt: the universal protein knowledgebase. Nucleic Acids Res., 45, D158–D169

[2] Wishart, D. S., Feunang, Y. D., Guo, A. C., Lo, E. J., Marcu, A., Grant, J. R., Sajed, T., Johnson, D., Li, C., Sayeeda, Z., et al. (2018) DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res., 46, D1074–D1082

[3] Butcher, S. P. (2003) Target discovery and validation in the post-genomic era. Neurochem. Res., 28, 367–371

[4] Dundas, J., Ouyang, Z., Tseng, J., Binkowski, A., Turpaz, Y. and Liang, J. (2006) CASTp: computed atlas of surface topography of proteins with structural and topographical mapping of functionally annotated residues. Nucleic Acids Res., 34, W116–W118

[5] Schmidtke, P., Le Guilloux, V., Maupetit, J. and Tufféry, P. (2010) fpocket: online tools for protein ensemble pocket detection and tracking. Nucleic Acids Res., 38, W582–W589

[6] Hussein, H. A., Borrel, A., Geneix, C., Petitjean, M., Regad, L. and Camproux, A.-C. (2015) PockDrug-Server: a new web server for predicting pocket druggability on holo and apo proteins. Nucleic Acids Res., 43, W436–W442

[7] Yuan, Y., Pei, J. and Lai, L. (2013) Binding site detection and druggability prediction of protein targets for structure-based drug design. Curr. Pharm. Des., 19, 2326–2333

[8] Hajduk, P. J., Huth, J. R. and Fesik, S. W. (2005) Druggability indices for protein targets derived from NMR-based screening data. J. Med. Chem., 48, 2518–2525

[9] Rose, P. W., Prlić, A., Altunkaya, A., Bi, C., Bradley, A. R., Christie, C. H., Costanzo, L. D., Duarte, J. M., Dutta, S. and Feng, Z. (2016) The RCSB protein data bank: integrative view of protein, gene and 3D structural information. Nucleic Acids Res., 45, D271–D281

[10] Mitsopoulos, C., Schierz, A. C., Workman, P. and Al-Lazikani, B. (2015) Distinctive behaviors of druggable proteins in cellular networks. PLoS Comput. Biol., 11, e1004597

[11] Lipinski, C. A. (2004) Lead- and drug-like compounds: the rule-of-five revolution. Drug Discov. Today Technol., 1, 337–341

[12] Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. and Hopkins, A. L. (2012) Quantifying the chemical beauty of drugs. Nat. Chem., 4, 90–98

[13] Li, Q. and Lai, L. (2007) Prediction of potential drug targets based on simple sequence properties. BMC Bioinformatics, 8, 353

[14] Jamali, A. A., Ferdousi, R., Razzaghi, S., Li, J., Safdari, R. and Ebrahimie, E. (2016) DrugMiner: comparative analysis of machine learning algorithms for prediction of potential druggable proteins. Drug Discov. Today, 21, 718–724

[15] Guo, Y., Yu, L., Wen, Z. and Li, M. (2008) Using support vector machine combined with auto covariance to predict protein-protein interactions from protein sequences. Nucleic Acids Res., 36, 3025–3030

[16] Shen, J., Zhang, J., Luo, X., Zhu, W., Yu, K., Chen, K., Li, Y. and Jiang, H. (2007) Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. USA, 104, 4337–4341

[17] Asgari, E. and Mofrad, M. R. (2015) Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One, 10, e0141287

[18] Wallach, H. M. (2006) Topic modeling: beyond bag-of-words. In ICML '06 Proceedings of the 23rd International Conference on Machine learning. pp. 977–984, Pittsburgh

[19] Xue, B., Fu, C. and Shaobin, Z. (2014) A study on sentiment computing and classification of sina weibo with word2vec. In 2014 IEEE International Congress on Big Data. pp. 358–363, Anchorage

[20] Chung, Y.-A., Wu, C.-C., Shen, C.-H., Lee, H.-Y. and Lee, L.-S. (2016) Audio word2vec: unsupervised learning of audio segment representations using sequence-to-sequence autoencoder. arXiv, 1603.00982

[21] Ngo, D. L., Yamamoto, N., Tran, V. A., Nguyen, N. G., Phan, D., Lumbanraja, F. R., Kubo, M. and Satou, K. (2016) Application of word embedding to drug repositioning. J. Biomed. Sci. Eng., 9, 7–16

[22] Kimothi, D., Soni, A., Biyani, P. and Hogan, J. M. (2016) Distributed representations for biological sequence analysis. arXiv, 1608.05949

[23] Vang, Y. S. and Xie, X. (2017) HLA class I binding prediction via convolutional neural networks. Bioinformatics, 33, 2658–2665

[24] Kanehisa, M. and Goto, S. (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res., 28, 27–30

[25] Zeng, Y. H., Guo, Y. Z., Xiao, R. Q., Yang, L., Yu, L. Z. and Li, M. L. (2009) Using the augmented Chou’s pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach. J. Theor. Biol., 259, 366–372

[26] Liu, T., Geng, X., Zheng, X., Li, R. and Wang, J. (2012) Accurate prediction of protein structural class using auto covariance transformation of PSI-BLAST profiles. Amino Acids, 42, 2243–2249

[27] Wang, Y.-C., Wang, X.-B., Yang, Z.-X. and Deng, N.-Y. (2010) Prediction of enzyme subfamily class via pseudo amino acid composition by incorporating the conjoint triad feature. Protein Pept. Lett., 17, 1441–1449

[28] Ottis, P., Toure, M., Cromm, P. M., Ko, E., Gustafson, J. L. and Crews, C. M. (2017) Assessing different E3 ligases for small molecule induced protein ubiquitination and degradation. ACS Chem. Biol., 12, 2570–2578

RIGHTS & PERMISSIONS

Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature
