INTRODUCTION
With the development of high-throughput sequencing techniques, the number of sequenced proteins has been growing steadily and exponentially [1]. With the deeper understanding brought by the Human Genome Project, many proteins have been identified as crucial members of disease networks. However, transferring these fruitful results into clinical use is not easy, and the number of FDA-approved drug targets has been increasing much more slowly [2]. One of the most critical reasons for the failure of drug discovery pipelines is the choice of wrong protein targets [3]. Computational tools have been widely used in the drug discovery process and have reduced time and cost compared with experiments. For druggable protein prediction, some researchers evaluate druggability by detecting whether a protein has a proper druggable binding pocket; existing software includes CASTp [4], fpocket [5], PockDrug [6], and CAVITY [7]. By NMR experiments, Hajduk et al. identified that only a small number of structural features, such as protein surface polarity, surface complexity and a few pocket descriptors, were enough for druggable protein prediction, achieving an accuracy of 94% with those features [8]. However, heavy dependence on the availability of three-dimensional structures limits the use of these methods, since only about 1% of proteins have solved structures [9]. Mitsopoulos et al. found that the network topology of druggable proteins differs significantly from that of undruggable proteins; using 300 topology parameters, they successfully predicted general druggable proteins and cancer druggable proteins [10]. Since whether a small molecule is a potential drug candidate can be determined approximately and simply from a number of selected physicochemical features [11,12], we believe that druggable proteins may also be predicted from simple physicochemical features or sequence composition statistics. Li et al. developed a 146-dimensional vector to represent a protein sequence; using a support vector machine (SVM), they achieved an accuracy of 84% on only 186 positive samples [13]. The number of approved drug targets has since increased [2]. With the enlarged dataset, Jamali et al. applied a 443-dimensional vector and various machine learning algorithms to determine the best model; a neural network stood out with the best accuracy of 89.78% [14].
In this paper, we tested the performance of different combinations of protein features and machine learning algorithms in druggable protein prediction. FDA-approved small molecules’ targets were used as training samples. Although the 146-dimensional vector used in Ref. [13] (LQL-v) with a neural network (NN) achieved the best training accuracy of 91.10%, overlapped 3-gram word2vec with logistic regression (LR) achieved the best prediction accuracy on the independent test set (89.55%) and on newly approved targets. Furthermore, an enlarged dataset including the targets of small molecules in experiment and under investigation was trained; unfortunately, the best training accuracy was only 75.89%. In addition, we applied our models to predict potential targets as references for future studies. Our study indicates the potential of word2vec in the prediction of druggable proteins, and suggests that the training dataset of druggable proteins should not be extended to targets that lack verification.
RESULTS
We tested the performance of different combinations of protein features and machine learning algorithms in druggable protein prediction. The protein features we chose are listed in Table 1. The machine learning algorithms we chose are NN, SVM, LR, decision tree (DT), gradient boosted decision tree (GBDT), k-nearest neighbors (KNN), random forest (RF) and naïve Bayes (NB).
The small dataset we used was retrieved from the FDA-approved small molecules’ drug targets in DrugBank and was constructed by Jamali et al. [14,15]; it contains 1224 positive samples. We also constructed a large dataset that additionally includes the targets of experimental and clinically investigational small molecules; it contains 5503 positive samples. Negative samples were screened according to the rules in Ref. [14]. For both the small and the large dataset, negative samples were sampled three times, giving three batches of 1217 negative samples for the small dataset and three batches of 5498 negative samples for the large dataset. For an unbiased comparison, independent and external test sets were constructed separately. The details of the protein features, machine learning parameters and dataset construction are described in the Materials and Methods.
Amino acid distributed representation
One of the most critical and basic tasks in Natural Language Processing (NLP) is to learn representative embeddings of words. Some NLP applications, such as topic identification, use the bag-of-words (BOW) model [18]. Although it performs well in some tasks, its application is limited by dimension explosion and its inability to represent semantics. Continuous vector representations were developed to embed the meaning of every word in an n-dimensional space: trained on contexts, words with similar meanings are located close together in the space. Word2vec is one of the most successful algorithms of this kind and has found various applications [19–21].
In protein sequence representation, the AC, CT, LQL-v and Jamali-v encodings used in this work are analogous to BOW models, in that their elements are determined by counting over the sequence according to designed features. Recently, word2vec has also been used to study protein sequences [17,22,23].
We used three forms of input to train the word2vec model (a short tokenization sketch is given after the three forms below). For example, given a protein sequence:
LRQTVKNTVSQV
(1) Single: each single amino acid together with its context was used for training, generating an embedding vector for every amino acid;
(2) Overlapped 3-gram: the original sequence was broken into overlapping 3-mers with a window of size 3 shifted one residue at a time, and training with their contexts gave an embedding vector for every overlapped 3-gram:
LRQ RQT QTV TVK VKN…
(3) Non-overlapped 3-gram: shifting the 3-residue window produced three new token sequences, whose 3-mers were generated in the same way as in (2); after training with their contexts, an embedding vector for every non-overlapped 3-gram was obtained:
LRQ TVK NTV SQV
RQT VKN TVS…
QTV KNT VSQ…
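As an illustration, the three tokenizations can be written as a short Python sketch (function names are ours, not the original code):

```python
def tokenize_single(seq):
    """Form (1): each amino acid is a token."""
    return list(seq)

def tokenize_overlapped_3gram(seq):
    """Form (2): overlapping 3-mers from a window of size 3 shifted one residue at a time."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

def tokenize_nonoverlapped_3gram(seq):
    """Form (3): three reading frames of non-overlapping 3-mers (trailing residues shorter than 3 are dropped)."""
    frames = []
    for offset in range(3):
        shifted = seq[offset:]
        frames.append([shifted[i:i + 3] for i in range(0, len(shifted) - 2, 3)])
    return frames

if __name__ == "__main__":
    seq = "LRQTVKNTVSQV"
    print(tokenize_single(seq))               # ['L', 'R', 'Q', 'T', 'V', 'K', ...]
    print(tokenize_overlapped_3gram(seq))     # ['LRQ', 'RQT', 'QTV', 'TVK', 'VKN', ...]
    print(tokenize_nonoverlapped_3gram(seq))  # [['LRQ', 'TVK', 'NTV', 'SQV'], ['RQT', 'VKN', 'TVS'], ['QTV', 'KNT', 'VSQ']]
```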
The vector space (on the training set) for single amino acids is shown in Figure 1. The colors indicate the hydrophobicity (Figure 1A), polarity (Figure 1B) and van der Waals volume (Figure 1C) of each amino acid. The vector space differs from that in Ref. [23], which is readily explained by the different training datasets and the different goals of the two studies. In both spaces, however, amino acids with similar color (property) tend to cluster together, albeit with some outliers.
Figures 2 and 3 show the vector spaces (on the training set) for overlapped and non-overlapped 3-grams. Comparing the two, the overlapped 3-gram space was more widely distributed, with three clusters formed; a similarly distributed space was also observed in Ref. [17]. The color intensity of the non-overlapped 3-gram space showed a more gradual gradient, which was also present but less obvious in the overlapped 3-gram space.
For comparison, we constructed artificial scrambled spaces by randomly shuffling the labels of the 3-grams in the overlapped and non-overlapped spaces. To quantitatively measure the continuity of the selected properties in protein space, for each property we calculated the ratio of the best Lipschitz constants (the minimum k) (i) between the overlapped space and the random space and (ii) between the non-overlapped space and the random space (see Methods). This ratio illustrates the smoothness of a space and reflects the richness of the information it contains. Table 2 shows that the ratio k between any property in a trained 3-gram space (overlapped or non-overlapped) and the random space was below 1, although not dramatically so, suggesting that both trained 3-gram spaces were smoother than the random space and contained richer information. However, no significant difference was found between the overlapped and non-overlapped spaces.
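As a sketch of the idea, assuming the best Lipschitz constant is taken as the largest ratio |f(x) − f(y)|/||x − y|| over all pairs of 3-gram embeddings x, y with property values f (the exact procedure is given in the Methods and may differ in detail), the ratio against a label-shuffled space could be computed as follows (all names are ours):

```python
import numpy as np
from scipy.spatial.distance import pdist

def best_lipschitz_constant(vectors, prop_values):
    """Smallest k with |f(x) - f(y)| <= k * ||x - y|| over all pairs of points,
    i.e., the maximum pairwise ratio |f(x) - f(y)| / ||x - y||."""
    dist = pdist(np.asarray(vectors, dtype=float))                       # Euclidean ||x - y||
    prop_diff = pdist(np.asarray(prop_values, dtype=float).reshape(-1, 1),
                      metric="cityblock")                                # |f(x) - f(y)|
    mask = dist > 0                                                      # ignore coincident points
    return (prop_diff[mask] / dist[mask]).max()

def lipschitz_ratio(vectors, prop_values, seed=0):
    """Ratio of the best Lipschitz constant in the trained space to that in a
    label-shuffled (random) space; a value below 1 means the trained space is smoother."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(prop_values, dtype=float))
    return best_lipschitz_constant(vectors, prop_values) / \
           best_lipschitz_constant(vectors, shuffled)
```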
Comparison of prediction accuracies for different protein features in combination with different machine learning algorithms
We tested the performance of different combinations of protein features and machine learning algorithms in druggable protein prediction. We randomly separated 90% of the whole dataset as the training set and used the remaining 10% as the independent test set. The 5-fold cross-validation (5-CV) training results are shown in Table 3, and the results of model prediction on the independent test set are shown in Table 4. We use accuracy as the performance measure; the full performance metrics can be found in the Supplementary Materials.
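The evaluation protocol can be sketched with scikit-learn roughly as below; the classifiers use the settings stated in this paper where given (e.g., the grid-searched SVM parameters in the Methods) and library defaults otherwise, and the variable names are ours. The Keras NN (see Methods) is trained separately and is not included in this sketch.

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

def evaluate(X, y, seed=0):
    """X: protein feature matrix (e.g., overlapped 3-gram word2vec vectors),
    y: 1 = druggable (drug target), 0 = negative sample."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.10, stratify=y, random_state=seed)   # 90% training / 10% independent test
    classifiers = {
        "LR": LogisticRegression(max_iter=1000),
        "SVM": SVC(C=1000, gamma=0.001),                        # grid-searched values reported in Methods
        "RF": RandomForestClassifier(),
        "GBDT": GradientBoostingClassifier(),
        "DT": DecisionTreeClassifier(),
        "KNN": KNeighborsClassifier(),
        "NB": GaussianNB(),
    }
    for name, clf in classifiers.items():
        cv_acc = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy")  # 5-CV training accuracy
        test_acc = clf.fit(X_train, y_train).score(X_test, y_test)                 # independent test accuracy
        print(f"{name}: 5-CV = {cv_acc.mean():.4f}, independent test = {test_acc:.4f}")
```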
As for the protein features tested, LQL-v with NN achieved the best training accuracy of 91.93%. Jamali-v with NN achieved a training accuracy of 88.12%, which was comparable to the result in Ref. [14] (we had separated 10% of the training samples as an independent test set, whereas Jamali et al. did not). The overall performance of the word2vec-based features was comparable to LQL-v and Jamali-v; overlapped 3-gram word2vec with NN achieved the second best training accuracy of 90.23%, and overlapped 3-gram word2vec performed better than non-overlapped 3-gram and single-amino-acid word2vec. The training accuracies of the sequence-statistics-based features AC and CT were only around 60%–70%, far worse than the others. In summary, LQL-v, Jamali-v and word2vec performed comparably well with most of the machine learning algorithms. As for the machine learning methods tested, all algorithms except NB, DT and KNN achieved stable performance on the three best protein features analyzed. For prediction on the independent test set, overlapped 3-gram word2vec with LR achieved an accuracy of 89.55%, the best of all combinations. We sampled the negative samples and the independent dataset three times, and the results were approximately the same. We then used LQL-v and overlapped 3-gram word2vec individually to train on the whole small dataset to obtain our final prediction models; overlapped 3-gram word2vec performed better than LQL-v on newly approved targets (Table 5).
Performance of training on the larger dataset
Compared with Li et al.’s result, which was based on the limited number of FDA-approved small molecules’ targets available ten years ago, the accuracy increased substantially from 84% to 91.93% with the same protein features (LQL-v) and machine learning algorithm (SVM). To test whether a larger training dataset could further improve the current results, an enlarged dataset including the targets of investigational and experimental molecules, which increased the training samples two-fold, was built and trained with LQL-v and overlapped 3-gram word2vec using various machine learning methods. Unfortunately, the best prediction accuracy was only 76.44% (Table 6), achieved by LQL-v with GBDT. We then used the models trained on the small dataset to test the enlarged dataset with the training samples excluded: 50%–70% of the targets of investigational and experimental molecules in the enlarged dataset were recognized as positive (Table 7). Of them, 1,587 were predicted as positive by at least 5 models for each feature set (LQL-v and overlapped 3-gram word2vec), which we defined as “druggable”; 856 were predicted as negative by all of the models, which we defined as “undruggable”; and the others were defined as “undefined”. The druggable proteins were tagged with 3256 pathway labels by the Kyoto Encyclopedia of Genes and Genomes (KEGG) [24], and the top 20 pathways are listed in Table 8. From the table, some of the top tags were related to disease pathways, such as metabolic pathways, cancer, neuronal diseases and infectious diseases. The whole list of predictions for the targets of investigational or experimental molecules is supplied in the Supplementary Materials.
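A minimal sketch of the pathway-tagging step using the public KEGG REST interface (https://rest.kegg.jp) is shown below; mapping the UniProt accessions of predicted druggable proteins to KEGG gene identifiers (e.g., "hsa:7157") is assumed to have been done beforehand, and this is our illustration rather than the original pipeline.

```python
from collections import Counter
import urllib.request

KEGG = "https://rest.kegg.jp"

def kegg_get(path):
    """Fetch one KEGG REST endpoint and return its tab-separated lines."""
    with urllib.request.urlopen(f"{KEGG}/{path}") as resp:
        return [line.split("\t") for line in resp.read().decode().splitlines() if line]

# Human pathway id -> name, e.g., 'hsa04110' -> 'Cell cycle - Homo sapiens (human)'.
pathway_names = {pid.replace("path:", ""): name for pid, name in kegg_get("list/pathway/hsa")}

def top_pathways(kegg_gene_ids, n=20):
    """Count pathway labels over KEGG gene ids of predicted druggable proteins
    (one request per gene; batching is omitted for brevity) and return the n most frequent."""
    counts = Counter()
    for gene in kegg_gene_ids:
        for _, pid in kegg_get(f"link/pathway/{gene}"):
            counts[pid.replace("path:", "")] += 1
    return [(pid, pathway_names.get(pid, pid), c) for pid, c in counts.most_common(n)]
```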
DISCUSSION
In this paper, we tested the performance of different combinations of protein features and machine learning algorithms in druggable protein prediction. Just as word representations play a fundamental role in NLP, protein features or protein encodings are critical and basic in protein function prediction. AC and CT, which are purely sequence-statistics-based features, have been successfully applied in protein subcellular localization prediction [25], protein-protein interaction prediction [15,16], structural class prediction [26,27], etc., but failed in druggable protein prediction. The reason might be that this task requires more physicochemical features of amino acids. In contrast, the prediction accuracies of LQL-v and Jamali-v, which include physicochemical features, were far better. Although word2vec features are purely sequence based, they performed comparably to LQL-v and Jamali-v. The reason might be that protein properties were implicitly embedded in the vectors learned from sequence contexts. This was partly confirmed by the ratio of best Lipschitz constants (the minimum k): compared with the random space, the 3-gram spaces were smoother and therefore richer in information. As for the three forms of word2vec, overlapped 3-gram vectors were distributed more widely in the space and achieved comparably better prediction accuracy. It will be interesting to study the relationship between vector distribution and prediction accuracy.
The number of FDA-approved small molecules’ targets has increased compared with ten years ago, and the same model trained with this enlarged dataset achieved an accuracy improvement of around 5%, indicating that the model benefited from more training samples. However, adding the targets of small molecules in experiment or under investigation, which further enlarged the training samples two-fold, decreased the prediction accuracy to a much lower level. This may suggest that many of the targets under research are difficult for small molecules to target and should not be included in the positive training set, which is partially indicated by the fact that only 50%–70% of the positive samples in the enlarged dataset were recognized as druggable targets. Although the concept of a drug target is ever changing with emerging strategies such as proteolysis-targeting chimeras (PROTACs) [28], so that proteins previously regarded as undruggable may become druggable, a substantial part of the protein targets in experiments and clinical investigations are predicted as undruggable by our models. These results might help researchers to recheck their targets under investigation.
CONCLUSION
We tested the performance of different combinations of protein features and machine learning algorithms in druggable protein prediction. Although the 146-dimensional vector used by Li et al. (LQL-v) with a neural network achieved the best training accuracy of 91.10%, overlapped 3-gram word2vec with logistic regression achieved the best prediction accuracy on the independent test set (89.55%) and on newly approved targets. Although word2vec features do not explicitly include amino acid physicochemical properties, they performed comparably to the other, carefully designed features. Furthermore, an enlarged dataset including the targets of small molecules in experiment and under investigation was trained, and we showed that model prediction accuracy did not benefit from this enlargement. In addition, we applied our models to predict probable targets as references for future studies, and most of the predicted targets were disease-related.
MATERIALS AND METHODS
Dataset
Small dataset: The small dataset was created by Jamali et al. The positive samples were FDA-approved small molecules’ targets retrieved from DrugBank. After removal of sequences that contained rare amino acids and sequences that could not be encoded into any of the protein features we used, 1209 sequences remained.
Enlarged dataset: The enlarged dataset was created by us. The positive samples included not only the FDA-approved small molecules’ targets but also the targets of experimental or investigational small molecules in DrugBank and the Therapeutic Target Database (TTD). A total of 5503 positive samples were included.
Negative samples: Negative samples were screened by removing from Swiss-Prot the protein sequences that were (i) among the positive samples; (ii) classified in DrugBank or TTD as targets of experimental or investigational molecules; or (iii) in the same families as the proteins in (i) and (ii). Details can be found in Ref. [14]. We sampled negative samples three times because (i) the number of screened negative samples exceeded the number of positive samples, so they had to be subsampled to approximately equal size, and (ii) repeated sampling gives more solid evidence that one particular combination of protein features and machine learning algorithm performs better than the others. Three batches of 1235 negative samples for the small dataset and three batches of 5498 negative samples for the large dataset were constructed.
Independent dataset: We sampled 90% of the dataset to train the models and used the remaining 10% as the independent test set.
Newly approved external dataset: We downloaded the sequences from DrugBank (https://www.drugbank.ca) that were tagged as “approved” (updated 2018-04-02). Sequences identical to those in the small dataset (training set) were excluded, leaving 1,419 sequences.
Protein features
Auto-covariance (AC): The protein sequence was transformed by the following equation:

$$AC(lag, j) = \frac{1}{n - lag}\sum_{i=1}^{n - lag}\left(P_{i,j} - \frac{1}{n}\sum_{i=1}^{n} P_{i,j}\right)\left(P_{i+lag,\,j} - \frac{1}{n}\sum_{i=1}^{n} P_{i,j}\right)$$

where j refers to the j-th descriptor, i is the position in the protein sequence X, P_{i,j} is the normalized j-th descriptor value of the i-th amino acid, n is the length of the protein sequence X, and lag is the value of the lag. In this way, proteins with variable lengths could be coded into vectors of equal length (j × lag).
In this study, the descriptors were seven physicochemical properties (hydrophobicity, hydrophilicity, net charge index of side chains, polarity, polarizability, solvent accessible surface area, and volume of side chains). Guo et al. selected a value of 30 for the lag and we used the same value. Consequently, the vector contained 210 numbers (7 × 30) [15].
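A minimal sketch of the AC encoding following the formulation above (descriptor scales are assumed to be supplied already normalized; all names are ours):

```python
import numpy as np

def auto_covariance(seq, descriptors, lag_max=30):
    """Auto-covariance encoding: for each descriptor and each lag in 1..lag_max,
    the covariance between descriptor values of residues lag positions apart.
    `descriptors` is a list of dicts mapping amino acid -> normalized descriptor value."""
    n = len(seq)
    features = []
    for scale in descriptors:
        values = np.array([scale[aa] for aa in seq], dtype=float)
        mean = values.mean()                                  # (1/n) * sum_i P_{i,j}
        for lag in range(1, lag_max + 1):
            ac = np.mean((values[: n - lag] - mean) * (values[lag:] - mean))
            features.append(ac)
    return np.array(features)                                 # length = len(descriptors) * lag_max (7 * 30 = 210)
```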
Conjoint Triad (CT): First, all 20 amino acids were clustered into seven groups according to their dipole moments and side chain volumes. Next, each amino acid of a protein sequence was replaced by its cluster number; in the example used here, the protein sequence was replaced by the cluster-number string 3562142411. Then, a window of three residues was slid across the whole sequence one step at a time from the N-terminus to the C-terminus, and the frequency of each combination of three cluster numbers was counted. The protein P was thereby represented by a vector of 343 numbers, all of which are zero except for f276 (triad 356), f89 (562), f13 (621), f149 (214), f71 (142), f158 (424), f23 (241) and f4 (411).
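A minimal sketch of the CT encoding; the seven-group assignment below is the clustering commonly used in the conjoint triad literature and the triad ordering is arbitrary but fixed, so details may differ from the original implementation:

```python
from itertools import product

# Illustrative amino acid -> cluster assignment (seven groups by dipole moment and side chain volume).
CT_GROUPS = {
    "A": 1, "G": 1, "V": 1,
    "I": 2, "L": 2, "F": 2, "P": 2,
    "Y": 3, "M": 3, "T": 3, "S": 3,
    "H": 4, "N": 4, "Q": 4, "W": 4,
    "R": 5, "K": 5,
    "D": 6, "E": 6,
    "C": 7,
}

def conjoint_triad(seq):
    """Encode a protein sequence as 343 (= 7^3) triad frequencies by sliding a
    3-residue window one step at a time over the cluster-number string."""
    clusters = [CT_GROUPS[aa] for aa in seq]
    freq = {t: 0 for t in product(range(1, 8), repeat=3)}
    for i in range(len(clusters) - 2):
        freq[tuple(clusters[i:i + 3])] += 1
    return [freq[t] for t in sorted(freq)]   # fixed ordering -> 343-dimensional vector
```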
LQL-v: The composition of the 20 amino acids forms the first 20 dimensions. The amino acids were then clustered into 3 classes for each of six physicochemical properties (including hydrophobicity, polarity, polarizability, solvent accessibility and normalized van der Waals volume), and the composition, transition and distribution of the classes were calculated for each property. For example, for the hydrophobicity property the amino acids were clustered into polar, neutral and hydrophobic classes. For “Composition”, 3 dimensions were calculated: the percentages of polar, neutral and hydrophobic residues. For “Transition”, 3 dimensions were calculated: the percentages of transitions between polar and neutral, between neutral and hydrophobic, and between hydrophobic and polar residues. For “Distribution”, 5 dimensions were calculated for each of the 3 classes: the relative sequence positions at which the first residue and 25%, 50%, 75% and 100% of the residues of that class are located. In total, a 146-dimensional vector was calculated for LQL-v. A sketch of this encoding for one attribute is given below.
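The sketch below covers the composition/transition/distribution block for a single attribute (hydrophobicity), using a commonly cited three-class grouping; the exact class definitions and boundary conventions of LQL-v may differ, so this is an assumption-laden illustration rather than the original implementation.

```python
# Illustrative three-class grouping for the hydrophobicity attribute
# (polar / neutral / hydrophobic); groupings for the other attributes are analogous.
HYDROPHOBICITY = {"polar": set("RKEDQN"), "neutral": set("GASTPHY"), "hydrophobic": set("CLVIMFW")}

def ctd_one_property(seq, groups=HYDROPHOBICITY):
    """Composition (3), Transition (3) and Distribution (3 x 5) descriptors for one
    physicochemical attribute, 21 values in total."""
    labels = [next(name for name, aas in groups.items() if aa in aas) for aa in seq]
    n = len(labels)
    names = list(groups)
    # Composition: fraction of residues in each class.
    composition = [labels.count(c) / n for c in names]
    # Transition: fraction of adjacent residue pairs switching between two given classes.
    pairs = list(zip(labels, labels[1:]))
    transition = [sum(p in {(a, b), (b, a)} for p in pairs) / (n - 1)
                  for a, b in [(names[0], names[1]), (names[1], names[2]), (names[2], names[0])]]
    # Distribution: relative sequence position of the 1st, 25%, 50%, 75% and 100% residue of each class.
    distribution = []
    for c in names:
        positions = [i + 1 for i, lab in enumerate(labels) if lab == c]
        for frac in (0.0, 0.25, 0.50, 0.75, 1.0):
            idx = max(int(round(frac * len(positions))) - 1, 0)
            distribution.append(positions[idx] / n * 100 if positions else 0.0)
    return composition + transition + distribution
```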
Jamali-v: Three groups of protein features were calculated: 23 features representing physicochemical properties of the protein, 20 features giving the frequency of each amino acid in the protein sequence, and 400 features giving the frequencies of dipeptides in the protein sequence. In total, 443 dimensions were calculated for Jamali-v.
Word2vec: Word2vec is the name of a family of models trained to produce word embedding vectors. Two model architectures are popular: continuous bag-of-words (CBOW) and continuous skip-gram. The former predicts the current word from a window of surrounding context words, while the latter uses the current word to predict the surrounding window of context words. Hierarchical softmax and negative sampling are the two main training methods: the former uses a Huffman tree to maximize the conditional log-likelihood, while the latter minimizes the log-likelihood of sampled negative instances. The skip-gram model with a window size of 8 and hierarchical softmax was used in this work (illustrated in Figure 4). The three forms of tokens (Figures 1–3) were used separately to train the model. For the single-amino-acid model the embedding dimension was set to 15, and for the overlapped and non-overlapped 3-gram models it was set to 200. We used the word2vec implementation in the gensim Python NLP package (https://radimrehurek.com/gensim/) to train and compute the embedding vectors.
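A minimal gensim sketch of this training setup is given below; how token vectors are pooled into a single protein vector is not specified here, so mean pooling is used as our assumption.

```python
import numpy as np
from gensim.models import Word2Vec

def train_w2v(token_lists, dim):
    """Skip-gram word2vec with hierarchical softmax and window size 8
    (gensim >= 4 keyword names; older versions use `size=` instead of `vector_size=`)."""
    return Word2Vec(sentences=token_lists, vector_size=dim, window=8,
                    sg=1, hs=1, negative=0, min_count=1, workers=4)

def protein_vector(model, tokens):
    """Pool token embeddings into one protein vector (mean pooling is our assumption;
    the pooling scheme is not dictated by the word2vec model itself)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0)

# Example: overlapped 3-grams with 200-dimensional embeddings (15 for the single-amino-acid form).
# corpus = [tokenize_overlapped_3gram(seq) for seq in training_sequences]
# model = train_w2v(corpus, dim=200)
# X = np.vstack([protein_vector(model, toks) for toks in corpus])
```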
Models
Models implemented in scikit-learn (http://scikit-learn.org/stable/) were used to perform the machine learning tasks. The parameters of RF, LR, DT, GBDT and NB were kept at their default values.
SVM: A grid search was performed to choose the parameters. For both the small and the large dataset, the best prediction accuracy was achieved with C equal to 1000 and gamma equal to 0.001.
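The grid search can be sketched with scikit-learn as below; the candidate grid is our assumption, and only the selected optimum (C = 1000, gamma = 0.001) comes from the text.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_svm(X_train, y_train):
    """5-fold grid search over C and gamma for an RBF-kernel SVM."""
    param_grid = {"C": [1, 10, 100, 1000], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}  # illustrative grid
    grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy", n_jobs=-1)
    grid.fit(X_train, y_train)
    return grid.best_estimator_, grid.best_params_   # reported optimum: {'C': 1000, 'gamma': 0.001}
```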
NN: A three-layer NN was constructed, consisting of an input layer, a hidden layer and an output layer. The activation function of the output layer was sigmoid and that of the hidden layer was ReLU. RMSprop was used as the optimization method. A grid search was performed to choose parameters such as the number of neurons in the hidden layer, the number of training epochs and the batch size. The parameter sets that achieved the best prediction accuracy are listed in Table 9.
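A minimal Keras sketch of this architecture is shown below; the hidden-layer size, epochs and batch size are placeholders rather than the optima reported in Table 9, and the binary cross-entropy loss is our assumption, consistent with the sigmoid output for binary classification.

```python
from tensorflow import keras

def build_nn(input_dim, hidden_units):
    """Three-layer feed-forward network: input, one ReLU hidden layer, sigmoid output,
    compiled with RMSprop."""
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(hidden_units, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Hidden-layer size, epochs and batch size were chosen by grid search (Table 9);
# the values below are placeholders, not the reported optima.
# model = build_nn(input_dim=X_train.shape[1], hidden_units=64)
# model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)
```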