INTRODUCTION
With the development of high-throughput sequencing techniques, the number of sequenced proteins has been growing steadily and exponentially [1]. With the deeper understanding brought by the Human Genome Project, many proteins have been identified as crucial members of disease networks. However, transferring these fruitful results into clinical use is not easy, and the number of FDA-approved drug targets has been increasing much more slowly [2]. One of the most critical reasons for the failure of drug discovery pipelines is the choice of wrong protein targets [3]. Computational tools have been widely used in the drug discovery process and have reduced time and cost compared with experiments. For druggable protein prediction, some researchers evaluate druggability by detecting whether a protein has a proper druggable binding pocket; existing software includes CASTp [4], fpocket [5], PockDrug [6], and CAVITY [7]. By NMR experiments, Hajduk et al. identified that only a small number of structural features, such as protein surface polarity, surface complexity and a few pocket descriptors, were enough for druggable protein prediction, achieving an accuracy of 94% with those features [8]. However, heavy dependence on the availability of three-dimensional structures limits the use of these methods, since only about 1% of proteins have solved structures [9]. Mitsopoulos et al. found that the network topology of druggable proteins differs significantly from that of undruggable proteins; using 300 topology parameters, they successfully predicted general druggable proteins and cancer druggable proteins [10]. Since whether a small molecule is a potential drug candidate can be determined approximately and simply from a number of selected physicochemical features [11,12], we believe that druggable proteins may also be predicted from simple physicochemical features or sequence composition statistics. Li et al. developed a 146-dimensional vector to represent a protein sequence; using a support vector machine (SVM), they achieved an accuracy of 84% on only 186 positive samples [13]. The number of approved drug targets has since increased [2]. With the enlarged dataset, Jamali et al. applied a 443-dimensional vector and various machine learning algorithms to determine the best model; a neural network stood out with the best accuracy of 89.78% [14].
In this paper, we tested the performance of different combinations of protein features and machine learning algorithms in druggable protein prediction. FDA-approved small molecules’ targets were used as training samples. Although the 146-dimensional vector used in Ref. [13] (LQL-v) with a neural network (NN) achieved the best training accuracy of 91.10%, overlapped 3-gram word2vec with logistic regression (LR) achieved the best prediction accuracy on the independent test set (89.55%) and on newly approved targets. Furthermore, an enlarged dataset including the targets of small molecules in experiment and under investigation was trained; unfortunately, the best training accuracy was only 75.89%. In addition, we applied our models to predict potential targets as references for future studies. Our study indicates the potential of word2vec in the prediction of druggable proteins, and suggests that the training dataset of druggable proteins should not be extended to targets that lack verification.
RESULTS
We tested the performance of different combinations of protein features and machine learning algorithms in druggable protein prediction. The protein features we chose are listed in Table 1. The machine learning algorithms we chose are NN, SVM, LR, decision tree (DT), gradient boosted decision tree (GBDT), k-nearest neighbors (KNN), random forest (RF) and naïve Bayes (NB).
The small dataset we used was retrieved from the FDA-approved small molecules’ drug targets in DrugBank and was constructed by Jamali et al. [14,15]; it contains 1224 positive samples. We also constructed a large dataset that additionally includes the targets of experimental and clinically investigational small molecules; it contains 5503 positive samples. Negative samples were screened according to the rules in Ref. [14]. For both the small and the large dataset, negative samples were sampled three times, giving three batches of 1217 negative samples for the small dataset and three batches of 5498 negative samples for the large dataset. For an unbiased comparison, independent and external test sets were constructed separately. The details of the protein features, machine learning parameters and dataset construction are described in the Materials and Methods.
Amino acid distributed representation
One of the most critical and basic tasks in Natural Language Processing (NLP) is to learn representative embeddings of words. Some NLP applications, such as topic identification, use the bag-of-words (BOW) model [18]. Although it performs well in some tasks, its application is limited by dimension explosion and its inability to represent semantics. Continuous vector representations were developed to embed the meaning of every word in an n-dimensional space: trained on contexts, words with similar meanings are located close together in the space. Word2vec is one of the most successful algorithms of this kind and has found various applications [19–21].
In protein sequence representation, the AC, CT, LQL-v and Jamali-v encodings used in this work are analogous to BOW models, in that their elements are determined by counting over the sequence according to designed features. Recently, word2vec has also been used to study protein sequences [17,22,23].
We used three forms of input to train the word2vec model (a short tokenization sketch is given after the three forms below). For example, given a protein sequence:
LRQTVKNTVSQV
(1) Single: each single amino acid together with its context was used for training, generating an embedding vector for every amino acid;
(2) Overlapped 3-gram: the original sequence was broken into overlapping 3-mers with a window of size 3 shifted one residue at a time, and training with their contexts gave an embedding vector for every overlapped 3-gram:
LRQ RQT QTV TVK VKN…
(3) Non-overlapped 3-gram: shifting the 3-residue window produced three new token sequences, whose 3-mers were generated in the same way as in (2); after training with their contexts, an embedding vector for every non-overlapped 3-gram was obtained:
LRQ TVK NTV SQV
RQT VKN TVS…
QTV KNT VSQ…
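As an illustration, the three tokenizations can be written as a short Python sketch (function names are ours, not the original code):

```python
def tokenize_single(seq):
    """Form (1): each amino acid is a token."""
    return list(seq)

def tokenize_overlapped_3gram(seq):
    """Form (2): overlapping 3-mers from a window of size 3 shifted one residue at a time."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

def tokenize_nonoverlapped_3gram(seq):
    """Form (3): three reading frames of non-overlapping 3-mers (trailing residues shorter than 3 are dropped)."""
    frames = []
    for offset in range(3):
        shifted = seq[offset:]
        frames.append([shifted[i:i + 3] for i in range(0, len(shifted) - 2, 3)])
    return frames

if __name__ == "__main__":
    seq = "LRQTVKNTVSQV"
    print(tokenize_single(seq))               # ['L', 'R', 'Q', 'T', 'V', 'K', ...]
    print(tokenize_overlapped_3gram(seq))     # ['LRQ', 'RQT', 'QTV', 'TVK', 'VKN', ...]
    print(tokenize_nonoverlapped_3gram(seq))  # [['LRQ', 'TVK', 'NTV', 'SQV'], ['RQT', 'VKN', 'TVS'], ['QTV', 'KNT', 'VSQ']]
```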
The vector space (on the training set) for single amino acids is shown in Figure 1. The colors indicate the hydrophobicity (Figure 1A), polarity (Figure 1B) and van der Waals volume (Figure 1C) of each amino acid. The vector space differs from that in Ref. [23], which is readily explained by the different training datasets and the different goals of the two studies. In both spaces, however, amino acids with similar color (property) tend to cluster together, albeit with some outliers.
Figures 2 and 3 show the vector spaces (on the training set) for overlapped and non-overlapped 3-grams. Comparing the two, the overlapped 3-gram space was more widely distributed, with three clusters formed; a similarly distributed space was also observed in Ref. [17]. The color intensity of the non-overlapped 3-gram space showed a more gradual gradient, which was also present but less obvious in the overlapped 3-gram space.
For comparison, we constructed artificial scrambled spaces by randomly shuffling the labels of the 3-grams in the overlapped and non-overlapped spaces. To quantitatively measure the continuity of the selected properties in protein space, for each property we calculated the ratio of the best Lipschitz constants (the minimum k) (i) between the overlapped space and the random space and (ii) between the non-overlapped space and the random space (see Methods). This ratio illustrates the smoothness of a space and reflects the richness of the information it contains. Table 2 shows that the ratio k between any property in a trained 3-gram space (overlapped or non-overlapped) and the random space was below 1, although not dramatically so, suggesting that both trained 3-gram spaces were smoother than the random space and contained richer information. However, no significant difference was found between the overlapped and non-overlapped spaces.
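As a sketch of the idea, assuming the best Lipschitz constant is taken as the largest ratio |f(x) − f(y)|/||x − y|| over all pairs of 3-gram embeddings x, y with property values f (the exact procedure is given in the Methods and may differ in detail), the ratio against a label-shuffled space could be computed as follows (all names are ours):

```python
import numpy as np
from scipy.spatial.distance import pdist

def best_lipschitz_constant(vectors, prop_values):
    """Smallest k with |f(x) - f(y)| <= k * ||x - y|| over all pairs of points,
    i.e., the maximum pairwise ratio |f(x) - f(y)| / ||x - y||."""
    dist = pdist(np.asarray(vectors, dtype=float))                       # Euclidean ||x - y||
    prop_diff = pdist(np.asarray(prop_values, dtype=float).reshape(-1, 1),
                      metric="cityblock")                                # |f(x) - f(y)|
    mask = dist > 0                                                      # ignore coincident points
    return (prop_diff[mask] / dist[mask]).max()

def lipschitz_ratio(vectors, prop_values, seed=0):
    """Ratio of the best Lipschitz constant in the trained space to that in a
    label-shuffled (random) space; a value below 1 means the trained space is smoother."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(prop_values, dtype=float))
    return best_lipschitz_constant(vectors, prop_values) / \
           best_lipschitz_constant(vectors, shuffled)
```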
Comparison of prediction accuracies for different protein features in combination with different machine learning algorithms
We tested the performance of different combinations of protein features and machine learning algorithms in druggable protein prediction. We randomly separated 90% of the whole dataset as the training set and used the remaining 10% as the independent test set. The 5-fold cross-validation (5-CV) training results are shown in Table 3, and the results of model prediction on the independent test set are shown in Table 4. We use accuracy as the performance measure; the full performance metrics can be found in the Supplementary Materials.
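The evaluation protocol can be sketched with scikit-learn roughly as below; the classifiers use the settings stated in this paper where given (e.g., the grid-searched SVM parameters in the Methods) and library defaults otherwise, and the variable names are ours. The Keras NN (see Methods) is trained separately and is not included in this sketch.

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

def evaluate(X, y, seed=0):
    """X: protein feature matrix (e.g., overlapped 3-gram word2vec vectors),
    y: 1 = druggable (drug target), 0 = negative sample."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.10, stratify=y, random_state=seed)   # 90% training / 10% independent test
    classifiers = {
        "LR": LogisticRegression(max_iter=1000),
        "SVM": SVC(C=1000, gamma=0.001),                        # grid-searched values reported in Methods
        "RF": RandomForestClassifier(),
        "GBDT": GradientBoostingClassifier(),
        "DT": DecisionTreeClassifier(),
        "KNN": KNeighborsClassifier(),
        "NB": GaussianNB(),
    }
    for name, clf in classifiers.items():
        cv_acc = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy")  # 5-CV training accuracy
        test_acc = clf.fit(X_train, y_train).score(X_test, y_test)                 # independent test accuracy
        print(f"{name}: 5-CV = {cv_acc.mean():.4f}, independent test = {test_acc:.4f}")
```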
As for the protein features tested, LQL-v with NN achieved the best training accuracy of 91.93%. Jamali-v with NN achieved a training accuracy of 88.12%, which was comparable to the result in Ref. [14] (we had separated 10% of the training samples as an independent test set, whereas Jamali et al. did not). The overall performance of the word2vec-based features was comparable to LQL-v and Jamali-v; overlapped 3-gram word2vec with NN achieved the second best training accuracy of 90.23%, and overlapped 3-gram word2vec performed better than non-overlapped 3-gram and single-amino-acid word2vec. The training accuracies of the sequence-statistics-based features AC and CT were only around 60%–70%, far worse than the others. In summary, LQL-v, Jamali-v and word2vec performed comparably well with most of the machine learning algorithms. As for the machine learning methods tested, all algorithms except NB, DT and KNN achieved stable performance on the three best protein features analyzed. For prediction on the independent test set, overlapped 3-gram word2vec with LR achieved an accuracy of 89.55%, the best of all combinations. We sampled the negative samples and the independent dataset three times, and the results were approximately the same. We then used LQL-v and overlapped 3-gram word2vec individually to train on the whole small dataset to obtain our final prediction models; overlapped 3-gram word2vec performed better than LQL-v on newly approved targets (Table 5).
Performance of training on the larger dataset
Compared with Li et al.’s result, which was based on the limited number of FDA-approved small molecules’ targets available ten years ago, the accuracy increased substantially from 84% to 91.93% with the same protein features (LQL-v) and machine learning algorithm (SVM). To test whether a larger training dataset could further improve the current results, an enlarged dataset including the targets of investigational and experimental molecules, which increased the training samples two-fold, was built and trained with LQL-v and overlapped 3-gram word2vec using various machine learning methods. Unfortunately, the best prediction accuracy was only 76.44% (Table 6), achieved by LQL-v with GBDT. We then used the models trained on the small dataset to test the enlarged dataset with the training samples excluded: 50%–70% of the targets of investigational and experimental molecules in the enlarged dataset were recognized as positive (Table 7). Of them, 1,587 were predicted as positive by at least 5 models for each feature set (LQL-v and overlapped 3-gram word2vec), which we defined as “druggable”; 856 were predicted as negative by all of the models, which we defined as “undruggable”; and the others were defined as “undefined”. The druggable proteins were tagged with 3256 pathway labels by the Kyoto Encyclopedia of Genes and Genomes (KEGG) [24], and the top 20 pathways are listed in Table 8. From the table, some of the top tags were related to disease pathways, such as metabolic pathways, cancer, neuronal diseases and infectious diseases. The whole list of predictions for the targets of investigational or experimental molecules is supplied in the Supplementary Materials.
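A minimal sketch of the pathway-tagging step using the public KEGG REST interface (https://rest.kegg.jp) is shown below; mapping the UniProt accessions of predicted druggable proteins to KEGG gene identifiers (e.g., "hsa:7157") is assumed to have been done beforehand, and this is our illustration rather than the original pipeline.

```python
from collections import Counter
import urllib.request

KEGG = "https://rest.kegg.jp"

def kegg_get(path):
    """Fetch one KEGG REST endpoint and return its tab-separated lines."""
    with urllib.request.urlopen(f"{KEGG}/{path}") as resp:
        return [line.split("\t") for line in resp.read().decode().splitlines() if line]

# Human pathway id -> name, e.g., 'hsa04110' -> 'Cell cycle - Homo sapiens (human)'.
pathway_names = {pid.replace("path:", ""): name for pid, name in kegg_get("list/pathway/hsa")}

def top_pathways(kegg_gene_ids, n=20):
    """Count pathway labels over KEGG gene ids of predicted druggable proteins
    (one request per gene; batching is omitted for brevity) and return the n most frequent."""
    counts = Counter()
    for gene in kegg_gene_ids:
        for _, pid in kegg_get(f"link/pathway/{gene}"):
            counts[pid.replace("path:", "")] += 1
    return [(pid, pathway_names.get(pid, pid), c) for pid, c in counts.most_common(n)]
```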
DISCUSSION
In this paper, we tested the performance of different combinations of protein features and machine learning algorithms in druggable protein prediction. Just as word representations play a fundamental role in NLP, protein features or protein encodings are critical and basic in protein function prediction. AC and CT, which are purely sequence-statistics-based features, have been successfully applied in protein subcellular localization prediction [25], protein-protein interaction prediction [15,16], structural class prediction [26,27], etc., but failed in druggable protein prediction. The reason might be that this task requires more physicochemical features of amino acids. In contrast, the prediction accuracies of LQL-v and Jamali-v, which include physicochemical features, were far better. Although word2vec features are purely sequence based, they performed comparably to LQL-v and Jamali-v. The reason might be that protein properties were implicitly embedded in the vectors learned from sequence contexts. This was partly confirmed by the ratio of best Lipschitz constants (the minimum k): compared with the random space, the 3-gram spaces were smoother and therefore richer in information. As for the three forms of word2vec, overlapped 3-gram vectors were distributed more widely in the space and achieved comparably better prediction accuracy. It will be interesting to study the relationship between vector distribution and prediction accuracy.
The number of FDA-approved small molecules’ targets has increased compared with ten years ago, and the same model trained with this enlarged dataset achieved an accuracy improvement of around 5%, indicating that the model benefited from more training samples. However, adding the targets of small molecules in experiment or under investigation, which further enlarged the training samples two-fold, decreased the prediction accuracy to a much lower level. This may suggest that many of the targets under research are difficult for small molecules to target and should not be included in the positive training set, which is partially indicated by the fact that only 50%–70% of the positive samples in the enlarged dataset were recognized as druggable targets. Although the concept of a drug target is ever changing with emerging strategies such as proteolysis-targeting chimeras (PROTACs) [28], so that proteins previously regarded as undruggable may become druggable, a substantial part of the protein targets in experiments and clinical investigations are predicted as undruggable by our models. These results might help researchers to recheck their targets under investigation.
CONCLUSION
We tested the performance of different combinations of protein features and machine learning algorithms in druggable protein prediction. Although the 146-dimensional vector used by Li et al. (LQL-v) with a neural network achieved the best training accuracy of 91.10%, overlapped 3-gram word2vec with logistic regression achieved the best prediction accuracy on the independent test set (89.55%) and on newly approved targets. Although word2vec features do not explicitly include amino acid physicochemical properties, they performed comparably to the other, carefully designed features. Furthermore, an enlarged dataset including the targets of small molecules in experiment and under investigation was trained, and we showed that model prediction accuracy did not benefit from this enlargement. In addition, we applied our models to predict probable targets as references for future studies, and most of the predicted targets were disease-related.
MATERIALS AND METHODS
Dataset
Small dataset: The small dataset was created by Jamali et al. The positive samples were FDA-approved small molecules’ targets retrieved from DrugBank. After removal of sequences that contained rare amino acids and sequences that could not be encoded into any of the protein features we used, 1209 sequences remained.
Enlarged dataset: The enlarged dataset was created by us. The positive samples included not only the FDA-approved small molecules’ targets but also the targets of experimental or investigational small molecules in DrugBank and the Therapeutic Target Database (TTD). A total of 5503 positive samples were included.
Negative samples: Negative samples were screened by removing from Swiss-Prot the protein sequences that were (i) among the positive samples; (ii) classified in DrugBank or TTD as targets of experimental or investigational molecules; or (iii) in the same families as the proteins in (i) and (ii). Details can be found in Ref. [14]. We sampled negative samples three times because (i) the number of screened negative samples exceeded the number of positive samples, so they had to be subsampled to approximately equal size, and (ii) repeated sampling gives more solid evidence that one particular combination of protein features and machine learning algorithm performs better than the others. Three batches of 1235 negative samples for the small dataset and three batches of 5498 negative samples for the large dataset were constructed.
Independent dataset: We sampled 90% of the dataset to train the models and used the remaining 10% as the independent test set.
Newly approved external dataset: We downloaded the sequences from DrugBank (https://www.drugbank.ca) that were tagged as “approved” (updated 2018-04-02). Sequences identical to those in the small dataset (training set) were excluded, leaving 1,419 sequences.
Protein features
Auto-covariance (AC): The protein sequence was transformed by the following equation:

$$AC(lag, j) = \frac{1}{n - lag}\sum_{i=1}^{n - lag}\left(P_{i,j} - \frac{1}{n}\sum_{i=1}^{n} P_{i,j}\right)\left(P_{i+lag,\,j} - \frac{1}{n}\sum_{i=1}^{n} P_{i,j}\right)$$

where j refers to the j-th descriptor, i is the position in the protein sequence X, P_{i,j} is the normalized j-th descriptor value of the i-th amino acid, n is the length of the protein sequence X, and lag is the value of the lag. In this way, proteins with variable lengths could be coded into vectors of equal length (j × lag).
In this study, the descriptors were seven physicochemical properties (hydrophobicity, hydrophilicity, net charge index of side chains, polarity, polarizability, solvent accessible surface area, and volume of side chains). Guo et al. selected a value of 30 for the lag and we used the same value. Consequently, the vector contained 210 numbers (7 × 30) [15].
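A minimal sketch of the AC encoding following the formulation above (descriptor scales are assumed to be supplied already normalized; all names are ours):

```python
import numpy as np

def auto_covariance(seq, descriptors, lag_max=30):
    """Auto-covariance encoding: for each descriptor and each lag in 1..lag_max,
    the covariance between descriptor values of residues lag positions apart.
    `descriptors` is a list of dicts mapping amino acid -> normalized descriptor value."""
    n = len(seq)
    features = []
    for scale in descriptors:
        values = np.array([scale[aa] for aa in seq], dtype=float)
        mean = values.mean()                                  # (1/n) * sum_i P_{i,j}
        for lag in range(1, lag_max + 1):
            ac = np.mean((values[: n - lag] - mean) * (values[lag:] - mean))
            features.append(ac)
    return np.array(features)                                 # length = len(descriptors) * lag_max (7 * 30 = 210)
```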
Conjoint Triad (CT): First, all 20 amino acids were clustered into seven groups according to their dipole moments and side chain volumes. Next, each amino acid of a protein sequence was replaced by its cluster number; in the example used here, the protein sequence was replaced by the cluster-number string 3562142411. Then, a window of three residues was slid across the whole sequence one step at a time from the N-terminus to the C-terminus, and the frequency of each combination of three cluster numbers was counted. The protein P was thereby represented by a vector of 343 numbers, all of which are zero except for f276 (triad 356), f89 (562), f13 (621), f149 (214), f71 (142), f158 (424), f23 (241) and f4 (411).
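A minimal sketch of the CT encoding; the seven-group assignment below is the clustering commonly used in the conjoint triad literature and the triad ordering is arbitrary but fixed, so details may differ from the original implementation:

```python
from itertools import product

# Illustrative amino acid -> cluster assignment (seven groups by dipole moment and side chain volume).
CT_GROUPS = {
    "A": 1, "G": 1, "V": 1,
    "I": 2, "L": 2, "F": 2, "P": 2,
    "Y": 3, "M": 3, "T": 3, "S": 3,
    "H": 4, "N": 4, "Q": 4, "W": 4,
    "R": 5, "K": 5,
    "D": 6, "E": 6,
    "C": 7,
}

def conjoint_triad(seq):
    """Encode a protein sequence as 343 (= 7^3) triad frequencies by sliding a
    3-residue window one step at a time over the cluster-number string."""
    clusters = [CT_GROUPS[aa] for aa in seq]
    freq = {t: 0 for t in product(range(1, 8), repeat=3)}
    for i in range(len(clusters) - 2):
        freq[tuple(clusters[i:i + 3])] += 1
    return [freq[t] for t in sorted(freq)]   # fixed ordering -> 343-dimensional vector
```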
LQL-v: The composition of the 20 amino acids forms the first 20 dimensions. The amino acids were then clustered into 3 classes for each of six physicochemical properties (including hydrophobicity, polarity, polarizability, solvent accessibility and normalized van der Waals volume), and the composition, transition and distribution of the classes were calculated for each property. For example, for the hydrophobicity property the amino acids were clustered into polar, neutral and hydrophobic classes. For “Composition”, 3 dimensions were calculated: the percentages of polar, neutral and hydrophobic residues. For “Transition”, 3 dimensions were calculated: the percentages of transitions between polar and neutral, between neutral and hydrophobic, and between hydrophobic and polar residues. For “Distribution”, 5 dimensions were calculated for each of the 3 classes: the relative sequence positions at which the first residue and 25%, 50%, 75% and 100% of the residues of that class are located. In total, a 146-dimensional vector was calculated for LQL-v. A sketch of this encoding for one attribute is given below.
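The sketch below covers the composition/transition/distribution block for a single attribute (hydrophobicity), using a commonly cited three-class grouping; the exact class definitions and boundary conventions of LQL-v may differ, so this is an assumption-laden illustration rather than the original implementation.

```python
# Illustrative three-class grouping for the hydrophobicity attribute
# (polar / neutral / hydrophobic); groupings for the other attributes are analogous.
HYDROPHOBICITY = {"polar": set("RKEDQN"), "neutral": set("GASTPHY"), "hydrophobic": set("CLVIMFW")}

def ctd_one_property(seq, groups=HYDROPHOBICITY):
    """Composition (3), Transition (3) and Distribution (3 x 5) descriptors for one
    physicochemical attribute, 21 values in total."""
    labels = [next(name for name, aas in groups.items() if aa in aas) for aa in seq]
    n = len(labels)
    names = list(groups)
    # Composition: fraction of residues in each class.
    composition = [labels.count(c) / n for c in names]
    # Transition: fraction of adjacent residue pairs switching between two given classes.
    pairs = list(zip(labels, labels[1:]))
    transition = [sum(p in {(a, b), (b, a)} for p in pairs) / (n - 1)
                  for a, b in [(names[0], names[1]), (names[1], names[2]), (names[2], names[0])]]
    # Distribution: relative sequence position of the 1st, 25%, 50%, 75% and 100% residue of each class.
    distribution = []
    for c in names:
        positions = [i + 1 for i, lab in enumerate(labels) if lab == c]
        for frac in (0.0, 0.25, 0.50, 0.75, 1.0):
            idx = max(int(round(frac * len(positions))) - 1, 0)
            distribution.append(positions[idx] / n * 100 if positions else 0.0)
    return composition + transition + distribution
```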
Jamali-v: Three groups of protein features were calculated: 23 features representing physicochemical properties of the protein, 20 features giving the frequency of each amino acid in the protein sequence, and 400 features giving the frequencies of dipeptides in the protein sequence. In total, 443 dimensions were calculated for Jamali-v.
Word2vec: Word2vec is the name of a family of models trained to produce word embedding vectors. Two model architectures are popular: continuous bag-of-words (CBOW) and continuous skip-gram. The former predicts the current word from a window of surrounding context words, while the latter uses the current word to predict the surrounding window of context words. Hierarchical softmax and negative sampling are the two main training methods: the former uses a Huffman tree to maximize the conditional log-likelihood, while the latter minimizes the log-likelihood of sampled negative instances. The skip-gram model with a window size of 8 and hierarchical softmax was used in this work (illustrated in Figure 4). The three forms of tokens (Figures 1–3) were used separately to train the model. For the single-amino-acid model the embedding dimension was set to 15, and for the overlapped and non-overlapped 3-gram models it was set to 200. We used the word2vec implementation in the gensim Python NLP package (https://radimrehurek.com/gensim/) to train and compute the embedding vectors.
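A minimal gensim sketch of this training setup is given below; how token vectors are pooled into a single protein vector is not specified here, so mean pooling is used as our assumption.

```python
import numpy as np
from gensim.models import Word2Vec

def train_w2v(token_lists, dim):
    """Skip-gram word2vec with hierarchical softmax and window size 8
    (gensim >= 4 keyword names; older versions use `size=` instead of `vector_size=`)."""
    return Word2Vec(sentences=token_lists, vector_size=dim, window=8,
                    sg=1, hs=1, negative=0, min_count=1, workers=4)

def protein_vector(model, tokens):
    """Pool token embeddings into one protein vector (mean pooling is our assumption;
    the pooling scheme is not dictated by the word2vec model itself)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0)

# Example: overlapped 3-grams with 200-dimensional embeddings (15 for the single-amino-acid form).
# corpus = [tokenize_overlapped_3gram(seq) for seq in training_sequences]
# model = train_w2v(corpus, dim=200)
# X = np.vstack([protein_vector(model, toks) for toks in corpus])
```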
Models
Models implemented in scikit-learn (http://scikit-learn.org/stable/) were used to perform the machine learning tasks. The parameters of RF, LR, DT, GBDT and NB were kept at their default values.
SVM: A grid search was performed to choose the parameters. For both the small and the large dataset, the best prediction accuracy was achieved with C equal to 1000 and gamma equal to 0.001.
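The grid search can be sketched with scikit-learn as below; the candidate grid is our assumption, and only the selected optimum (C = 1000, gamma = 0.001) comes from the text.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_svm(X_train, y_train):
    """5-fold grid search over C and gamma for an RBF-kernel SVM."""
    param_grid = {"C": [1, 10, 100, 1000], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}  # illustrative grid
    grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy", n_jobs=-1)
    grid.fit(X_train, y_train)
    return grid.best_estimator_, grid.best_params_   # reported optimum: {'C': 1000, 'gamma': 0.001}
```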
NN: A three-layer NN was constructed, consisting of an input layer, a hidden layer and an output layer. The activation function of the output layer was sigmoid and that of the hidden layer was ReLU. RMSprop was used as the optimization method. A grid search was performed to choose parameters such as the number of neurons in the hidden layer, the number of training epochs and the batch size. The parameter sets that achieved the best prediction accuracy are listed in Table 9.
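A minimal Keras sketch of this architecture is shown below; the hidden-layer size, epochs and batch size are placeholders rather than the optima reported in Table 9, and the binary cross-entropy loss is our assumption, consistent with the sigmoid output for binary classification.

```python
from tensorflow import keras

def build_nn(input_dim, hidden_units):
    """Three-layer feed-forward network: input, one ReLU hidden layer, sigmoid output,
    compiled with RMSprop."""
    model = keras.Sequential([
        keras.layers.Input(shape=(input_dim,)),
        keras.layers.Dense(hidden_units, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Hidden-layer size, epochs and batch size were chosen by grid search (Table 9);
# the values below are placeholders, not the reported optima.
# model = build_nn(input_dim=X_train.shape[1], hidden_units=64)
# model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)
```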