1 INTRODUCTION
Drug discovery is an expensive, difficult, and drawn-out process. Depending on the disease and treatment modality, the cost of developing an approved drug has increased from $300 million to $2.8 billion over the past decades [
1]. Virtual screening (VS), a typical computational technique, has been widely applied in the early stages of drug discovery to accelerate lead discovery [
2]. Docking-based virtual screening (DockVS) uses the three-dimensional structure of a target protein to computationally place and assess small molecules, one at a time, at the binding site. By iteratively refining the binding poses of the molecules, molecular docking can explicitly suggest new compounds with high docking scores.
The currently available molecular docking programs, including their release years, developing organizations, descriptions, and licenses, are summarized in Supplementary Table S1 [
3–
19]. Because their scoring functions and sampling techniques vary greatly, different programs may produce different docking results for a given protein target. In real applications, it is fairly common to use two docking programs sequentially in molecular virtual screening, following a well-known funnel-like workflow [
20–
31]. In the sequential docking workflow, the first docking stage is typically fast but of relatively low precision, while the second stage is more precise but slower, so a balance between efficiency and docking quality is reached. After this filter, more accurate but slower evaluations, together with human visual inspection, are applied to choose compounds for experimental verification.
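The funnel-like workflow described above can be sketched in a few lines of Python. The two scoring functions below are toy stand-ins for a fast first-stage program and a slower, more precise second-stage program, not real docking code; all names and data are illustrative.

```python
# Toy sketch of a funnel-like sequential docking workflow: a fast,
# lower-precision scoring pass prunes the library, then a slower,
# more precise pass rescores only the survivors.

def fast_score(smiles: str) -> float:
    # stand-in for a fast first-stage docking program
    return float(len(smiles) % 7)

def precise_score(smiles: str) -> float:
    # stand-in for a slower, more accurate second-stage program
    return float(len(smiles) % 5)

def funnel_screen(library, keep_fraction=0.2):
    """Keep the best-scoring fraction after each stage (lower = better here)."""
    ranked = sorted(library, key=fast_score)
    survivors = ranked[: max(1, int(len(ranked) * keep_fraction))]
    reranked = sorted(survivors, key=precise_score)
    return reranked[: max(1, int(len(reranked) * keep_fraction))]

library = [f"C{'C' * i}O" for i in range(100)]  # dummy SMILES-like strings
hits = funnel_screen(library)
print(len(hits))  # 4: two successive 20% cuts keep 4% of the library
```

The two successive cuts are what motivates the 20% x 20% = 4% selection rate discussed later for DL-DockVS.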
Recently, it has been reported that docking based VS processes on ultra-large databases can significantly increase the success rate of finding active hits [
32]. Meanwhile, the size of compound databases has also increased significantly. For example, the widely used ZINC database has grown from 700 thousand compounds in 2005 to over 1.3 billion in 2019 [
33–
35]. By using parallel distributed cloud computing capabilities on the Apache Spark engine, Capuccini
et al. [
36] carried out the VS procedure on a large-scale compound library and achieved good parallel efficiency. However, only a tiny fraction of the compounds will have high docking scores, while docking the great majority of low-score compounds consumes most of the computational resources. If researchers could identify high-score compounds in advance using trustworthy machine learning methods, the computational resources required for docking could be significantly reduced.
Gentile
et al. [
37,
38] developed a deep learning (DL) classification model named deep docking (DD), based on the docking scores of a subset of compounds, to speed up the prediction of the remaining compounds in an ultra-large library of a billion compounds (ZINC-15). Several active learning-based workflows have recently been proposed [
39–
43]. Yang
et al. applied an active learning process with DOCK3.7 and Glide SP. Starting from a 0.1% random subset of the total library, they iteratively updated the DL models with an active learning strategy. This VS process could recover the majority of the top-ranking compounds; however, it still needs to dock about 5% of the total compounds, a greater computational burden than DD, which docks only 1%. As suggested by Graff
et al. [
44], pruning the search space is necessary to accelerate the whole screening process. It is therefore highly valuable to develop an accelerated VS approach that reduces the computational burden of docking low-score compounds in real-world scenarios. A scoring function is often used to estimate the potential binding affinity of a given compound against a target of interest, and it is intuitive and straightforward to build regression models to fit the affinity scores produced by a given docking program. There has been a pronounced trend toward machine learning- and deep learning-based scoring functions [
45,
46]. These methods have improved both the speed and accuracy of prediction. However, in many real-world cases scoring functions do not correlate well with binding affinities, and a significant remaining obstacle is how effectively a single scoring function can discriminate true actives from numerous inactives [
45].
To improve the reliability of docking assessments, multiple docking programs were used in this study. Considering the fundamental principles of docking programs and the potential advantages of deep learning, we developed a new rapid and practical VS strategy, called DL-DockVS, which consists of two sets of deep learning models (DL-DockVS-R regression models and DL-DockVS-C classification models) in tandem, mimicking the funnel-like VS procedure that eliminates many decoy compounds. We used the output scores of two docking programs to construct the DL-DockVS-R and DL-DockVS-C models. The rationale for pairing a percentile-regression model with a classification model is as follows. Because different scoring functions use different scales, it is hard to set a single threshold value on the raw scores; the ranking order, rather than the absolute docking scores, is what matters for the first filtration, and this ranking is the regression task of the DL-DockVS-R models, which can easily be used to predict and select top-ranking compounds from an ultra-large library. The top-ranking compounds are then passed to the DL-DockVS-C models, which assign potential active labels to candidates chosen for further evaluation. In comprehensive evaluations, the DL-DockVS approach showed a decent capacity to find known active compounds for 10 protein targets, with active enrichment factors ranging from 5 to 17. Further analysis on the target BRAF showed that DL-DockVS has a high screening speed. These findings demonstrate that this approach can confidently and swiftly screen out molecules with low docking scores, which are probably inactive, and that it serves as a useful filter in routine VS processes on ultra-large chemical libraries.
2 RESULTS AND DISCUSSION
2.1 DL-DockVS model performance
To reflect the consistency of the models' performance and their generalizability, we summarized the training statistical metrics of the models for the 10 selected targets in Supplementary Fig. S1. The area under the curve (AUC) and root mean squared error (RMSE) on the test sets of the DL-DockVS models for each target are presented in Tab.1. For the DL-DockVS-R models, the RMSE values range from 0.11 to 0.14, indicating an acceptable error in ranking prediction. For the DL-DockVS-C models, the AUC values range from 0.89 to 0.96, indicating an excellent ability to distinguish positive data (the top 20% of compounds ranked by Dock-2) from negative data (the remaining compounds).
2.2 Evaluations of DL-DockVS-R models
2.2.1 Model evaluation on external test sets
Considering that there may be topological bias in the DUD-E dataset, we did not incorporate these compounds into the training set, and used them only as an external test set [47]. The DUD-E actives/decoys were docked with VinaLC, and compounds whose docking scores fell within the top 20% percentile of Training Set1 (see Methods) were selected. These compounds were then compared with those selected by the DL-DockVS-R models; the results are listed in Tab.2. Among the 10 targets, all except ACE showed consistency rates between VinaLC docking and the DL-DockVS-R models above 75.0%, with the highest rate of 98% for CDK2. These results suggest that the DL-DockVS-R models are good at learning the top-20% filtering process and achieve acceptable performance compared with the corresponding docking methods.
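The consistency rate reported in Tab.2 can be understood as the overlap between the top-20% set chosen by real docking and the set chosen by the model. A minimal sketch, using toy data and hypothetical compound names rather than the paper's actual results:

```python
# Consistency rate between two top-20% selections: one from real docking
# scores, one from the model's predicted ranking percentiles.

def top_fraction(scores: dict, fraction=0.2, higher_is_better=True):
    """Return the set of compounds in the top fraction by score."""
    ranked = sorted(scores, key=scores.get, reverse=higher_is_better)
    return set(ranked[: int(len(ranked) * fraction)])

def consistency_rate(docking_scores, predicted_scores, fraction=0.2):
    docked = top_fraction(docking_scores, fraction)
    predicted = top_fraction(predicted_scores, fraction)
    return len(docked & predicted) / len(docked)

docking = {f"cpd{i}": -i for i in range(50)}           # toy: cpd0 ranks best
model = {f"cpd{i}": 1 - i / 50 for i in range(50)}     # toy: perfectly consistent
print(consistency_rate(docking, model))  # 1.0
```

With real data the two rankings disagree to some degree, giving the 75%-98% rates reported above.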
2.2.2 Model evaluation on the ChEMBL subset
We used the DL-DockVS-R models to predict the ranking percentage of the compounds in the ChEMBL subset. The compounds with predicted top-20% rankings were then docked to their targets using VinaLC to test their real score distributions. Fig.1 and Supplementary Fig. S2 show the VinaLC docking-score distributions of the selected ChEMBL compounds, the whole ChemDiv dataset, and the top 20% ranked compounds for each protein target. In Fig.1, most of the selected ChEMBL compounds fall in the top-20% region of the ChemDiv subset, suggesting that our DL-DockVS-R models have a strong ability to enrich compounds with high docking scores. The models can therefore quickly single out compounds with potentially high docking scores for specific targets without docking the whole dataset.
2.3 Evaluation of DL-DockVS-C models on external test sets
In the previous section, the DUD-E actives/decoys were docked with VinaLC, and those within the top 20% ranking percentile of Training Set1 were selected and further docked with rDock. Compounds with docking scores within the top 20% percentile of Training Set2 were labeled as positives, while the bottom 20% were labeled as negatives. These compounds were then predicted by the DL-DockVS-C models. Tab.3 shows the statistical results of the various metrics on the DUD-E external data set. Except for the target DPP4, the accuracies of the other 9 targets are above 70%. The specificity values of 8 targets are above 72%, the exceptions being DPP4 and BRAF. As for the true positive rates (TPRs), the maximum and minimum values were 89.5% for BRAF and 27.5% for JAK2. The performance of DL-DockVS-C clearly differs across protein targets. Generally, the DL-DockVS-C models show good accuracy and specificity on the external test sets from DUD-E.
2.4 Performance of DL-DockVS models on retrieving active compounds
To further verify whether the DL-DockVS models can identify active compounds towards specific targets, active compounds for the 10 targets were collected from ChEMBL and filtered according to a set of activity thresholds. For instance, a compound with an IC50/EC50 value below 50 μM is labeled as a positive sample, and otherwise as negative.
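The labeling rule above amounts to a simple threshold. A minimal sketch, assuming all activity values are expressed in μM (the helper name is ours, not from the paper):

```python
# Label a compound active (1) when its IC50/EC50 is below the cutoff,
# inactive (0) otherwise. Units are assumed to be consistent (uM here).

def label_active(ic50_um: float, cutoff_um: float = 50.0) -> int:
    return int(ic50_um < cutoff_um)

print(label_active(12.5), label_active(80.0))  # 1 0
```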
To validate the ability to recall actives, about 1.9 million molecules from the ChEMBL database were first predicted with the DL-DockVS-R models, and the compounds with predicted top-20% rankings were then predicted by the DL-DockVS-C models. The results in Tab.4 show the percentages of retrieved active compounds under different activity cutoffs. The statistical results indicate that most of the models can identify over 50% of the actives for their targets; for example, the recall percentages for ADRB1, BRAF, EGFR, JAK2, LCK, and VGFR2 were 65.78%, 79.30%, 68.63%, 51.88%, 74.61%, and 52.72%, respectively. In other words, most of the active compounds could be recalled by our DL-DockVS models for most targets.
The final number of compounds with a positive label was about 4% of the total compounds predicted, meaning that our models' behavior mimics the step-by-step docking process (20% × 20%, as in the real process). We therefore denote the enrichment factor as EF0.04. The summarized results for the 10 targets are listed in Tab.5: EF0.04 ranges from 5.86 (CDK2) to 17.03 (ADRB1), revealing that the DL-DockVS models have a good ability to enrich active compounds.
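The enrichment factor EF0.04 used above is the hit rate among the ~4% of compounds labeled positive, divided by the overall hit rate in the library. A minimal sketch with made-up counts, not the paper's data:

```python
# Enrichment factor: active rate in the selected subset divided by the
# active rate in the whole library.

def enrichment_factor(n_actives_selected, n_selected, n_actives_total, n_total):
    hit_rate_selected = n_actives_selected / n_selected
    hit_rate_overall = n_actives_total / n_total
    return hit_rate_selected / hit_rate_overall

# e.g. 68 of 4,000 selected compounds active, 100 of 100,000 overall
print(round(enrichment_factor(68, 4000, 100, 100000), 2))  # 17.0
```

An EF of 17 means the selected 4% subset is 17 times richer in actives than random picking, matching the upper end of the range reported in Tab.5.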
2.5 Running time comparison
To demonstrate the speed advantage of DL-DockVS, the target BRAF was selected as a case study, using the drug-like molecules (981,247,974 compounds) from ZINC-15. The statistical results in Tab.6 show that, with the available computing resources, the prediction speed of DL-DockVS is greatly improved over traditional docking processes, with an average speed-up of about 2000 times. DL-DockVS can quickly filter out low-score compounds against specific targets, while the compounds with high docking scores (top 4%) are retrieved at an acceptable rate.
In the DD protocol [
38], DD needed to dock 1% of the molecules while retrieving 90% of the best-scoring structures. The DL-DockVS approach shows similar performance in the example in Tab.6: it weeded out 99.67% of the low-score compounds for the BRAF target from the whole dataset.
Such a high clearance rate suggests that, in ultra-large-library virtual screening, the chance of obtaining more tightly binding, high-scoring compounds comes at the computational cost of docking ever more low-scoring ones. Based on the results in Tab.6, for example, when screening the 10-20 million purchasable compounds available worldwide [48], about 63-126 thousand compounds would be selected by DL-DockVS for further evaluation, an order of magnitude well within the capability of normal docking computation.
3 DISCUSSION
Due to the availability and accessibility of docking software, we tested only the combination of VinaLC and rDock, which tend to be orthogonal to each other. Conceptually, this approach should generalize to other combinations of docking programs and is flexible enough to accommodate the different requirements of various projects.
To explore a reasonable threshold for preparing the training dataset, the DUD-E active compounds were used for a statistical analysis of the relationship between the enrichment ratio of active compounds and the top-percentile cutoff of the two docking programs for each of the 10 selected targets. Representative results for 4 targets are shown in Fig.2; the others are shown in Supplementary Fig. S3. The x-axis represents the percentile cutoff of the compounds ranked by docking score, and the y-axis represents the proportion of DUD-E active compounds enriched for the corresponding target. For most targets, the top 20% of compounds docked by the two docking programs enriches about 40% of the DUD-E active compounds; the top 20% cutoff was therefore selected for preparing the training dataset. Overall, the active-compound enrichment ability of rDock is better than that of VinaLC.
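The curves behind this analysis plot, for each top-x% cutoff of the docking-score ranking, the fraction of known actives recovered. A toy sketch of that computation, with synthetic scores rather than the paper's data:

```python
# Enrichment curve: fraction of known actives recovered within each
# top-x% cutoff of the docking-score ranking (lower score = better).

def enrichment_curve(scores, actives, cutoffs):
    """scores: {compound: docking score}; actives: set of compound names."""
    ranked = sorted(scores, key=scores.get)          # best score first
    recovered = []
    for c in cutoffs:
        top = set(ranked[: int(len(ranked) * c)])
        recovered.append(len(top & actives) / len(actives))
    return recovered

scores = {f"m{i}": float(i) for i in range(100)}     # m0 has the best score
actives = {"m0", "m1", "m2", "m3", "m50"}            # 4 of 5 actives score well
print(enrichment_curve(scores, actives, [0.05, 0.2]))  # [0.8, 0.8]
```

Sweeping the cutoffs from 0 to 1 traces out a curve like those in Fig.2; the value at x = 0.2 is what motivated the top-20% threshold.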
To obtain a higher enrichment ratio, we further explored the ordering of the two docking programs. The enrichment ratios of the different orderings are summarized in Fig.2. The red curve represents the enrichment ratio for the corresponding target using rDock alone, the blue curve that of VinaLC alone, the green curve the re-docking of the top 20% of compounds with rDock after docking with VinaLC, and the orange curve the reverse order, rDock first and then VinaLC.
As shown in the corresponding panels of Fig.2, when the performances of the two programs are relatively similar for a given target, the combination of the two docking programs outperforms either single program. Nevertheless, in common circumstances a well-performing program may be limited in use for various reasons and cannot be deployed at scale, so only a slightly weaker program is available for large-scale screening. For the remaining panels of Fig.2, the end points of the red lines are higher than those of the blue lines, indicating that the performances of the two docking programs differ considerably and that rDock achieves the better enrichment. Because the two targets docked by VinaLC had low enrichment ratios of DUD-E active compounds, VinaLC is the step that limits the enrichment ratio, and the serial use of VinaLC and rDock does not lead to an improvement in the final performance. In this instance, the green line can be regarded as simulating the practical strategy of using a fast program with a slightly lower enrichment ratio (such as VinaLC) for rough screening in the first step, and then using the program with the higher enrichment ratio (such as rDock) in the second step. The end points of the green lines are higher than those of the blue lines, illustrating that this strategy improves the active-enrichment capability compared with the single weaker program (blue lines); however, some loss of active enrichment is unavoidable compared with the better-performing program alone (red lines).
4 CONCLUSION
In this work, we demonstrated a fast and reusable approach, DL-DockVS, which mimics the funnel-like screening process with deep learning models. We implemented it using the docking results for specific targets on a commercially purchasable ChemDiv subset with two open-source docking programs (VinaLC and rDock). Various statistical results showed that the models perform well in the VS process. Moreover, compared with the traditional VS process, DL-DockVS can realize virtual screening of an ultra-large compound library with a large improvement in runtime. The well-trained models can easily be extended to other chemical libraries. With the growing application of molecular generative models in drug development in recent years, we believe DL-DockVS could serve as a rapid scoring metric for deep generative models, efficiently producing diverse compounds with high docking scores towards a specific protein target.
5 METHODS
5.1 Model construction and usages
The workflow of the DL-DockVS construction and usage is illustrated in Fig.3. For a given protein target, two supervised training processes are implemented. In the first, the compounds from the training library (N compounds) are docked onto the target with Dock-1, the first docking program, and all the docking results form the first training set (Training Set1) for this target. The compounds with the top 20% ranking in the first docking results are selected for the second docking process, and those results form the second training set (Training Set2). We then label the training sets according to the docking results. For the regression task, the ranking percentage values of the compounds in Training Set1 are used as ground-truth labels. For the classification task, the top 20% ranked compounds within Training Set2 are used as positive samples, and the remaining compounds as negative samples. The regression models (DL-DockVS-R) and the classification models (DL-DockVS-C) are trained on Training Set1 and Training Set2, respectively. After evaluation, the models can be used for ultra-large library filtering: the DL-DockVS-R models are first used to select the compounds predicted to rank within the top 20%, and these compounds are then predicted by the DL-DockVS-C models. The compounds with positive labels can be used for further examination.
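The label construction described above can be sketched as follows, assuming that a lower docking score indicates a better pose; the scores are synthetic and the helper names are ours.

```python
# Build the two kinds of training labels from docking scores:
# - regression labels: each compound's ranking percentage (0 = best score)
# - classification labels: 1 for the top 20% by score, else 0

def ranking_percentage(scores: dict) -> dict:
    """Map each compound to its rank percentile (lower score = better)."""
    ranked = sorted(scores, key=scores.get)
    n = len(ranked)
    return {cpd: i / n for i, cpd in enumerate(ranked)}

def top_fraction_labels(scores: dict, fraction=0.2) -> dict:
    """Binary labels: 1 for the top fraction by score, else 0."""
    pct = ranking_percentage(scores)
    return {cpd: int(p < fraction) for cpd, p in pct.items()}

set1_scores = {f"c{i}": float(i) for i in range(10)}  # toy Dock-1 scores
reg_labels = ranking_percentage(set1_scores)          # DL-DockVS-R targets
cls_labels = top_fraction_labels(set1_scores)         # DL-DockVS-C-style targets
print(reg_labels["c0"], cls_labels["c0"], cls_labels["c9"])  # 0.0 1 0
```

In the actual pipeline the classification labels come from the second docking program's scores on Training Set2, but the thresholding logic is the same.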
5.2 Molecular docking programs and parameter settings
We summarized available molecular docking programs in Supplementary Table S1. Due to intellectual property issues, two open-source and freely available docking programs, namely VinaLC [
49,
50] (a Vina-based implementation for large datasets) and rDock [
16], were selected in this research. VinaLC is based on an empirical scoring function developed by the Molecular Graphics Laboratory (MGL Tools). Compared to AutoDock 4.0, VinaLC improves the average accuracy of binding-mode prediction and accelerates the search by using a simpler scoring function. rDock is a fast and versatile program for docking molecules against both proteins and nucleic acids. Both programs can be installed and run on computing clusters with many CPU cores, making it possible to complete VS procedures within a few days.
In our cases, the protein structures were preprocessed using ADTools, and the small molecules were prepared using OpenBabel [
51]. For the parameters used in Vina, the grid center was defined as the native ligand center in the crystal structure, and a cubic grid was used. For rDock, the rbcavity program was used to determine the grid parameters from the native ligand with default settings. VinaLC was selected as the first docking program (Dock-1), and rDock as the second docking program (Dock-2). The docking scores of Dock-1 and Dock-2 were saved as CSV-format files (Additional file 3).
5.3 Deep learning model construction
Chemprop (GitHub: chemprop/chemprop) is an open-source, deep learning-based toolkit for molecular property modeling and prediction [52]. It was used to develop our DL-DockVS models. Its architecture is based on graph convolutional networks, which treat a molecule as an attributed graph with node features (atoms) and edge features (bonds). The toolkit was used to build the DL-DockVS-R and DL-DockVS-C models with default hyperparameters. In addition, the rdkit_2d_normalized feature setting and the no_features_scaling setting were added to the training process to improve the generalization capability of the models. Both the DL-DockVS-R and DL-DockVS-C models were trained and validated on a random 80% split of each dataset and tested on the remaining 20%. The hyperparameters hidden_size, depth, dropout, and ffn_num_layers were optimized by 10-fold cross validation, and the optimal values were used to construct the final DL-DockVS-R and DL-DockVS-C models.
5.4 Model evaluation metrics
For the regression tasks, the RMSE is calculated to evaluate the ability of the models to reproduce the docking ranking. For the classification tasks, the metrics of AUC, Accuracy, Precision, Specificity, FPR (false positive rate), TPR (true positive rate), and F1-score are calculated to evaluate model performance on external datasets. The formulas are as follows:
Accuracy = (TP + TN)/(TP + TN + FP + FN)
Precision = TP/(TP + FP)
Specificity = TN/(TN + FP)
FPR = FP/(FP + TN)
TPR = TP/(TP + FN)
F1-score = 2 × Precision × TPR/(Precision + TPR)
The above formulas are the evaluation metrics of the classification models, where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. Accuracy is the percentage of correct predictions on the test data. Precision is the fraction of true positives among all samples predicted positive. Specificity is the proportion of negative samples that are correctly predicted. TPR (recall) is the proportion of actual positive samples that are correctly predicted. F1-score is the harmonic mean of Precision and Recall.
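These confusion-matrix definitions can be collected in a small helper; this is a restatement of textbook formulas, not code from the study, and the counts in the example are arbitrary.

```python
# Standard classification metrics computed from confusion-matrix counts.

def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)
    tpr = tp / (tp + fn)                       # recall
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"Accuracy": accuracy, "Precision": precision,
            "Specificity": specificity, "FPR": fpr, "TPR": tpr, "F1": f1}

m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(round(m["Accuracy"], 2), round(m["TPR"], 2), round(m["F1"], 3))  # 0.85 0.8 0.842
```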
5.5 Datasets and their usages
In this study, several datasets were used for different purposes.
ChemDiv subset. For the compound library to be docked to each target, a ChemDiv library of 1.25 million purchasable compounds was clustered with a Tanimoto similarity threshold of 0.4, yielding a clustered subset of 287,216 compounds (named the ChemDiv subset; the SMILES-format files of the compounds are available in Additional file 2).
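The idea of this clustering step can be illustrated with a minimal, pure-Python leader-style sketch at the 0.4 Tanimoto threshold. Real pipelines would compute molecular fingerprints with a toolkit such as RDKit; the bit-set "fingerprints" below are toy stand-ins, and this is not necessarily the exact algorithm used in the study.

```python
# Greedy leader-style clustering: keep one representative per cluster;
# a compound starts a new cluster only if its Tanimoto similarity to every
# existing representative is at or below the threshold.

def tanimoto(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def leader_cluster(fingerprints, threshold=0.4):
    """Return indices of cluster representatives (one per cluster)."""
    reps = []
    for i, fp in enumerate(fingerprints):
        if all(tanimoto(fp, fingerprints[r]) <= threshold for r in reps):
            reps.append(i)
    return reps

fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]   # toy "on-bit" sets
print(leader_cluster(fps))  # [0, 2]: the second compound joins cluster 0
```

Applied to 1.25 million fingerprints with a 0.4 threshold, such a procedure yields one diverse representative subset of the library, as in the 287,216-compound ChemDiv subset.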
Test set from DUD-E dataset. The descriptions of the used target proteins are listed in Tab.7. This table also includes the numbers of their experimentally validated active compounds as well as decoy compounds from the DUD-E database [
53] and the evaluation of the ability of VinaLC and rDock to reproduce the crystal poses. The table shows that all RMSD-rDock and RMSD-VinaLC values are less than 2 Å; therefore, VinaLC and rDock are suitable for these target systems. These compounds are used as validation sets for DL-DockVS: all of them were docked with the two programs against the corresponding targets, but none were used in the training process. Their SMILES strings are listed in Additional file 4.
Test set from ChEMBL dataset. In the external test stage, we tested DL-DockVS performance on two self-constructed datasets. To evaluate the DL-DockVS-R models, a relatively small dataset was constructed by randomly selecting 500,000 compounds from the ChEMBL database (their SMILES strings are listed in Additional file 5). To evaluate DL-DockVS-C performance, compounds from the ChEMBL database were labeled as active or inactive according to the IC50 value for each target; their SMILES strings are provided in Additional file 6. The whole ChEMBL27 database, containing about 1.9 million molecules, was predicted and then filtered by the DL-DockVS-R models, and the filtered molecules were used as the input to the DL-DockVS-C models.
Test set from ZINC. The ZINC-15 dataset was used to illustrate the speed of DL-DockVS on a large dataset, with the target BRAF as an example.
The Author(s). Published by Higher Education Press.