DEEPScreen: Virtual Screening with Deep Convolutional Neural Networks Using Compound Images

Descriptions of folders and files in the DEEPScreen repository

Development and Dependencies

DEEPScreen is a collection of command-line-based prediction models written in Python 3.x. DEEPScreen was developed and tested on macOS, but it should run on any Unix-like operating system.

Pre-trained ready-to-use prediction models are available here. However, it is possible to build and run the models (for any target protein, as long as the training data is provided) with the desired hyper-parameters on any standard computer with a Unix-like operating system.

Please install all dependencies listed below. The versions given are the ones used during development; newer versions of the listed packages should also work. If RDKit is installed into a dedicated environment (e.g. a conda environment), the other dependencies should be installed into the same environment, and the environment's Python version should be 3.x.

Python 3.5.2

TensorFlow 1.12.0

TFLearn 0.3.2

scikit-learn 0.19.2

NumPy 1.14.5

CairoSVG 2.1.2

RDKit 2016.09.4

OpenCV 3.3.0
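Since the dependencies should live in the same environment as RDKit, a minimal conda-based setup might look like the sketch below. The channel and PyPI package names are assumptions based on common packaging conventions, not taken from the DEEPScreen repository; adjust versions as needed:

```shell
# Hypothetical environment setup; versions mirror the list above.
conda create -n deepscreen python=3.5
conda activate deepscreen
# RDKit is commonly distributed through conda channels rather than PyPI.
conda install -c rdkit rdkit=2016.09.4
# Remaining dependencies from PyPI (OpenCV is packaged as opencv-python).
pip install tensorflow==1.12.0 tflearn==0.3.2 scikit-learn==0.19.2 \
    numpy==1.14.5 CairoSVG==2.1.2 opencv-python
```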

Please refer to the following sections for step-by-step guidelines for using DEEPScreen.

How to run pre-trained ready-to-use DEEPScreen models to generate DTI predictions

Step-by-step operation:

  1. Install the listed dependencies

  2. Clone the DEEPScreen repository

  3. Check whether the target(s) of interest are among the 704 DEEPScreen targets and, if so, find their ChEMBL identifier(s). The source file for these operations is 'DEEPScreen_704_Targets_UniP_EntN_GenSym_Org_ChEid.txt'

  4. Search for the ChEMBL identifier of the target(s) of interest in the model files folder (here) to find and download the necessary model file triplet(s); the model filenames contain the ChEMBL identifiers (example below). It is sufficient to download only the model file triplet of the target protein of interest, since target-based predictive models are independent of each other.

  5. Place the model file triplet(s) in the tflearnModels folder

  6. Prepare the test compounds file containing the SMILES representations of the compounds to be screened against the target of interest, and place it under the trainingFiles folder. This should be a tab-separated file with a header, where the first column is the query compound identifier and the second column is the SMILES string. Additional columns may be present; they will be discarded by the script. A sample file (sample_test_compound_file.txt) is provided under the trainingFiles folder.

  7. Run the loadDEEPScreenModel.py script individually for each target of interest, to generate the predictions (example below).
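As a sketch of step 6, the following writes a minimal test compounds file in the expected tab-separated layout. The compound identifiers, SMILES strings and header names below are illustrative only, not taken from the DEEPScreen data:

```python
import csv

# Hypothetical query compounds: (identifier, SMILES). Any extra columns
# beyond the first two would be discarded by the prediction script.
records = [
    ("CPD0001", "CCO"),                      # ethanol
    ("CPD0002", "c1ccccc1"),                 # benzene
    ("CPD0003", "CC(=O)Oc1ccccc1C(=O)O"),    # aspirin
]

with open("my_test_compound_file.txt", "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["compound_id", "smiles"])  # header line is expected
    writer.writerows(records)
```

The resulting file would then be placed under the trainingFiles folder and passed to the prediction script by name.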

Example:

The model files for the example target CHEMBL286 (human renin protein, UniProt accession: P00797) are provided under the tflearnModels folder.

Run the loadDEEPScreenModel.py script, from inside the bin folder of the local repository, to generate DTI predictions for a set of compounds. The arguments of the script are as follows:

python loadDEEPScreenModel.py  <target_id> <model_name> <filename_of_compound_smiles> <best_threshold_value> 

where:

    • <target_id> is the ChEMBL ID of the target protein,
    • <model_name> is the name of the model for the corresponding target, as stored under the tflearnModels folder (without the filename extension),
    • <filename_of_compound_smiles> is the name of the test compounds file (containing the SMILES of the query compounds) inside the trainingFiles folder, and
    • <best_threshold_value> is the score cut-off (threshold) used for the binary prediction decision (active/interacting/positive vs. inactive/non-interacting/negative), i.e. the value that yielded the best predictive performance during model training/validation/testing (it is printed to the screen as part of the training output, as explained below).

You can run the following command (while inside /path-to-local-repository/bin) to generate DTI predictions for CHEMBL286 (renin) and the compounds in the sample compounds file:

python loadDEEPScreenModel.py  CHEMBL286 CNNModel_CHEMBL286_adam_0.0005_15_256_0.6_True-525 sample_test_compound_file.txt 0.83

Output of the script:

The script outputs the identifiers of the compounds (as given in the input test compounds file) that are predicted as active (i.e., interacting) for the corresponding target (CHEMBL286 in our example):

ACTIVE PREDICTIONS:CHEMBL286
CHEMBL1825183
CHEMBL302984
CHEMBL3143484
CHEMBL431854
CHEMBL88356
CHEMBL3400431

The expected prediction run time for the example pre-trained model on the provided sample input dataset on a "normal" desktop computer is around 10 seconds. Prediction run times scale roughly linearly with the number of input compounds. There is no typical install time for the pre-trained models, as they are ready to use once downloaded from (here); the download time depends on the connection speed and the model file sizes.
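Since the target-based models are independent of each other, predictions for several targets can be scripted by assembling one loadDEEPScreenModel.py invocation per target. The sketch below only builds the argument lists; the model names and thresholds in the dictionary are placeholders that would come from the downloaded model files and the training output:

```python
def build_prediction_command(target_id, model_name, smiles_file, threshold):
    """Assemble the argument list for one loadDEEPScreenModel.py run."""
    return ["python", "loadDEEPScreenModel.py",
            target_id, model_name, smiles_file, str(threshold)]

# Hypothetical per-target settings; the CHEMBL286 values match the
# example shown above.
targets = {
    "CHEMBL286": ("CNNModel_CHEMBL286_adam_0.0005_15_256_0.6_True-525", 0.83),
}

commands = [
    build_prediction_command(tid, model, "sample_test_compound_file.txt", thr)
    for tid, (model, thr) in targets.items()
]
# Each entry in `commands` could then be passed to subprocess.run(...)
# from inside the bin folder of the local repository.
```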

The DEEPScreen_Largescale_DTI_predictions.zip file contains the results of the large-scale DTI prediction run, in which the DEEPScreen targets were screened against more than 1 million compound records in ChEMBL, as described above.

How to train a target-based DEEPScreen model

Important note: Since highly optimized pre-trained models are already provided (here), the user is not required to do any model training.

Step-by-step operation:

  1. Install the listed dependencies

  2. Clone the DEEPScreen repository. Large files under the trainingFiles folder ('act_inact_comps_10.0_20.0_chembl_preprocessed_sp_b_pchembl_data_blast_comp_20.txt', 'chembl_23_chemreps.txt.zip' and 'Lenselink_Dataset_Files.zip') cannot be downloaded directly when the repository is cloned; these files should be downloaded and placed in the local trainingFiles folder manually

  3. Decompress the zipped files

  4. Run DEEPScreen script by providing values for the following command line arguments:

    • The selected DNN architecture (ImageNetInceptionV2 or CNNModel)
    • The target ChEMBL ID
    • The optimizer type (adam, momentum or rmsprop)
    • The learning rate
    • The number of epochs
    • The number of neurons in the first fully-connected layer
    • The number of neurons in the second fully-connected layer
    • The drop-out keep rate
    • Whether to save the model (1 to save, 0 not to save)

To train a model using the same hyper-parameter values as DEEPScreen, use the values given in the file deepscreen_models_hyperparameters_performance_results.tsv, which is located under the resultFiles folder. Below is a sample command to train a predictive model for the renin protein, whose ChEMBL ID is CHEMBL286:

python trainDEEPScreen.py CNNModel CHEMBL286 adam 0.0005 15 256 0 0.6 1

Output of the script:

The performance evaluation results and the predictions for the compounds in the independent test set are given as output, followed by the score cut-off (threshold) value that yields the best performance. Note down this threshold value, as it must be given as an input argument when using the trained model for inference/prediction. In the last line, the predictions for the test compounds are written as tab-separated entries, where each entry is a comma-separated triple of compound identifier, confusion-matrix category and predicted label:

An example output of the command above:

Test AUC:0.9251733703190015
Test AUPRC:0.9372649744647131
Test_f1score:0.89
Test_mcc:0.74
Test_accuracy:0.88
Test_precision:0.91
Test_recall:0.82
Test_tp:181                
Test_fp:18
Test_tn:122
Test_fn:25
Best_threshold:0.8299999999999998
CHEMBL1934285,TN,INACT  CHEMBL61236,TN,INACT    CHEMBL3127099,TN,INACT  CHEMBL406475,TP,ACT     CHEMBL266334,TP,ACT, ...
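The last line of the output above can be parsed by splitting on tabs and then on commas; each entry is a triple of compound ID, confusion-matrix category (TP/FP/TN/FN) and predicted label (ACT/INACT). A small sketch using a shortened copy of that line:

```python
# Shortened copy of the example prediction line shown above.
line = ("CHEMBL1934285,TN,INACT\tCHEMBL61236,TN,INACT\t"
        "CHEMBL406475,TP,ACT\tCHEMBL266334,TP,ACT")

predictions = []
for entry in line.strip().split("\t"):
    compound_id, category, label = entry.split(",")
    predictions.append((compound_id, category, label))

# Compounds predicted as active for the target:
actives = [cid for cid, _, label in predictions if label == "ACT"]
```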

The expected training run time for the example model on the provided training dataset (with the given hyper-parameters) on a "normal" desktop computer is around 10 minutes. Training run times can vary dramatically, from a few minutes to several days on a "normal" desktop computer, depending on the selected hyper-parameters and the chosen DNN architecture (i.e., the in-house CNN or the Inception network). Training run times can also be considered the install times for the DEEPScreen models.

It is possible to observe a difference in performance measures (compared to the reported model performances) within a 10% range, due to both the random initialization of weights at the beginning of each training run and the random splitting of train/test instances.
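To look up the hyper-parameter values mentioned above for a given target in deepscreen_models_hyperparameters_performance_results.tsv, a generic filter like the one below can be used. The exact column layout of the file is not assumed here; rows are matched simply by the presence of the ChEMBL ID, and the sample rows are illustrative stand-ins:

```python
def find_target_rows(tsv_lines, target_id):
    """Return the tab-split rows that mention the given ChEMBL ID."""
    return [line.rstrip("\n").split("\t")
            for line in tsv_lines
            if target_id in line.split("\t")]

# Illustrative stand-ins for rows of the real TSV file:
sample_rows = [
    "CHEMBL286\tCNNModel\tadam\t0.0005\t15\t256\t0\t0.6\n",
    "CHEMBL4282\tImageNetInceptionV2\trmsprop\t0.001\t5\t0\t0\t0.8\n",
]
rows = find_target_rows(sample_rows, "CHEMBL286")
```

In practice the lines would come from opening the TSV file under the resultFiles folder instead of the in-memory sample.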

How to re-produce the results for DEEPScreen vs DL-based DTI predictors performance comparison

The names of the targets and the selected hyper-parameter values are available in the corresponding files under the resultFiles folder.

Please first follow the step-by-step operation under the section 'How to train a target-based DEEPScreen model'.

python trainConvNetMUV.py CNNModel MUV_692 adam 0.001 15 128 0 0.8 0
python trainDEEPScreenDUDE.py ImageNetInceptionV2 hdac8 adam 0.0001 5 0 0 0.8 0
python trainDEEPScreenLenselink.py ImageNetInceptionV2 CHEMBL274 adam 0.0001 5 0 0 0.8 0

The output of these commands is the same as the output of the script shown above. Please note that you should unzip the corresponding archives (DUDEDatasetFiles.zip, MUVDatasetFiles.zip or Lenselink_Dataset_Files.zip) before running the training scripts. It is possible to observe differences in performance measures (compared to the reported model performances) because negative training samples are selected randomly (equal in number to the positive samples) at the beginning of each run, and the active/inactive sets are extremely unbalanced (the number of negatives is much higher) in the MUV and DUD-E datasets, so only a small portion of the negative samples is used during training.
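The random negative sampling described above can be sketched as follows. This is an illustration of the general idea, not the repository's actual sampling code, and the sample identifiers are placeholders:

```python
import random

def balance_negatives(actives, inactives, seed=None):
    """Draw as many inactives as there are actives, uniformly at random.

    Because the MUV and DUD-E inactive pools are far larger than the
    active sets, each run sees a different small subset of negatives,
    which is one source of run-to-run performance variation.
    """
    rng = random.Random(seed)
    if len(inactives) <= len(actives):
        return list(inactives)
    return rng.sample(inactives, len(actives))

# Placeholder compound lists standing in for real active/inactive sets.
positives = ["act_%d" % i for i in range(50)]
negatives = ["inact_%d" % i for i in range(5000)]
sampled = balance_negatives(positives, negatives, seed=0)
```

Fixing the seed makes a run reproducible; the training scripts described above do not do this, which is why repeated runs give slightly different performance figures.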

License

DEEPScreen Copyright (C) 2019 CanSyL

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.