
DeepAffinity: Intro

Drug discovery demands rapid quantification of compound-protein interaction (CPI). However, there is a lack of methods that can predict compound-protein affinity from sequences alone with high applicability, accuracy, and interpretability. We present an integration of domain knowledge and learning-based approaches. Under novel representations of structurally-annotated protein sequences, we propose a semi-supervised deep learning model that unifies recurrent and convolutional neural networks to exploit both unlabeled and labeled data, jointly encoding molecular representations and predicting affinities. Performance for new protein classes with few labeled data is further improved by transfer learning. Furthermore, novel attention mechanisms are developed and embedded in our model to enhance its interpretability. Lastly, alternative representations using protein sequences or compound graphs and a unified RNN/GCNN-CNN model using graph CNN (GCNN) are also explored to reveal algorithmic challenges ahead.

DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks

What has happened since DeepAffinity? (Part 1) Explainable Deep Relational Networks for Predicting Compound–Protein Affinities and Contacts (2020).

What has happened since DeepAffinity? (Part 2) Cross-Modality and Self-Supervised Protein Embedding (2022).

Training DeepAffinity: Illustration

(Figure: DeepAffinity training process)

Pre-requisite:

conda env create -n envname -f environment.yml


Testing the model

To test DeepAffinity on a new dataset, please follow the steps below:

You may use the script to run our model in one command. The details can be found in our manual (last updated: Apr. 9, 2020).

(Apr. 27, 2021) If the number of testing pairs in the input is below 64 (the batch size), the script returns an error (InvalidArgumentError ... ConcatOp : Dimensions of inputs should match: ...). Such rigidity is unfortunately due to TFLearn. An easy workaround is to "pad" the input file to reach at least 64 pairs, using arbitrary compound-protein inputs.
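The padding workaround above can be sketched as follows; this is a minimal example that repeats existing pairs until the batch-size minimum is met (the one-pair-per-line file layout and the placeholder pair are assumptions, not the repository's actual format):

```python
# Pad a testing-pair input to the 64-pair batch size that the
# TFLearn-based script requires. Predictions for the padded
# duplicates can simply be discarded afterwards.
BATCH_SIZE = 64

def pad_pairs(lines, batch_size=BATCH_SIZE):
    """Repeat existing pairs (cycling through them) until at
    least `batch_size` entries exist."""
    padded = list(lines)
    while len(padded) < batch_size:
        padded.append(lines[len(padded) % len(lines)])
    return padded

# Placeholder pair: a compound SMILES and a protein sequence.
pairs = ["CCO\tMKTAYIAK"]
padded = pad_pairs(pairs)
```

Remember to keep only the predictions corresponding to your real pairs when post-processing the output.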

(Aug. 21, 2020) We are now providing SPS (Structure Property-annotated Sequence) representations for all human proteins! zip (Credit: Dr. Tomas Babak at Queen's University). Columns:
1. Gene identifier
2. Protein FASTA
3. SS (Scratch)
4. SS8 (Scratch)
5. acc (Scratch)
6. acc20
7. SPS
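Assuming the file is tab-separated with the seven columns in the order listed above (an assumption; check the downloaded file), a minimal parsing sketch:

```python
import csv
import io

# Column names matching the SPS table described above (order assumed).
COLUMNS = ["gene_id", "fasta", "ss", "ss8", "acc", "acc20", "sps"]

def parse_sps(handle):
    """Yield one dict per protein from a tab-separated SPS file."""
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        yield dict(zip(COLUMNS, row))

# Demo with an in-memory line; the values are placeholders, not real data.
sample = "GENE1\tMKT\tCCH\tCCCHHHH\t10\t20\tAhs"
rows = list(parse_sps(io.StringIO(sample)))
```

For the real file, open it with `open(path)` and pass the handle to `parse_sps`.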

P.S. Considering the distribution of protein sequence lengths in our training data, our trained checkpoints are recommended for proteins between a few tens and 1500 residues in length.
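To apply that recommendation when preparing inputs, one can filter proteins by sequence length before inference; a small sketch (the lower bound of 20 is an assumption for "a few tens"):

```python
# Length range the trained checkpoints were exposed to.
MIN_LEN = 20    # assumption: "a few tens" of residues
MAX_LEN = 1500

def in_supported_range(seq, lo=MIN_LEN, hi=MAX_LEN):
    """Return True if the protein sequence length is within the
    range recommended for the released checkpoints."""
    return lo <= len(seq) <= hi
```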

Re-training the seq2seq models for a new dataset:

(Added on Jan. 18, 2021) To re-train the seq2seq models for a new dataset, please follow the steps below:

Note:

We recommend referring to PubChem for canonical SMILES for compounds.
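Canonical SMILES can be retrieved from PubChem's PUG REST service by compound name; the sketch below only builds the request URL (no network call), so you can fetch it with any HTTP client:

```python
from urllib.parse import quote

# Base URL of PubChem's PUG REST service.
PUG_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def canonical_smiles_url(name):
    """Build the PUG REST URL that returns a compound's canonical
    SMILES as plain text, given its name."""
    return (f"{PUG_BASE}/compound/name/{quote(name)}"
            "/property/CanonicalSMILES/TXT")

url = canonical_smiles_url("aspirin")
```

Fetching `url` (e.g. with `urllib.request.urlopen`) returns the canonical SMILES string for the named compound.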

Citation:

If you find the code useful for your research, please consider citing our paper:

@article{karimi2019deepaffinity,
  title={DeepAffinity: interpretable deep learning of compound--protein affinity through unified recurrent and convolutional neural networks},
  author={Karimi, Mostafa and Wu, Di and Wang, Zhangyang and Shen, Yang},
  journal={Bioinformatics},
  volume={35},
  number={18},
  pages={3329--3338},
  year={2019},
  publisher={Oxford University Press}
}

Contacts:

Yang Shen: yshen@tamu.edu

Di Wu: wudi930325@gmail.com

Mostafa Karimi: mostafa_karimi@tamu.edu