<div align="center">

DeepRefine

Paper DOI

EGR_Architecture.png (overview of the EGR model architecture)

Refinement_Example.png (example of protein complex structure refinement)

</div>

Description

A geometric deep learning pipeline for refining and assessing protein complex structures, introducing the new EGR model. EGR is an attention-based E(3)-equivariant graph neural network that performs end-to-end refinement and quality assessment of protein complexes represented as all-atom or Cα-atom graphs. EGR achieves significant computational speed-ups and better or competitive results compared to current baseline methods. If you have any questions or suggestions, please contact us at acmwhb@umsystem.edu. We would be happy to help!
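
The actual EGR layers live under project/modules/egr and operate on DGL graphs; purely as an illustration of the kind of attention-gated, E(3)-equivariant message passing EGR builds on, below is a minimal EGNN-style layer sketched with plain PyTorch tensors. All class and variable names here are illustrative and are not part of this codebase.

import torch
import torch.nn as nn

class EquivariantAttentionLayer(nn.Module):
    """Toy attention-gated, E(3)-equivariant message-passing layer (EGNN-style); illustration only."""

    def __init__(self, feat_dim: int, hidden_dim: int = 64):
        super().__init__()
        # Messages depend only on invariant inputs: node features and squared pairwise distances.
        self.message_mlp = nn.Sequential(
            nn.Linear(2 * feat_dim + 1, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
        )
        self.attn_mlp = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
        self.coord_mlp = nn.Linear(hidden_dim, 1, bias=False)
        self.node_mlp = nn.Sequential(
            nn.Linear(feat_dim + hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, h: torch.Tensor, x: torch.Tensor):
        # h: (N, feat_dim) invariant node features; x: (N, 3) atom coordinates.
        diff = x.unsqueeze(1) - x.unsqueeze(0)               # (N, N, 3) relative positions
        dist2 = (diff ** 2).sum(dim=-1, keepdim=True)        # (N, N, 1) squared distances
        h_i = h.unsqueeze(1).expand(-1, h.size(0), -1)
        h_j = h.unsqueeze(0).expand(h.size(0), -1, -1)
        m = self.message_mlp(torch.cat([h_i, h_j, dist2], dim=-1))
        a = self.attn_mlp(m)                                 # attention gate per node pair
        # Coordinate update: attention-weighted sum of relative vectors (rotation/translation equivariant).
        x_out = x + (diff * self.coord_mlp(m) * a).mean(dim=1)
        # Feature update: aggregate attention-gated messages (invariant).
        h_out = h + self.node_mlp(torch.cat([h, (m * a).sum(dim=1)], dim=-1))
        return h_out, x_out

# Quick smoke test on a random, fully connected 8-atom "graph".
h, x = torch.randn(8, 16), torch.randn(8, 3)
h_new, x_new = EquivariantAttentionLayer(feat_dim=16)(h, x)
print(h_new.shape, x_new.shape)  # torch.Size([8, 16]) torch.Size([8, 3])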

Citing this work

If you use the code or data associated with this package or find our work helpful, please cite:

@article{morehead2022egr,
  title = {EGR: Equivariant Graph Refinement and Assessment of 3D Protein Complex Structures},
  author = {Alex Morehead and Xiao Chen and Tianqi Wu and Jian Liu and Jianlin Cheng},
  year = {2022},
  eprint = {N/A},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG}
}

Repository Directory Structure

DeepRefine
│
└───docker
│
└───img
│
└───project
│     │
│     └───checkpoints
│     │   │
│     │   └───EGR_All_Atom_Models
│     │   │
│     │   └───EGR_Ca_Atom_Models
│     │   │
│     │   └───SEGNN_Ca_Atom_Models
│     │
│     └───datasets
│     │   │
│     │   └───Input
│     │   │
│     │   └───Output
│     │   │
│     │   └───RG
│     │   │
│     │   └───Test_Input
│     │   │
│     │   └───Test_Output
│     │
│     └───modules
│     │   │
│     │   └───egr
│     │   │
│     │   └───segnn
│     │   │
│     │   └───set
│     │   deeprefine_lit_modules.py
│     │
│     └───utils
│     │   │
│     │   └───egr
│     │   │
│     │   └───segnn
│     │   │
│     │   └───set
│     │   deeprefine_constants.py
│     │   deeprefine_utils.py
│     │
│     lit_model_predict.py
│     lit_model_predict_docker.py
│
└───tests
.gitignore
citation.bib
CONTRIBUTING.md
environment.yml
LICENSE
README.md
requirements.txt
setup.cfg
setup.py

Datasets

Our benchmark datasets (PSR Test, Benchmark 2, and M4S Test) can be downloaded as follows:

wget https://zenodo.org/record/6570660/files/DeepRefine_Benchmark_Datasets.tar.xz
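
Once the download finishes, unpack the archive with `tar -xf DeepRefine_Benchmark_Datasets.tar.xz` or, equivalently, from Python via the standard-library tarfile module (the destination path below is just a placeholder):

import tarfile

# Extract the xz-compressed benchmark datasets into the current directory (adjust 'path' as desired).
with tarfile.open("DeepRefine_Benchmark_Datasets.tar.xz", mode="r:xz") as archive:
    archive.extractall(path=".")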

The refinement datasets contain:

  1. final/raw/pred directory: contains subdirectories of decoy structure PDB files
  2. final/raw/true directory: contains subdirectories of native structure PDB files

The quality assessment dataset contains:

  1. target directories: each contains decoy structure PDB files corresponding to a given protein target
  2. label_info.csv: a CSV listing each decoy structure's DockQ score and CAPRI class label
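
For example, the QA labels can be loaded with pandas as sketched below. Note that the column names used here (target, decoy, dockq, capri_class) are assumptions for illustration, so check the actual header of the downloaded label_info.csv first.

import pandas as pd

labels = pd.read_csv("label_info.csv")
print(labels.columns.tolist())  # confirm the real column names before relying on the ones assumed below

# Report the best-scoring decoy per target by DockQ (column names are assumed, not guaranteed).
best = labels.loc[labels.groupby("target")["dockq"].idxmax()]
print(best[["target", "decoy", "dockq", "capri_class"]])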

Inference Pipeline Directory Structure

An example of the input dataset directory structure our inference pipeline expects, as well as the output directory structure it will produce, is as follows:

DeepRefine
│
└───docker
│
└───img
│
└───project
      │
      └───checkpoints
      │   │
      │   └───EGR_All_Atom_Models
      │   │
      │   └───EGR_Ca_Atom_Models
      │   │
      │   └───SEGNN_Ca_Atom_Models
      │
      └───datasets
          │
          └───Input
          │   │
          │   └───custom_decoy_dataset
          │         │
          │         └───7AMV
          │         │     │
          │         │     └───7AMV_[0-4].pdb  # Input decoy PDB files
          │         │     
          │        ...
          │         │
          │         │
          │         └───7OEL
          │               │
          │               └───7OEL_[0-4].pdb  # Input decoy PDB files
          │
          └───Output
              │
              └───custom_decoy_dataset
                    │
                    └───7AMV
                    │     │
                    │     └───7AMV_[0-4].dill  # Output protein dictionary pickle files containing DGLGraph objects
                    │     │
                    │     └───7AMV_[0-4].pdb  # Input decoy PDB files
                    │     │
                    │     └───7AMV_[0-4]_refined.pdb  # Output refined decoy PDB files
                    │     │
                    │     └───7AMV_[0-4]_refined_plddt.csv  # Output per-residue LDDT scores
                    │     
                   ...
                    │
                    │
                    └───7OEL
                          │
                          └───7OEL_[0-4].dill  # Output protein dictionary pickle files containing DGLGraph objects
                          │
                          └───7OEL_[0-4].pdb  # Input decoy PDB files
                          │
                          └───7OEL_[0-4]_refined.pdb  # Output refined decoy PDB files
                          │
                          └───7OEL_[0-4]_refined_plddt.csv  # Output per-residue LDDT scores
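
If your decoys start out as a single flat folder of PDB files named like `<TARGET>_<N>.pdb`, a small helper such as the sketch below (all paths are hypothetical) can arrange them into the per-target input layout shown above:

import shutil
from pathlib import Path

def stage_decoys(flat_dir: str, input_dir: str) -> None:
    """Copy decoy PDBs from a flat folder into <input_dir>/<TARGET>/<decoy>.pdb."""
    for pdb in Path(flat_dir).glob("*.pdb"):
        target = pdb.stem.split("_")[0]  # e.g., "7AMV_3.pdb" -> "7AMV"
        dest = Path(input_dir) / target
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy2(pdb, dest / pdb.name)

# Hypothetical paths -- adjust both to your own dataset locations.
stage_decoys("my_flat_decoys", "project/datasets/Input/custom_decoy_dataset")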
           

Running DeepRefine via Docker

The simplest way to run DeepRefine is using the provided Docker script.

The following steps are required to ensure that Docker is installed and working correctly:

  1. Install Docker.

  2. Check that DeepRefine will be able to use a GPU by running:

    docker run --rm --gpus all nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04 nvidia-smi
    

    The output of this command should show a list of your GPUs. If it doesn't, check if you followed all steps correctly when setting up the NVIDIA Container Toolkit or take a look at the following NVIDIA Docker issue.

Now that we know Docker is functioning properly, we can begin building our Docker image for DeepRefine:

  1. Clone this repository and cd into it.

    git clone https://github.com/BioinfoMachineLearning/DeepRefine
    cd DeepRefine/
    DR_DIR=$(pwd)
    
  2. Build the Docker image (warning: requires ~16 GB of disk space). To enable optional support for models operating on Ca atom graphs, substitute your license key for Modeller within the following Dockerfile:

    docker build -f docker/Dockerfile -t deeprefine .
    
  3. Install the run_docker.py dependencies. Note: You may optionally wish to create a Python Virtual Environment to prevent conflicts with your system's Python environment.

    pip3 install -r docker/requirements.txt
    
  4. Run run_docker.py, pointing it to an input PDB directory that contains all decoy structures for the protein target for which you wish to predict refined structures and structural quality. Below are the available configurations for model inference; select one and copy/paste it into your terminal session. For example, for the RCSB test target with PDB ID 6GS2:

    # Settings for predicting refined structures and per-residue quality using all-atom graphs and an EGR model
    # (Note: Best model overall for refinement and QA) #
    ckpt_dir="$DR_DIR"/project/checkpoints/EGR_All_Atom_Models
    ckpt_name=LitPSR_EGR_AllAtomModel1_Seed42.ckpt
    atom_selection_type=all_atom
    seed=42
    nn_type=EGR
    graph_return_format=dgl
    
    # To predict refined structures and per-residue quality using Ca-atom graphs and an EGR model
    # (Note: Best model for balanced QA results) #
    ckpt_dir="$DR_DIR"/project/checkpoints/EGR_Ca_Atom_Models
    ckpt_name=LitPSR_EGR_CaAtomModel1_Seed32.ckpt
    atom_selection_type=ca_atom
    seed=32
    nn_type=EGR
    graph_return_format=dgl
    
    # To predict refined structures and per-residue quality using Ca-atom graphs and an SEGNN model
    # (Note: Best model for QA ranking loss) #
    ckpt_dir="$DR_DIR"/project/checkpoints/SEGNN_Ca_Atom_Models
    ckpt_name=LitPSR_SEGNN_CaAtomModel_Seed42.ckpt
    atom_selection_type=ca_atom
    seed=42
    nn_type=SEGNN
    graph_return_format=pyg
    

    Refine atom positions and predict per-residue LDDT scores:

    python3 docker/run_docker.py --perform_pos_refinement --num_gpus 1 --num_workers 1 --input_dataset_dir "$DR_DIR"/project/datasets/Test_Input/Test_Target/ --output_dir "$DR_DIR"/project/datasets/Test_Output/Test_Target/ --ckpt_dir "$ckpt_dir" --ckpt_name "$ckpt_name" --atom_selection_type "$atom_selection_type" --seed "$seed" --nn_type "$nn_type" --graph_return_format "$graph_return_format"
    

    Or, solely predict per-residue LDDT scores (for faster inference times with Ca atom models):

    python3 docker/run_docker.py --num_gpus 1 --num_workers 1 --input_dataset_dir "$DR_DIR"/project/datasets/Test_Input/Test_Target/ --output_dir "$DR_DIR"/project/datasets/Test_Output/Test_Target/ --ckpt_dir "$ckpt_dir" --ckpt_name "$ckpt_name" --atom_selection_type "$atom_selection_type" --seed "$seed" --nn_type "$nn_type" --graph_return_format "$graph_return_format"
    

    This script will generate refined PDB structures (e.g., datasets/Test_Output/Test_Target/6GS2/6GS2_refined.pdb) along with the chosen equivariant graph neural network's per-residue structural quality predictions, saving both to the given output directory (a short sketch for inspecting these outputs follows this list).

  5. Note that by using the default

    --num_gpus 0
    

    flag when executing run_docker.py, the Docker container will only make use of the system's available CPU(s) for prediction. However, by specifying

    --num_gpus 1
    

    when executing run_docker.py, the Docker container will then employ the first available GPU for prediction.

  6. Also, note that protein dictionary files (e.g., 6GS2.dill) created outside of the Docker inference pipeline are not compatible with the Docker inference pipeline and must be re-processed from scratch.
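
As a follow-up to the outputs described in step 4, here is a hedged sketch of reading the *_refined_plddt.csv files and ranking a target's decoys by their mean predicted per-residue LDDT. The CSV layout (one predicted score per residue) is assumed from the file naming rather than documented here, so adjust the column indexing if needed.

import glob
import pandas as pd

# Hypothetical output location for the 6GS2 example target used above.
out_dir = "project/datasets/Test_Output/Test_Target/6GS2"

mean_plddt = {}
for csv_path in glob.glob(f"{out_dir}/*_refined_plddt.csv"):
    per_residue = pd.read_csv(csv_path)
    # Assumption: the last column holds one predicted LDDT value per residue.
    mean_plddt[csv_path] = per_residue.iloc[:, -1].mean()

# Rank decoys from best to worst by mean predicted per-residue LDDT.
for path, score in sorted(mean_plddt.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{score:.3f}  {path}")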

Running DeepRefine via a Traditional Installation (for Linux-Based Operating Systems)

First, install and configure the Conda environment:

# Clone this repository:
git clone https://github.com/BioinfoMachineLearning/DeepRefine

# Change to project directory:
cd DeepRefine
DR_DIR=$(pwd)

# Set up Conda environment locally
conda env create --name DeepRefine -f environment.yml

# Activate Conda environment located in the current directory:
conda activate DeepRefine

# Explicitly install DGL 0.8.0post1 (CUDA 11.3) with Conda
conda install -c dglteam https://anaconda.org/dglteam/dgl-cuda11.3/0.8.0post1/download/linux-64/dgl-cuda11.3-0.8.0post1-py38_0.tar.bz2

# Explicitly install latest version of BioPython with pip
pip3 install git+https://github.com/biopython/biopython@1dd950aec08ed3b63d454fea662697f6949f8dfa

# (Optional) To enable support for models operating on Ca atom graphs, substitute XXXX with your license key for Modeller:
sed -i '2s/.*/license = r\x27'XXXX'\x27/' ~/anaconda3/envs/DeepRefine/lib/modeller-10.2/modlib/modeller/config.py

# (Optional) Perform a full install of the pip dependencies described in 'requirements.txt':
pip3 install -e .

# (Optional) To remove the long Conda environment prefix in your shell prompt, modify the env_prompt setting in your .condarc file with:
conda config --set env_prompt '({name})'
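
As an optional sanity check (not part of the repository), the short snippet below confirms that the freshly created environment can import the core dependencies and see a GPU:

# Run inside the activated DeepRefine Conda environment.
import torch
import dgl

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("DGL:", dgl.__version__)

# Build a trivial graph to confirm DGL works end to end.
g = dgl.graph(([0, 1, 2], [1, 2, 0]))
print("Toy DGL graph:", g.num_nodes(), "nodes /", g.num_edges(), "edges")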

Inference

Predict refined structures and their per-residue quality

Navigate to the project directory and run the prediction script, pointing it to an input PDB directory containing all decoy structures for your protein target.

# Navigate to project directory 
cd "$DR_DIR"/project

Configurations for model inference (Select one and copy/paste it into your terminal session):

# Settings for predicting refined structures and per-residue quality using all-atom graphs and an EGR model
# (Note: Best model overall for refinement and QA) #
ckpt_dir="$DR_DIR"/project/checkpoints/EGR_All_Atom_Models
ckpt_name=LitPSR_EGR_AllAtomModel1_Seed42.ckpt
atom_selection_type=all_atom
seed=42
nn_type=EGR
graph_return_format=dgl

# To predict refined structures and per-residue quality using Ca-atom graphs and an EGR model
# (Note: Best model for balanced QA results) #
ckpt_dir="$DR_DIR"/project/checkpoints/EGR_Ca_Atom_Models
ckpt_name=LitPSR_EGR_CaAtomModel1_Seed32.ckpt
atom_selection_type=ca_atom
seed=32
nn_type=EGR
graph_return_format=dgl

# To predict refined structures and per-residue quality using Ca-atom graphs and an SEGNN model
# (Note: Best model for QA ranking loss) #
ckpt_dir="$DR_DIR"/project/checkpoints/SEGNN_Ca_Atom_Models
ckpt_name=LitPSR_SEGNN_CaAtomModel_Seed42.ckpt
atom_selection_type=ca_atom
seed=42
nn_type=SEGNN
graph_return_format=pyg

Decide whether to predict per-residue LDDT scores and refine atom positions or to instead solely predict per-residue LDDT scores (for faster inference times with Ca atom models). To predict refined positions and LDDT scores, include the flag:

--perform_pos_refinement

Make predictions:

# Hint: Run `python3 lit_model_predict.py --help` to see all available CLI arguments
python3 lit_model_predict.py --perform_pos_refinement --device_type gpu --num_devices 1 --num_compute_nodes 1 --num_workers 1 --batch_size 1 --input_dataset_dir "$DR_DIR"/project/datasets/Test_Input/Test_Target/ --output_dir "$DR_DIR"/project/datasets/Test_Output/Test_Target/ --ckpt_dir "$ckpt_dir" --ckpt_name "$ckpt_name" --atom_selection_type "$atom_selection_type" --seed "$seed" --nn_type "$nn_type" --graph_return_format "$graph_return_format"

This script will generate refined PDB structures (e.g., datasets/Test_Output/Test_Target/6GS2/6GS2_refined.pdb) along with the chosen equivariant graph neural network's per-residue structural quality predictions, saving both to the given output directory.

Also, note that protein dictionary files (e.g., 6GS2.dill) created outside of the traditional inference pipeline are not compatible with the traditional inference pipeline and must be re-processed from scratch.
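
The .dill files written alongside each decoy are Python pickles of the pipeline's intermediate protein dictionaries (containing DGLGraph objects). If you want to inspect one, a hedged sketch using the dill package follows; the dictionary's exact keys are not documented here, so the snippet simply lists whatever it finds (the path is the 6GS2 example from above):

import dill

# Hypothetical path following the 6GS2 example above.
with open("datasets/Test_Output/Test_Target/6GS2/6GS2.dill", "rb") as f:
    protein_dict = dill.load(f)

print(type(protein_dict))
if isinstance(protein_dict, dict):
    # Keys are undocumented here, so just report what the pipeline stored.
    for key, value in protein_dict.items():
        print(f"{key}: {type(value)}")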

Main Results

The following three tables show EGR's consistently best or competitive results across all test datasets in terms of DockQ refinement metrics, QA hit rates, and QA ranking loss.

Refinement Results

Table 1: Performance of different refinement methods on each test dataset.

| Δ Metric | DockQ ↑ | iRMSD ↓ | LRMSD ↓ | FI-DockQ ↑ | API-DockQ ↑ |
|---|---|---|---|---|---|
| <u>PSR-Dockground (4,799)</u> | | | | | |
| Modeller | +0.0002 | -0.6331 | -1.0027 | 63.03% | 0.32% |
| EGR-Cα-Modeller | +0.0053 ± 0.0011 | -1.2285 ± 0.0330 | -3.5226 ± 0.3125 | 79.30% ± 0.93% | 0.89% ± 0.15% |
| SET-AllAtom | +0.0132 ± 0.0040 | -0.8808 ± 0.1158 | -1.6478 ± 0.1047 | 84.90% ± 1.13% | 1.69% ± 0.35% |
| SEGNN-AllAtom | +0.0144 ± 0.0024 | -2.4562 ± 0.049 | -6.6603 ± 0.6702 | 94.46% ± 0.60% | 1.89% ± 0.29% |
| <u>EGR-AllAtom</u> | +0.0097 ± 0.0002 | -0.6274 ± 0.0669 | -2.5561 ± 0.1584 | 83.66% ± 0.49% | 1.59% ± 0.11% |
| <u>PSR-DeepHomo (376)</u> | | | | | |
| Modeller | -0.2465 | +1.5912 | +5.3457 | 8.24% | 0.53% |
| EGR-Cα-Modeller | -0.2796 ± 0.0055 | +2.2075 ± 0.0839 | +6.1711 ± 0.1842 | 8.16% ± 0.76% | 1.17% ± 0.18% |
| SET-AllAtom | -0.0034 ± 0.0003 | +0.0275 ± 0.0050 | +0.0273 ± 0.0104 | 27.39% ± 4.36% | 0.20% ± 0.08% |
| SEGNN-AllAtom | -0.0468 ± 0.0091 | +0.2950 ± 0.0741 | +0.3593 ± 0.1722 | 16.31% ± 3.54% | 0.87% ± 0.20% |
| <u>EGR-AllAtom</u> | -0.0006 ± 0.0018 | +0.0121 ± 0.0054 | +0.0013 ± 0.0028 | 45.12% ± 6.99% | 0.41% ± 0.03% |
| <u>PSR-EVCoupling (195)</u> | | | | | |
| Modeller | -0.1738 | +1.1467 | +4.9877 | 7.18% | 0.74% |
| EGR-Cα-Modeller | -0.2150 ± 0.0073 | +1.9651 ± 0.0647 | +5.8477 ± 0.7759 | 9.91% ± 1.74% | 1.49% ± 0.37% |
| SET-AllAtom | -0.0016 ± 0.0002 | +0.0149 ± 0.0007 | +0.0108 ± 0.0040 | 27.86% ± 5.24% | 0.31% ± 0.11% |
| SEGNN-AllAtom | -0.0250 ± 0.0069 | +0.1646 ± 0.0633 | +0.2400 ± 0.1044 | 18.29% ± 3.41% | 0.89% ± 0.18% |
| <u>EGR-AllAtom</u> | +0.0010 ± 0.0010 | +0.0026 ± 0.0031 | -0.0059 ± 0.0017 | 43.93% ± 5.00% | 0.48% ± 0.03% |
| <u>Benchmark 2 (17)</u> | | | | | |
| Modeller | -0.1855 | +0.7939 | +3.0277 | 5.88% | 0.60% |
| GalaxyRefineComplex | -0.0074 | +0.0778 | -0.0246 | 22.22% | 2.12% |
| EGR-Cα-Modeller | -0.2644 ± 0.0437 | +2.118 ± 0.7832 | +5.9196 ± 1.8589 | 15.69% ± 2.77% | 1.28% ± 0.84% |
| SET-AllAtom | -0.0078 ± 0.0015 | +0.0729 ± 0.0186 | +0.0469 ± 0.0114 | 29.63% ± 2.62% | 0.33% ± 0.14% |
| SEGNN-AllAtom | -0.0328 ± 0.0062 | +0.0807 ± 0.0790 | +0.0781 ± 0.1371 | 31.37% ± 5.54% | 1.24% ± 0.59% |
| <u>EGR-AllAtom</u> | -0.0010 ± 0.0028 | -0.0002 ± 0.003 | -0.0121 ± 0.0021 | 43.14% ± 10.00% | 0.59% ± 0.08% |

Structure Quality Assessment (QA) Results

Table 2: Hit rate performance of different QA methods on the M4S test dataset.

| ID | EGR-Cα-Modeller | SET-AllAtom | SEGNN-AllAtom | <u>EGR-AllAtom</u> | GNN_DOVE | Top-10 Best |
|---|---|---|---|---|---|---|
| 7AOH | 10/10/6 | 9/8/6 | 9/9/9 | 9/9/9 | 9/9/0 | 10/10/10 |
| 7D7F | 0/0/0 | 2/0/0 | 0/0/0 | 0/0/0 | 0/0/0 | 5/0/0 |
| 7AMV | 10/10/8 | 10/10/5 | 10/10/9 | 10/10/5 | 10/10/6 | 10/10/10 |
| 7OEL | 10/10/0 | 10/10/0 | 10/9/0 | 10/9/0 | 10/10/0 | 10/10/0 |
| 7O28 | 10/10/0 | 10/10/0 | 10/10/0 | 10/10/0 | 10/10/0 | 10/10/0 |
| 7MRW | 6/5/0 | 0/0/0 | 0/0/0 | 0/0/0 | 0/0/0 | 10/10/0 |
| 7D3Y | 0/0/0 | 0/0/0 | 0/0/0 | 1/0/0 | 0/0/0 | 10/0/0 |
| 7NKZ | 10/10/9 | 10/9/9 | 10/10/3 | 10/9/9 | 10/9/9 | 10/10/10 |
| 7LXT | 10/10/0 | 4/3/0 | 6/5/0 | 8/7/0 | 1/0/0 | 10/10/0 |
| 7KBR | 10/10/10 | 10/10/10 | 10/10/10 | 10/10/9 | 10/10/9 | 10/10/10 |
| 7O27 | 10/5/0 | 10/7/0 | 10/6/0 | 10/4/0 | 10/4/0 | 10/10/0 |
| Summary | 9/9/4 | 9/8/4 | 8/8/4 | 9/8/4 | 8/7/3 | 11/9/4 |

Table 3: Ranking loss of different QA methods on the M4S test dataset.

| ID | EGR-Cα-Modeller | SET-AllAtom | SEGNN-AllAtom | <u>EGR-AllAtom</u> | GNN_DOVE |
|---|---|---|---|---|---|
| 7AOH | 0.0610 | 0.9280 | 0.9280 | 0.0350 | 0.9280 |
| 7D7F | 0.4700 | 0.4700 | 0.4710 | 0.4590 | 0.0030 |
| 7AMV | 0.1730 | 0.3420 | 0.0130 | 0.3420 | 0.3420 |
| 7OEL | 0.2100 | 0.2100 | 0.3790 | 0.2100 | 0.2100 |
| 7O28 | 0.2330 | 0.0240 | 0.2740 | 0.2440 | 0.2440 |
| 7MRW | 0.6000 | 0.5550 | 0.6030 | 0.5550 | 0.5980 |
| 7D3Y | 0.3240 | 0.2950 | 0.1740 | 0.2950 | 0.2950 |
| 7NKZ | 0.0220 | 0.1100 | 0.1830 | 0.4590 | 0.4590 |
| 7LXT | 0.0500 | 0.2950 | 0.2950 | 0.3890 | 0.2950 |
| 7KBR | 0.1700 | 0.1520 | 0.0520 | 0.1520 | 0.0680 |
| 7O27 | 0.3340 | 0.3340 | 0.3650 | 0.3180 | 0.3340 |
| Summary | 0.2406 ± 0.1801 | 0.3377 ± 0.2486 | 0.3397 ± 0.2613 | 0.3144 ± 0.1506 | 0.3432 ± 0.2538 |
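
For readers unfamiliar with the two summaries above, the sketch below shows how such numbers are commonly computed in the docking QA literature: hit counts use the standard CAPRI-style DockQ thresholds (acceptable ≥ 0.23, medium ≥ 0.49, high ≥ 0.80) over each method's top-10 ranked decoys, and ranking loss is the DockQ gap between a target's truly best decoy and the decoy the method ranks first. This is an illustrative, assumed definition, not a restatement of the paper's exact evaluation protocol.

from typing import Dict, Tuple

def hit_counts(pred_scores: Dict[str, float], dockq: Dict[str, float],
               top_k: int = 10) -> Tuple[int, int, int]:
    """Acceptable/medium/high-quality hits among the top_k decoys ranked by predicted score."""
    ranked = sorted(pred_scores, key=pred_scores.get, reverse=True)[:top_k]
    acceptable = sum(dockq[d] >= 0.23 for d in ranked)
    medium = sum(dockq[d] >= 0.49 for d in ranked)
    high = sum(dockq[d] >= 0.80 for d in ranked)
    return acceptable, medium, high

def ranking_loss(pred_scores: Dict[str, float], dockq: Dict[str, float]) -> float:
    """DockQ of the truly best decoy minus DockQ of the decoy the method ranks first."""
    top_ranked = max(pred_scores, key=pred_scores.get)
    return max(dockq.values()) - dockq[top_ranked]

# Toy example with three decoys of one hypothetical target.
pred = {"decoy_1": 0.81, "decoy_2": 0.74, "decoy_3": 0.69}
true = {"decoy_1": 0.45, "decoy_2": 0.61, "decoy_3": 0.12}
print(hit_counts(pred, true, top_k=3))     # (2, 1, 0)
print(round(ranking_loss(pred, true), 2))  # 0.16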

Train EGR models using Custom Datasets

We plan to release our training code and datasets soon.

Acknowledgements

DeepRefine communicates with and/or references a number of separate libraries and packages (see environment.yml and requirements.txt for the full list of dependencies).

We thank all their contributors and maintainers!

License and Disclaimer

Copyright 2022 University of Missouri-Columbia Bioinformatics & Machine Learning (BML) Lab.

DeepRefine Code License

Licensed under the GNU General Public License, Version 3 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.gnu.org/licenses/gpl-3.0.en.html.

Third-party software

Use of the third-party software, libraries or code referred to in the Acknowledgements section above may be governed by separate terms and conditions or license provisions. Your use of the third-party software, libraries or code is subject to any such terms and you should check that you can comply with any applicable restrictions or terms and conditions before use.