Awesome
DeepRank2
Badges | |
---|---|
fairness | |
package | |
docs | |
tests | <!--- ![Python](https://img.shields.io/badge/python-3.11-blue.svg) --> |
running on | |
license |
Overview
<p align="center"> <img src="./deeprank2.png" width="700"> </p>DeepRank2 is an open-source deep learning (DL) framework for data mining of protein-protein interfaces (PPIs) or single-residue variants (SRVs). This package is an improved and unified version of three previously developed packages: DeepRank, DeepRank-GNN, and DeepRank-Mut.
As input, DeepRank2 takes PDB-formatted atomic structures, and map them to graphs, where nodes can represent either residues or atoms, as chosen by the user, and edges represent the interactions between them. DeepRank2 has the option to choose between two types of queries as input for the featurization phase:
- PPIs, for mining interaction patterns within protein-protein complexes, implemented by the
ProteinProteinInterfaceQuery
class; - SRVs, for mining mutation phenotypes within protein structures, implemented by the
SingleResidueVariantQuery
class.
The physico-chemical and geometrical features are then computed and assigned to each node and edge. The user can choose which features to generate from several pre-existing options defined in the package, or define custom features modules, as explained in the documentation. The graphs can then be mapped to 3D-grids as well. The generated data can be used for training neural networks. DeepRank2 also offers a pre-implemented training pipeline, using either CNNs (for 3D-grids) or GNNs (for graphs), as well as output exporters for evaluating performances.
Main features:
- Predefined atom-level and residue-level feature types
- e.g. atom/residue type, charge, size, potential energy
- All features' documentation is available here
- Predefined target types
- binary class, CAPRI categories, DockQ, RMSD, and FNAT
- Detailed docking scores documentation is available here
- Flexible definition of both new features and targets
- Features generation for both graphs and 3D-grids
- Efficient data storage in HDF5 format
- Support for both classification and regression (based on PyTorch and PyTorch Geometric)
📚 Documentation
📣 Discussions
Table of Contents
- DeepRank2
Installation
There are two ways to install DeepRank2:
- In a dockerized container. This allows you to use DeepRank2, including all the notebooks within the container (a protected virtual space), without worrying about your operating system or installation of dependencies.
- We recommend this installation for inexperienced users and to learn to use or test our software, e.g. using the provided tutorials. However, resources might be limited in this installation and we would not recommend using it for large datasets or on high-performance computing facilities.
- Local installation on your system. This allows you to use the full potential of DeepRank2, but requires a few additional steps during installation.
- We recommend this installation for more experienced users, for larger projects, and for (potential) contributors to the codebase.
Containerized Installation
We provide a pre-built Docker image hosted on GitHub Packages, allowing you to use DeepRank2 without worrying about installing dependencies or configuring your system. This is the recommended method for trying out the package quickly.
Pull and Run the Pre-build Docker Image (Recommended)
- Install Docker on your system, if not already installed.
- Pull the latest Docker image from GitHub Packages by running the following command:
docker pull ghcr.io/deeprank/deeprank2:latest
- Run the container from the pulled image:
docker run -p 8888:8888 ghcr.io/deeprank/deeprank2:latest
- Once the container is running, open your browser and navigate to
http://localhost:8888
to access the DeepRank2 application.
From here, you can use DeepRank2, including running the tutorial notebooks. More details about the tutorials can be found here. Note that the Docker container downloads only the raw PDB files required for the tutorials. To generate processed HDF5 files, you will need to run the data_generation_xxx.ipynb
notebooks. Since Docker containers may have limited memory resources, we reduce the number of data points processed in the tutorials. To fully utilize the package, consider installing it locally.
Build the Docker Image Manually
If you prefer to build the Docker image yourself or run into issues with the pre-built image, you can manually build and run the container as follows:
- Install Docker on your system, if not already installed.
- Clone the DeepRank2 repository and navigate to its root directory:
git clone https://github.com/DeepRank/deeprank2
cd deeprank2
- Build and run the Docker image:
docker build -t deeprank2 .
docker run -p 8888:8888 deeprank2
- Once the container is running, open your browser and navigate to
http://localhost:8888
to access the DeepRank2 application.
Removing the Docker Image
If you no longer need the Docker image (which can be quite large), you can remove it after stopping the container. Follow the container stop and remove the image instructions. For more general information on Docker, refer to the Docker documentation directly.
Local/remote Installation
Local installation is formally only supported on the latest stable release of ubuntu, for which widespread automated testing through continuous integration workflows has been set up. However, it is likely that the package runs smoothly on other operating systems as well.
Before installing DeepRank2 please ensure you have GCC installed: if running gcc --version
gives an error, run sudo apt-get install gcc
.
YML File Installation (Recommended)
You can use the provided YML file for creating a conda environment via mamba, containing the latest stable release of DeepRank2 and all its dependencies. This will install the CPU-only version of DeepRank2 on Python 3.10. Note that this will not work for MacOS. Do the Manual Installation instead.
# Create the environment
mamba env create -f https://raw.githubusercontent.com/DeepRank/deeprank2/main/env/deeprank2.yml
# Activate the environment
conda activate deeprank2
# Install the latest deeprank2 release
pip install deeprank2
We also provide a frozen environment YML file located at env/deeprank2_frozen.yml
with all dependencies set to fixed versions. The env/deeprank2_frozen.yml
file provides a frozen environment with all dependencies set to fixed versions. This ensures reproducibility of experiments and results by preventing changes in package versions that could occur due to updates or modifications in the default env/deeprank2.yml
. Use this frozen environment file for a stable and consistent setup, particularly if you encounter issues with the default environment file.
Manual Installation (Customizable)
If you want to use the GPUs, choose a specific python version (note that at the moment we support python 3.10 only), are a MacOS user, or if the YML installation was not successful, you can install the package manually. We advise to do this inside a conda virtual environment.
You can first create a copy of the deeprank2.yml
file, place it in your current directory, and remove the packages that cannot be installed properly, or the ones that you want to install differently (e.g., pytorch-related packages if you wish to install the CUDA version), and then proceed with the environment creation by using the edited YML file: conda env create -f deeprank2.yml
or mamba env create -f deeprank2.yml
, if you have mamba installed. Then activate the environment, and proceed with installing the missing packages, which might fall into the following list. If you have any issues during installation of dependencies, please refer to the official documentation for each package (linked below), as our instructions may be out of date (last tested on 19 Feb 2024):
- MSMS: Here for MacOS with M1 chip users.
- PyTorch
- Pytorch regularly publishes updates and not all newest versions will work stably with DeepRank2. Currently, the package is tested on ubuntu using PyTorch 2.1.1.
- We support torch's CPU library as well as CUDA.
- PyG and its optional dependencies:
torch_scatter
,torch_sparse
,torch_cluster
,torch_spline_conv
.- The exact command to install pyg will depend on the version of pytorch you are using. Please refer to the source's installation instructions (we recommend using the pip installation for this as it also shows the command for the dependencies).
- FreeSASA.
Finally install deeprank2 itself: pip install deeprank2
.
Alternatively, get the latest updates by cloning the repo and installing the editable version of the package with:
git clone https://github.com/DeepRank/deeprank2
cd deeprank2
pip install -e .'[test]'
The test
extra is optional, and can be used to install test-related dependencies, useful during development.
Testing DeepRank2 Installation
If you have cloned the repository, you can check that all components were installed correctly using pytest
. We especially recommend doing this in case you installed DeepRank2 and its dependencies manually (the latter option above).
The quick test should be sufficient to ensure that the software works, while the full test (a few minutes) will cover a much broader range of settings to ensure everything is correct.
First run pip install pytest
, if you did not install it above. Then run pytest tests/test_integration.py
for the quick test or just pytest
for the full test (expect a few minutes to run).
Contributing
If you would like to contribute to the package in any way, please see our guidelines.
Using DeepRank2
The following section serves as a first guide to start using the package, using protein-protein Interface (PPI) queries as example. For an enhanced learning experience, we provide in-depth tutorial notebooks for generating PPI data, generating SVR data, and for the training pipeline. For more details, see the extended documentation.
Data Generation
For each protein-protein complex (or protein structure containing a missense variant), a Query
can be created and added to the QueryCollection
object, to be processed later on. Two subtypes of Query
exist: ProteinProteinInterfaceQuery
and SingleResidueVariantQuery
.
A Query
takes as inputs:
- A
.pdb
file, representing the molecular structure. - The resolution (
"residue"
or"atom"
), i.e. whether each node should represent an amino acid residue or an atom. chain_ids
, the chain ID or IDs (generally single capital letter(s)).SingleResidueVariantQuery
takes a single ID, which represents the chain containing the variant residue.ProteinProteinInterfaceQuery
takes a pair of ids, which represent the chains between which the interface exists.- Note that in either case this does not limit the structure to residues from this/these chain/s. The structure contained in the
.pdb
can thus have any number of chains, and residues from these chains will be included in the graphs and 3D-grids produced by DeepRank2 (if they are within theinfluence_radius
).
- Optionally, the correspondent position-specific scoring matrices (PSSMs), in the form of
.pssm
files.
from deeprank2.query import QueryCollection, ProteinProteinInterfaceQuery
queries = QueryCollection()
# Append data points
queries.add(ProteinProteinInterfaceQuery(
pdb_path = "tests/data/pdb/1ATN/1ATN_1w.pdb",
resolution = "residue",
chain_ids = ["A", "B"],
targets = {
"binary": 0
},
pssm_paths = {
"A": "tests/data/pssm/1ATN/1ATN.A.pdb.pssm",
"B": "tests/data/pssm/1ATN/1ATN.B.pdb.pssm"
}
))
queries.add(ProteinProteinInterfaceQuery(
pdb_path = "tests/data/pdb/1ATN/1ATN_2w.pdb",
resolution = "residue",
chain_ids = ["A", "B"],
targets = {
"binary": 1
},
pssm_paths = {
"A": "tests/data/pssm/1ATN/1ATN.A.pdb.pssm",
"B": "tests/data/pssm/1ATN/1ATN.B.pdb.pssm"
}
))
queries.add(ProteinProteinInterfaceQuery(
pdb_path = "tests/data/pdb/1ATN/1ATN_3w.pdb",
resolution = "residue",
chain_ids = ["A", "B"],
targets = {
"binary": 0
},
pssm_paths = {
"A": "tests/data/pssm/1ATN/1ATN.A.pdb.pssm",
"B": "tests/data/pssm/1ATN/1ATN.B.pdb.pssm"
}
))
The user is free to implement a custom query class. Each implementation requires the build
method to be present.
The queries can then be processed into graphs only or both graphs and 3D-grids, depending on which kind of network will be used later for training.
from deeprank2.features import components, conservation, contact, exposure, irc, surfacearea
from deeprank2.utils.grid import GridSettings, MapMethod
feature_modules = [components, conservation, contact, exposure, irc, surfacearea]
# Save data into 3D-graphs only
hdf5_paths = queries.process(
"<output_folder>/<prefix_for_outputs>",
feature_modules = feature_modules)
# Save data into graphs and 3D-grids
hdf5_paths = queries.process(
"<output_folder>/<prefix_for_outputs>",
feature_modules = feature_modules,
grid_settings = GridSettings(
# the number of points on the x, y, z edges of the cube
points_counts = [20, 20, 20],
# x, y, z sizes of the box in Ã…
sizes = [1.0, 1.0, 1.0]),
grid_map_method = MapMethod.GAUSSIAN)
Datasets
Data can be split in sets implementing custom splits according to the specific application. Assuming that the training, validation and testing ids have been chosen (keys of the HDF5 file/s), then the DeeprankDataset
objects can be defined.
GraphDataset
For training GNNs the user can create a GraphDataset
instance:
from deeprank2.dataset import GraphDataset
node_features = ["bsa", "res_depth", "hse", "info_content", "pssm"]
edge_features = ["distance"]
target = "binary"
train_ids = [<ids>]
valid_ids = [<ids>]
test_ids = [<ids>]
# Creating GraphDataset objects
dataset_train = GraphDataset(
hdf5_path = hdf5_paths,
subset = train_ids,
node_features = node_features,
edge_features = edge_features,
target = target
)
dataset_val = GraphDataset(
hdf5_path = hdf5_paths,
subset = valid_ids,
train_source = dataset_train
)
dataset_test = GraphDataset(
hdf5_path = hdf5_paths,
subset = test_ids,
train_source = dataset_train
)
GridDataset
For training CNNs the user can create a GridDataset
instance:
from deeprank2.dataset import GridDataset
features = ["bsa", "res_depth", "hse", "info_content", "pssm", "distance"]
target = "binary"
train_ids = [<ids>]
valid_ids = [<ids>]
test_ids = [<ids>]
# Creating GraphDataset objects
dataset_train = GridDataset(
hdf5_path = hdf5_paths,
subset = train_ids,
features = features,
target = target
)
dataset_val = GridDataset(
hdf5_path = hdf5_paths,
subset = valid_ids,
train_source = dataset_train,
)
dataset_test = GridDataset(
hdf5_path = hdf5_paths,
subset = test_ids,
train_source = dataset_train,
)
Training
Let's define a Trainer
instance, using for example of the already existing GINet
. Because GINet
is a GNN, it requires a dataset instance of type GraphDataset
.
from deeprank2.trainer import Trainer
from deeprank2.neuralnets.gnn.vanilla_gnn import VanillaNetwork
trainer = Trainer(
VanillaNetwork,
dataset_train,
dataset_val,
dataset_test
)
The same can be done using a CNN, for example CnnClassification
. Here a dataset instance of type GridDataset
is required.
from deeprank2.trainer import Trainer
from deeprank2.neuralnets.cnn.model3d import CnnClassification
trainer = Trainer(
CnnClassification,
dataset_train,
dataset_val,
dataset_test
)
By default, the Trainer
class creates the folder ./output
for storing predictions information collected later on during training and testing. HDF5OutputExporter
is the exporter used by default, but the user can specify any other implemented exporter or implement a custom one.
Optimizer (torch.optim.Adam
by default) and loss function can be defined by using dedicated functions:
import torch
trainer.configure_optimizers(torch.optim.Adamax, lr = 0.001, weight_decay = 1e-04)
Then the Trainer
can be trained and tested; the best model in terms of validation loss is saved by default, and the user can modify so or indicate where to save it using the train()
method parameter filename
.
trainer.train(
nepoch = 50,
batch_size = 64,
validate = True,
filename = "<my_folder/model.pth.tar>")
trainer.test()
Run a Pre-trained Model on New Data
If you want to analyze new PDB files using a pre-trained model, the first step is to process and save them into HDF5 files as we have done above.
Then, the DeeprankDataset
instance for the newly processed data can be created. Do this by specifying the path for the pre-trained model in train_source
, together with the path to the HDF5 files just created. Note that there is no need of setting the dataset's parameters, since they are inherited from the information saved in the pre-trained model. Let's suppose that the model has been trained with GraphDataset
objects:
from deeprank2.dataset import GraphDataset
dataset_test = GraphDataset(
hdf5_path = "<output_folder>/<prefix_for_outputs>",
train_source = "<pretrained_model_path>"
)
Finally, the Trainer
instance can be defined and the new data can be tested:
from deeprank2.trainer import Trainer
from deeprank2.neuralnets.gnn.vanilla_gnn import VanillaNetwork
from deeprank2.utils.exporters import HDF5OutputExporter
trainer = Trainer(
VanillaNetwork,
dataset_test = dataset_test,
pretrained_model = "<pretrained_model_path>",
output_exporters = [HDF5OutputExporter("<output_folder_path>")]
)
trainer.test()
For more details about how to run a pre-trained model on new data, see the docs.
Computational Performances
We measured the efficiency of data generation in DeepRank2 using the tutorials' PDB files (~100 data points per data set), averaging the results run on Apple M1 Pro, using a single CPU.
Parameter settings were: atomic resolution, distance_cutoff
of 5.5 Ã…, radius (for SRV only) of 10 Ã…. The features modules used were components
, contact
, exposure
, irc
, secondary_structure
, surfacearea
, for a total of 33 features for PPIs and 26 for SRVs (the latter do not use irc
features).
Data processing speed <br />[seconds/structure] | Memory <br />[megabyte/structure] | |
---|---|---|
PPIs | graph only: 2.99 (std 0.23) <br />graph+grid: 11.35 (std 1.30) | graph only: 0.54 (std 0.07) <br />graph+grid: 16.09 (std 0.44) |
SRVs | graph only: 2.20 (std 0.08) <br />graph+grid: 2.85 (std 0.10) | graph only: 0.05 (std 0.01) <br />graph+grid: 17.52 (std 0.59) |
Package Development
If you're looking for developer documentation, go here.