HIGH-PPI
Hierarchical Graph Learning for Protein-Protein Interaction
Dependencies
HIGH-PPI runs on Python 3.7-3.9. To install all dependencies, directly run:
cd HIGH-PPI-main
conda env create -f environment.yml
conda activate HIGH-PPI
Download the following whl files to ./file/: torch-scatter, torch-sparse, torch-cluster, torch-spline-conv. The file names below are the cp39 (CPython 3.9, Linux x86_64) builds.
cd ./file
pip install torch_scatter-2.0.9-cp39-cp39-linux_x86_64.whl
pip install torch_sparse-0.6.13-cp39-cp39-linux_x86_64.whl
pip install torch_cluster-1.6.0-cp39-cp39-linux_x86_64.whl
pip install torch_spline_conv-1.2.1-cp39-cp39-linux_x86_64.whl
pip install torch-geometric
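After installation, a quick import check can confirm that the environment is usable. The sketch below is not part of HIGH-PPI: the helper names `missing_modules` and `python_supported` are hypothetical, and the module list assumes the usual import names of the packages installed above.

```python
import importlib.util
import sys

# Import names of the packages installed above (assumed to match the wheels).
REQUIRED = ["torch", "torch_scatter", "torch_sparse", "torch_cluster",
            "torch_spline_conv", "torch_geometric"]

def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

def python_supported(version=sys.version_info):
    """True if the interpreter falls in the 3.7-3.9 range stated above."""
    return (3, 7) <= tuple(version[:2]) <= (3, 9)

if __name__ == "__main__":
    print("Python version supported:", python_supported())
    print("Missing modules:", missing_modules(REQUIRED) or "none")
```

An empty result from `missing_modules(REQUIRED)` only shows that the packages can be found; it does not verify that their builds match the installed torch version.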
Datasets
Three datasets (SHS27k, SHS148k and STRING) can be downloaded from Google Drive:
- protein.actions.SHS27k.STRING.pro2.txt: PPI network of SHS27k
- protein.SHS27k.sequences.dictionary.pro3.tsv: Protein sequences of SHS27k
- protein.actions.SHS148k.STRING.txt: PPI network of SHS148k
- protein.SHS148k.sequences.dictionary.tsv: Protein sequences of SHS148k
- 9606.protein.action.v11.0.txt: PPI network of STRING
- protein.STRING_all_connected.sequences.dictionary.tsv: Protein sequences of STRING
- edge_list_12: Adjacency matrix for all proteins in SHS27k
- x_list: Feature matrix for all proteins in SHS27k
PPI Prediction
Example: predicting unknown PPIs in the SHS27k dataset with native structures:
Using Processed Data for SHS27k Dataset
Download protein.actions.SHS27k.STRING.pro2.txt, protein.SHS27k.sequences.dictionary.pro3.tsv, edge_list_12, x_list and vec5_CTC.txt to ./HIGH-PPI-main/protein_info/.
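Before training, it can help to confirm that all five files are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and it assumes that a downloaded file may carry an extension beyond the name listed above (the training command refers to x_list.pt and edge_list_12.npy).

```python
from pathlib import Path

# File names exactly as listed above; the download may add extensions
# (the training command refers to x_list.pt and edge_list_12.npy).
EXPECTED = [
    "protein.actions.SHS27k.STRING.pro2.txt",
    "protein.SHS27k.sequences.dictionary.pro3.tsv",
    "edge_list_12",
    "x_list",
    "vec5_CTC.txt",
]

def missing_files(directory, names=EXPECTED):
    """Return expected names with no matching file in `directory`.

    A name matches if a file has exactly that name or that name plus
    an extension (e.g. "x_list" matches "x_list.pt").
    """
    d = Path(directory)
    present = {p.name for p in d.iterdir() if p.is_file()} if d.is_dir() else set()
    return [n for n in names
            if not any(f == n or f.startswith(n + ".") for f in present)]

if __name__ == "__main__":
    print(missing_files("./HIGH-PPI-main/protein_info"))
```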
Data Processing for New Datasets (if applicable)
Prepare all related PDB files. Native protein structures can be downloaded in batches from the RCSB PDB, and predicted protein structures with errors can be downloaded from the AlphaFold database. Put all of the PDB files in ./protein_info/.
Generate adjacency matrix with native PDB files:
python ./protein_info/generate_adj.py --distance 12
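The `--distance 12` option suggests a distance-cutoff contact map, in which two residues are connected when they lie within the cutoff of each other. The sketch below illustrates that general technique on plain coordinate tuples. It is not the code in generate_adj.py, whose atom selection and output format are not described here. The function name `contact_edges` and the use of one coordinate per residue are assumptions; PDB coordinates are given in ångströms, so the cutoff is presumably 12 Å.

```python
import math

def contact_edges(coords, cutoff=12.0):
    """Return undirected edges (i, j), i < j, between residues whose
    coordinates lie within `cutoff` of each other."""
    edges = []
    for i in range(len(coords)):
        xi, yi, zi = coords[i]
        for j in range(i + 1, len(coords)):
            xj, yj, zj = coords[j]
            d = math.sqrt((xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2)
            if d <= cutoff:
                edges.append((i, j))
    return edges

# Three hypothetical residues on a line, 10 units apart.
coords = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
print(contact_edges(coords))  # [(0, 1), (1, 2)]
```

Residues 0 and 2 are 20 units apart, so they stay unconnected at a cutoff of 12.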
Generate feature matrix:
python ./protein_info/generate_feat.py
Training
To predict PPIs, use the 'model_train.py' script to train HIGH-PPI with the following options:
- ppi_path: str, PPI network information
- pseq_path: str, Protein sequences
- p_feat_matrix: str, The feature matrix of all protein graphs
- p_adj_matrix: str, The adjacency matrix of all protein graphs
- split: str, Dataset split mode
- save_path: str, Path for saving models, configs and results
- epoch_num: int, Training epochs
python model_train.py --ppi_path ./protein_info/protein.actions.SHS27k.STRING.pro2.txt --pseq ./protein_info/protein.SHS27k.sequences.dictionary.pro3.tsv --split random --p_feat_matrix ./protein_info/x_list.pt --p_adj_matrix ./protein_info/edge_list_12.npy --save_path ./result_save --epoch_num 500
Testing
Run the 'model_test.py' script to test HIGH-PPI with the following options:
- ppi_path: str, PPI network information
- pseq_path: str, Protein sequences
- p_feat_matrix: str, The feature matrix of all protein graphs
- p_adj_matrix: str, The adjacency matrix of all protein graphs
- model_path: str, Path for trained model
- index_path: str, Path for index being tested
python model_test.py --ppi_path ./protein.actions.SHS27k.STRING.pro2.txt --pseq ./protein.SHS27k.sequences.dictionary.pro3.tsv --p_feat_matrix ./x_list.pt --p_adj_matrix ./edge_list_12.npy --model_path ./result_save/gnn_training_seed_1/gnn_model_valid_best.ckpt --index_path ./train_val_split_data/train_val_split_1.json
Output
The output after running 'model_test.py' includes:
- valid_label_list: Real PPI labels for the test index
- test_pre_result_list: Predicted PPI results for the test index
- best_f1: Overall performance in terms of best-F1 score
- aupr: Performance in terms of AUPR score for all seven PPI types (reaction, binding, ptmod, activation, inhibition, catalysis and expression)
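To make the F1 metric concrete, the sketch below computes a micro-averaged F1 over 0/1 label vectors with one flag per interaction type. It is an illustration only, not the code in model_test.py: the function name `micro_f1`, the list-of-lists encoding, and the choice of micro-averaging are assumptions, and how the script selects the "best" F1 is not covered here.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over all (pair, type) entries of 0/1 label rows."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if p == 1 and t == 1:
                tp += 1      # predicted and real
            elif p == 1 and t == 0:
                fp += 1      # predicted but not real
            elif p == 0 and t == 1:
                fn += 1      # real but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Two hypothetical protein pairs, seven type flags each.
y_true = [[1, 1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 1, 0]]
y_pred = [[1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 0]]
print(micro_f1(y_true, y_pred))  # 0.75
```

Here there are 3 true positives, 1 false positive and 1 false negative, giving 2·3 / (2·3 + 1 + 1) = 0.75.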