Awesome

GraphSol

A Protein Solubility Predictor developed by Graph Convolutional Network and Predicted Contact Map

The source code for our paper Structure-aware protein solubility prediction from sequence through graph convolutional network and predicted contact map

0. Update(2022-03-01)

We have reimplemented the GraphSol model by using dgl, which have been optimized in training time and costing memory without losing the accuracy.

TODO

Merge the prediction workflow into the original workflow.
Batch size > 1 in the reimplemention.

1. Dependencies

The code has been tested under Python 3.7.9, with the following packages installed (along with their dependencies):

torch==1.6.0
numpy==1.19.1
scikit-learn==0.23.2
pandas==1.1.0
tqdm==4.48.2

2. How to retrain the GraphSol model and test?

If you want to reproduce our result, please refer to the steps below.

Step 1: Download all sequence features

Please go to the path ./Data/Feature Link.txt and download Node Features.zip and Edge Features.zip

Step 2: Decompress all `.zip` files

Please unzip 3 zip files and put them into the corresponding paths.

./Data/node_features.zip -> ./Data/node_features
./Data/edge_features.zip -> ./Data/edge_features
./Data/fasta.zip -> ./Data/fasta

Step 3: Run the training code

Run the following python script and it will take about 1 hour to train the model.

$ python Train.py

A trained model will be saved in the folder ./Model and validation results in the folder ./Result

Step 4: Run the test code

Run the following python script and it will be finished in a few seconds.

$ python Test.py

3. How to predict protein solubility by the pretrained GraphSol model?

Note:

This is a demo for prediction that contains of 5 protein sequences aaeX, aas, aat, abgA, abgB with their preprocessed feature files. You can directly use $ python predict.py, and then the result file will be generated in ./Predict/Result/result.csv with the output format:

name	prediction	sequence
aaeX	0.3201722800731659	MSLFPVIVVFGLSFPPIFFELLLSLAIFWLVRRVLVPTGIYDFVWHPALFNTALYC...
aas	0.2957891821861267	MLFSFFRNLCRVLYRVRVTGDTQALKGERVLITPNHVSFIDGILLGLFLPVRPVFA...
...	...	...

If you want to predict your own protein sequences with using our pretrained models please refer to the steps below.

Step 1: Prepare your single fasta files

For each protein sequence, you should prepare a corresponding fasta file.

We follow the common fasta file format that starts with >{protein sequence name}, then a protein sequence of 80 amino acid letters within one row. This is our demo in /Data/source/abgB.

>abgB
MQEIYRFIDDAIEADRQRYTDIADQIWDHPETRFEEFWSAEHLASALESAGFTVTRNVGNIPNAFIASFGQGKPVIALL
GEYDALAGLSQQAGCAQPTSVTPGENGHGCGHNLLGTAAFAAAIAVKKWLEQYGQGGTVRFYGCPGEEGGSGKTFMVRE
GVFDDVDAALTWHPEAFAGMFNTRTLANIQASWRFKGIAAHAANSPHLGRSALDAVTLMTTGTNFLNEHIIEKARVHYA
ITNSGGISPNVVQAQAEVLYLIRAPEMTDVQHIYDRVAKIAEGAALMTETTVECRFDKACSSYLPNRTLENAMYQALSH
FGTPEWNSEELAFAKQIQATLTSNDRQNSLNNIAATGGENGKVFALRHRETVLANEVAPYAATDNVLAASTDVGDVSWK
LPVAQCFSPCFAVGTPLHTWQLVSQGRTSIAHKGMLLAAKTMAATTVNLFLDSGLLQECQQEHQQVTDTQPYHCPIPKN
VTPSPLK

Note:

(1) Please name your protein sequence uniquely and as short as possible, since the protein sequence name will be used as the file name in the step 3, such as abgB.pssm, abgB.spd33.

(2) Please name your fasta file without using any suffix, such as abgB instead of abgB.fasta or abgB.fa, otherwise the feature generation software in the step 3 will name the feature file with the format of abgB.fasta.pssm or abgB.fa.pssm, leading to unexpected error.

Step 2: Prepare your total fasta file

We follow the common fasta file format that starts with >{protein sequence name}, hen a protein sequence of 80 amino acid letters within one row. This is part of our demo in ./Data/upload/input.fasta.

>aat
MRLVQLSRHSIAFPSPEGALREPNGLLALGGDLSPARLLMAYQRGIFPWFSPGDPILWWSPDPRAVLWPESLHISRSMK
RFHKRSPYRVTMNYAFGQVIEGCASDREEGTWITRGVVEAYHRLHELGHAHSIEVWREDELVGGMYGVAQGTLFCGESM
FSRMENASKTALLVFCEEFIGHGGKLIDCQVLNDHTASLGACEIPRRDYLNYLNQMRLGRLPNNFWVPRCLFSPQE
>abgA
MESLNQFVNSLAPKLSHWRRDFHHYAESGWVEFRTATLVAEELHQLGYSLALGREVVNESSRMGLPDEFTLQREFERAR
QQGALAQWIAAFEGGFTGIVATLDTGRPGPVMAFRVDMDALDLSEEQDVSHRPYRDGFASCNAGMMHACGHDGHTAIGL
GLAHTLKQFESGLHGVIKLIFQPAEEGTRGARAMVDAGVVDDVDYFTAVHIGTGVPAGTVVCGSDNFMATTKFDAHFTG
TAAHAGAKPEDGHNALLAAAQATLALHAIAPHSEGASRVNVGVMQAGSGRNVVPASALLKVETRGASDVINQYVFDRAQ
QAIQGAATMYGVGVETRLMGAATASSPSPQWVAWLQSQAAQVAGVNQAIERVEAPAGSEDATLMMARVQQHQGQASYVV
FGTQLAAGHHNEKFDFDEQVLAIAVETLARTALNFPWTRGI

Step 3: Prepare 5 node feature files and 1 edge feature file

Note:

(1) We don't integrate the feature generation software in our repository, please use the recommend software(see the table below) to generate the feature files !!!

(2) We have deployed all feature generation softwares in our servers to calculate the features in bulk, the link below is utilized to map the sequence files to feature files as an example.

(3) In the software SPOT-Contact, it needs a sequence file with suffix .fasta, thus you should rename the original fasta file abgB to abgB.fasta after generating other features.

(4) THIS STEP WILL COST MOST OF THE TIME !!!!! (The sequence with more amino acids will cost longer time, so we recommend to use the protein sequence less than 700 amino acids.)

Software	Version	Input	Output
PSI-BLAST	v2.7.1	abgB	abgB.bla, abgB.pssm
HH-Suite3	v3.0.3	abgB	abgB.hhr, abgB.hhm, abgB.a3m
SPIDER3	v1.0	abgB, abgB.pssm, abgB.hhm	abgB.spd33
DCA	v1.0	abgB.a3m	abgB.di
CCMPred	v1.0	abgB.a3m	abgB.mat
SPOT-Contact	v1.0	abgB.fasta, abgB.pssm, abgB.hhm, abgB.di, abgB.mat	abgB.spotcon

Then put all the generated files into the folder ./Data/source/(We have provided a list of files as an example). Other precautions when using the feature generation software please refer to the corresponding software document.

Step 4: Run the predict code

$ python predict.py

All the prediction result will be stored as in ./Result/result.csv.

4. The web server of the GraphSol model

Our platform are highly recommended to be academicly used only (e.g. for limited protein sequences).

https://biomed.nscc-gz.cn/apps/GraphSol

5. How to train the GraphSol model with your own data?

If you want to train a model with your own data:

(1) Please refer to the feature generation steps to preprocess 6 feature files.

(2) Use get1D_features.py and get2D_features.py to generate two matrices, and then move them to the folders ./Data/node_features and ./Data/edge_features, respectively.

(3) Make a general csv file with the format like ./Data/eSol_train.csv or ./Data/eSol_test.csv.

(4) Run $ python Train.py, and optionly tune the hypermeters in the same file.

6. Citations

Please cite our paper if you want to use our code in your work.

@article{chen2021structure,
  title={Structure-aware protein solubility prediction from sequence through graph convolutional network and predicted contact map},
  author={Chen, Jianwen and Zheng, Shuangjia and Zhao, Huiying and Yang, Yuedong},
  journal={Journal of cheminformatics},
  volume={13},
  number={1},
  pages={1--10},
  year={2021},
  publisher={Springer}
}

Awesome

GraphSol

0. Update(2022-03-01)

TODO

1. Dependencies

2. How to retrain the GraphSol model and test?

Step 1: Download all sequence features

Step 2: Decompress all .zip files

Step 3: Run the training code

Step 4: Run the test code

3. How to predict protein solubility by the pretrained GraphSol model?

Step 1: Prepare your single fasta files

Step 2: Prepare your total fasta file

Step 3: Prepare 5 node feature files and 1 edge feature file

Step 4: Run the predict code

4. The web server of the GraphSol model

5. How to train the GraphSol model with your own data?

6. Citations

Step 2: Decompress all `.zip` files