SPROF: To Improve Protein Sequence Profile Prediction through Image Captioning on Pairwise Residue Distance Map

Getting Started

These instructions will get you a copy of the project up and running on your local machine.

Environment

SPROF is implemented in Python 3.

Requirements

Installing requirements:

pip3 install -r requirements.txt

or, to avoid problems when multiple Python environments are installed:

python3 -m pip install -r requirements.txt

If you want to generate features with 'raw_feature_generate.py', install DSSP:

apt-get install dssp
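
Before running 'raw_feature_generate.py', it can help to confirm that DSSP is reachable from your shell. The following is a minimal sketch, not part of the repository. It assumes the executable is called either `dssp` or `mkdssp`; the actual name depends on the package version, and `find_dssp` is a hypothetical helper.

```python
import shutil

# Candidate executable names. Which one exists depends on the DSSP package
# version, so both names are assumptions rather than facts about this repo.
DSSP_CANDIDATES = ("dssp", "mkdssp")


def find_dssp(candidates=DSSP_CANDIDATES):
    """Return the path of the first DSSP executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


if __name__ == "__main__":
    dssp_path = find_dssp()
    if dssp_path is None:
        print("DSSP not found on PATH; try 'apt-get install dssp'.")
    else:
        print(f"Using DSSP at {dssp_path}")
```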

Data Preprocess

This repository includes the CASP13 test set data in 'input/casp_test/' and 'target/casp_test/', raw features in 'raw_features/', and raw PDB files in 'raw_pdbs/'.

If you want to generate other data for training/testing:

First, install DSSP with 'apt-get install dssp' and place your PDB files in 'raw_pdbs/' (that directory already contains example PDB files in the correct format).

Second, create a list file (for example, 'xxx.txt') that contains the names of all your training/test PDB files.

Third, use 'raw_feature_generate.py' to generate the raw features:

usage: raw_feature_generate.py [-h] [--preprocess_list PREPROCESS_LIST]
  --preprocess_list PREPROCESS_LIST
                        The path of your preprocess list.
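
The list file passed to --preprocess_list can be generated rather than typed by hand. The sketch below is an illustration under stated assumptions, not the authors' tooling. It assumes the list holds one name per line, that a name is the file name without its extension, and that the files end in '.pdb'. Check these assumptions against the examples in 'raw_pdbs/' before relying on it. `write_pdb_list` is a hypothetical helper.

```python
from pathlib import Path


def write_pdb_list(pdb_dir, list_path, suffix=".pdb"):
    """Write the stem of every file ending in `suffix` to list_path.

    Assumed list format (not confirmed by the repository): one name per
    line, without the file extension.
    """
    names = sorted(p.stem for p in Path(pdb_dir).glob(f"*{suffix}"))
    Path(list_path).write_text("".join(f"{name}\n" for name in names))
    return names
```

For example, `write_pdb_list("raw_pdbs", "xxx.txt")` would list every '.pdb' file currently in 'raw_pdbs/'.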

The generated raw features are written to 'raw_features/'. Then use 'preprocess.py' to generate the features for model training/testing:

usage: preprocess.py [-h] [--preprocess_list PREPROCESS_LIST]
                [--pdb_path PDB_PATH] [--features_path FEATURES_PATH]
                [--input_path INPUT_PATH] [--target_path TARGET_PATH]

optional arguments:
  -h, --help            show this help message and exit
  --preprocess_list PREPROCESS_LIST
                        the path of a preprocess pdb list.
  --pdb_path PDB_PATH
                        the path of pdb
  --features_path FEATURES_PATH
                        the path of features
  --input_path INPUT_PATH
                        the path of input to save
  --target_path TARGET_PATH
                        the path of target to save

The generated input and target data for training/testing are written to input_path and target_path.
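
The two preprocessing scripts must run in order, since 'preprocess.py' reads the raw features that 'raw_feature_generate.py' produces. The sketch below chains them from Python using only the flags shown in the usage text above. The helper names are hypothetical, and it assumes both scripts are run from the repository root.

```python
import subprocess
import sys


def preprocessing_commands(preprocess_list, pdb_path, features_path,
                           input_path, target_path):
    """Build the two command lines, in the order they need to run."""
    raw = [sys.executable, "raw_feature_generate.py",
           "--preprocess_list", preprocess_list]
    pre = [sys.executable, "preprocess.py",
           "--preprocess_list", preprocess_list,
           "--pdb_path", pdb_path,
           "--features_path", features_path,
           "--input_path", input_path,
           "--target_path", target_path]
    return [raw, pre]


def run_preprocessing(commands):
    """Run each command in turn; check=True stops at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

A call such as `run_preprocessing(preprocessing_commands("xxx.txt", "raw_pdbs", "raw_features", "input/my_set", "target/my_set"))` would run both steps; 'input/my_set' and 'target/my_set' are placeholder output paths chosen for illustration.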

Training

Command line:

usage: train.py [-h] [--batch_size BATCH_SIZE] [--learning_rate LEARNING_RATE]
                [--maximum_epoch MAXIMUM_EPOCH]
                [--sequential_features SEQUENTIAL_FEATURES]
                [--pairwise_features PAIRWISE_FEATURES]
                [--target_output TARGET_OUTPUT] 
                [--train_list TRAIN_LIST] [--models_name MODELS_NAME]

optional arguments:
  -h, --help            show this help message and exit
  --learning_rate LEARNING_RATE
                        The learning rate of ADAM optimization.
  --maximum_epoch MAXIMUM_EPOCH
                        The maximum epoch of training
  --sequential_features SEQUENTIAL_FEATURES
                        The full path of sequential features of training data.
  --pairwise_features PAIRWISE_FEATURES
                        The full path of pairwise features of training data.
  --target_output TARGET_OUTPUT
                        The full path of the target outputs.
  --train_list TRAIN_LIST
                        The full path of the train list.
  --models_name MODELS_NAME
                        The name of models.

Start Training:

The training input data is too large to upload. To generate it, follow the steps in 'Data Preprocess' and use 'train_list' as the preprocess_list. If you run into any problem, contact chensh88@mail2.sysu.edu.cn.
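
Because the training data has to be generated locally, it may be worth checking that every path you intend to pass to 'train.py' exists before starting a long run. This is a minimal sketch with a hypothetical helper, `missing_training_inputs`. It only tests for existence and makes no assumption about what the files contain.

```python
from pathlib import Path


def missing_training_inputs(sequential_features, pairwise_features,
                            target_output, train_list):
    """Return the names of the train.py arguments whose paths do not exist."""
    paths = {
        "sequential_features": sequential_features,
        "pairwise_features": pairwise_features,
        "target_output": target_output,
        "train_list": train_list,
    }
    return [name for name, path in paths.items() if not Path(path).exists()]
```

An empty list means all four paths were found.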

Example of training:

    python3 train.py 
    python3 train.py --learning_rate=0.0005 --maximum_epoch=40 --sequential_features='input/casp_test/sequential_features' --pairwise_features='input/casp_test/pairwise_features' --target_output='target/casp_test' --train_list='train_list' --models_name='models'

Training produces several models, which are saved in your models_name directory.

Testing

Command line:

usage: test.py [-h] [--sequential_features SEQUENTIAL_FEATURES]
               [--pairwise_features PAIRWISE_FEATURES]
               [--target_output TARGET_OUTPUT] [--feature_list FEATURE_LIST]
               [--models_path MODELS_PATH]

optional arguments:
  -h, --help            show this help message and exit
  --sequential_features SEQUENTIAL_FEATURES
                        The full path of sequential features of test data.
  --pairwise_features PAIRWISE_FEATURES
                        The full path of pairwise features of test data.
  --target_output TARGET_OUTPUT
                        The full path of the target outputs.
  --feature_list FEATURE_LIST
                        The full path of the test list.
  --models_path MODELS_PATH
                        parent path of your models.

Usage:

The CASP13 test set is included in the repository, so you can test our model's performance easily.

    python3 test.py
    python3 test.py --sequential_features='input/casp_test/sequential_features' --pairwise_features='input/casp_test/pairwise_features' --target_output='target/casp_test' --feature_list='train_list' --models_path='models'

Cite

If you find this work useful in your research, please consider citing the paper:
"To Improve Protein Sequence Profile Prediction through Image Captioning on Pairwise Residue Distance Map" https://pubs.acs.org/doi/10.1021/acs.jcim.9b00438

Contact

yuedong.yang@gmail.com or chensh88@mail2.sysu.edu.cn