
MMSite: A Multi-modal Framework for the Identification of Active Sites in Proteins

Update 2024-10-29: The pretrained models of MMSite are available at https://zenodo.org/records/14004698.

1. Preparation

1.1 Environment

You can manage the environment with Anaconda. We provide the environment configuration file environment.yml for reference. Create the environment with the following command:

conda env create -f environment.yml

1.2 Data

Follow the instructions in dataprocess/README.md to prepare the data. That file describes how to split the data at a clustering threshold of 10%. To use a different threshold, change it when you run the mmseqs2 command.
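The point of clustering before splitting is that sequences from the same cluster should not end up on both sides of the split. The sketch below illustrates that idea only and is not necessarily the procedure in dataprocess/README.md. It assumes a two-column TSV of cluster assignments (representative ID, member ID), which is the layout MMseqs2 writes for clustering results; the function name split_by_cluster, the test_fraction value, and the seed are hypothetical.

```python
import csv
import random
from collections import defaultdict


def split_by_cluster(cluster_tsv, test_fraction=0.2, seed=0):
    """Split sequence IDs so that no cluster spans both sets.

    cluster_tsv: two-column TSV (representative ID, member ID).
    test_fraction: share of clusters (not sequences) sent to the test set;
    this default is an illustrative assumption, not the paper's ratio.
    """
    clusters = defaultdict(list)
    with open(cluster_tsv, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            representative, member = row[0], row[1]
            clusters[representative].append(member)

    # Shuffle whole clusters rather than individual sequences.
    representatives = sorted(clusters)
    random.Random(seed).shuffle(representatives)

    n_test = int(len(representatives) * test_fraction)
    test_reps = set(representatives[:n_test])

    train, test = [], []
    for representative, members in clusters.items():
        (test if representative in test_reps else train).extend(members)
    return train, test
```

Because whole clusters are assigned, the resulting sequence counts only approximate test_fraction when cluster sizes vary.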

2. Training

2.1 Download the Pre-trained Model

MMSite uses pre-trained PLM and BLM models to initialize the features. To reproduce the main results in our paper, download the pre-trained models from Hugging Face and put them all in the pretrained_weights folder.
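One possible way to fetch the models is the snapshot_download function from the huggingface_hub package. This is a minimal sketch under assumptions: the repository IDs are not listed here, so they are passed in by the caller, and naming each local folder after the last part of the repository ID is a hypothetical choice. Whatever layout you use must match the paths you set in config.yaml.

```python
from pathlib import Path


def plan_downloads(repo_ids, root="pretrained_weights"):
    """Map each Hugging Face repo ID to a local folder under root.

    The folder name is the part of the repo ID after the last "/"
    (an assumption made for illustration).
    """
    return {repo_id: Path(root) / repo_id.split("/")[-1] for repo_id in repo_ids}


def download_all(repo_ids, root="pretrained_weights"):
    """Download every repo in repo_ids into its planned folder."""
    # Imported here so that plan_downloads works without huggingface_hub.
    from huggingface_hub import snapshot_download

    for repo_id, target in plan_downloads(repo_ids, root).items():
        target.mkdir(parents=True, exist_ok=True)
        snapshot_download(repo_id=repo_id, local_dir=str(target))
```

Calling download_all with the repository IDs of the PLM and BLM you intend to use fills pretrained_weights with one subfolder per model.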

2.2 Configuration

Specify the configuration in config.yaml, including the paths to the pre-trained models and the data, the training parameters, etc.

2.3 Training

Train the model with the following command (training takes about 7 hours on a single NVIDIA GeForce RTX 4090 GPU):

python train.py --config /path/to/config.yaml

After training, you will find best_model_fuse_xxx.pth in the runs/timestamp folder; this is the final model.
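Since each run presumably writes to its own timestamped folder, a small helper can locate the newest checkpoint instead of copying the path by hand. This is a sketch that assumes checkpoints sit exactly one level below runs and match best_model_fuse_*.pth; the function name latest_best_model is hypothetical and not part of the repository.

```python
from pathlib import Path


def latest_best_model(runs_dir="runs"):
    """Return the most recently modified best_model_fuse_*.pth under runs_dir."""
    # One directory level per run: runs/<timestamp>/best_model_fuse_*.pth
    candidates = list(Path(runs_dir).glob("*/best_model_fuse_*.pth"))
    if not candidates:
        raise FileNotFoundError(f"no best_model_fuse_*.pth found under {runs_dir}")
    # Pick by file modification time, so folder naming does not matter.
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

The returned path is the one to use wherever the final model is needed.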

3. Inference

Put your data in dataset/infer.tsv, following the format of dataset/infer_samples.tsv. Then specify the path to best_model_fuse_xxx.pth in inference.py. Finally, run the following command to get the prediction results:

python inference.py
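Before running the command above, it can help to confirm that dataset/infer.tsv has the same shape as dataset/infer_samples.tsv. The sketch below checks only the number of tab-separated columns per row, since the column meanings are defined by the sample file and not restated here; the function name check_tsv_columns is hypothetical.

```python
import csv


def check_tsv_columns(data_path, sample_path):
    """Return the 1-based row numbers in data_path whose column count
    differs from the first row of sample_path."""
    with open(sample_path, newline="") as handle:
        expected = len(next(csv.reader(handle, delimiter="\t")))

    bad_rows = []
    with open(data_path, newline="") as handle:
        for number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
            if len(row) != expected:
                bad_rows.append(number)
    return bad_rows
```

An empty result from check_tsv_columns("dataset/infer.tsv", "dataset/infer_samples.tsv") means every row has the expected number of columns; it does not validate the contents of those columns.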