MSA-Augmentor codebase

Codebase for the paper "Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation" (arXiv:2306.01824).

Pretrain

All commands are designed for a Slurm cluster. We use the Hugging Face Trainer to pretrain the model; more details can be found here.

Construct a local binary dataset (loading training data from the cluster is too slow, so it is better to first convert your whole dataset to .bin files, as shown in datasets):

python utils.py \
   --output_dir ./datasets/ \
   --random_src \
   --src_seq_per_msa_l 5 \
   --src_seq_per_msa_u 10 \
   --total_seq_per_msa 25 \
   --local_file_path path_to_pretrained_dataset
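
The sampling flags in the command above are not documented in this README. The sketch below shows one plausible reading, purely as an assumption: --total_seq_per_msa is the number of sequences drawn from each MSA, and --src_seq_per_msa_l / --src_seq_per_msa_u are inclusive lower and upper bounds on how many of those become source sequences when --random_src is set. The function name split_msa and its arguments are hypothetical and do not come from utils.py.

```python
import random


def split_msa(sequences, src_lower=5, src_upper=10, total=25, seed=None):
    """Draw `total` sequences from one MSA and split them into source/target.

    Hypothetical illustration only: the real logic lives in utils.py and
    may differ from this reading of the command-line flags.
    """
    if len(sequences) < total:
        raise ValueError("MSA has fewer sequences than the requested total")
    rng = random.Random(seed)
    # Sample without replacement so no sequence appears twice.
    chosen = rng.sample(sequences, total)
    # Random source count within the inclusive [src_lower, src_upper] range.
    n_src = rng.randint(src_lower, src_upper)
    return chosen[:n_src], chosen[n_src:]


# Toy MSA of 40 placeholder sequences.
toy_msa = [f"SEQ{i}" for i in range(40)]
source, target = split_msa(toy_msa, seed=0)
print(len(source), len(target))
```

Under this reading, every example holds 25 sequences in total, of which between 5 and 10 are given to the model as input and the rest are generation targets.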

Install the dependencies: pip install -r requirements.txt
Run the pretraining script: bash run.sh

Inference

Download the checkpoints.
Run inference with bash scripts/inference.sh.

Note: all inference code is in inference.py

Evaluation

DATASET	MSA	STRUCTURE
CASP15	https://zenodo.org/record/8126538	google drive

AlphaFold2 Prediction

Please refer to the AlphaFold2 GitHub repository to learn how to set up AF2.
We provide a script that launches AlphaFold2 structure prediction: bash scripts/run_af2. You need to modify the MSA directory in the script before running it.

LDDT

To download the lDDT evaluation tool, follow the OpenStructure documentation: https://www.openstructure.org/
For usage, see https://www.openstructure.org/docs/2.4/mol/alg/lddt/
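
For intuition about what the score measures, here is a minimal sketch of a simplified, Cα-only lDDT. It assumes the standard definition: a 15 Å inclusion radius in the reference and distance-difference thresholds of 0.5, 1, 2, and 4 Å. It is not the OpenStructure implementation, which works on all atoms and applies extra checks, so use the official tool for reported numbers. The function name ca_lddt is hypothetical.

```python
import math


def ca_lddt(ref, model, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified global lDDT over C-alpha coordinates.

    `ref` and `model` are equal-length lists of (x, y, z) tuples, one per
    residue, already matched by residue index. No superposition is needed
    because only internal distances are compared.
    """
    if len(ref) != len(model):
        raise ValueError("reference and model must have the same length")
    preserved = 0
    total = 0
    for i in range(len(ref)):
        for j in range(i + 1, len(ref)):
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= radius:
                continue  # only pairs that are local in the reference count
            diff = abs(d_ref - math.dist(model[i], model[j]))
            total += len(thresholds)
            preserved += sum(diff < t for t in thresholds)
    return preserved / total if total else 0.0


# Three residues on a line; the model bends the last one out of place.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 3.0, 0.0)]
print(ca_lddt(reference, reference), ca_lddt(reference, predicted))
```

A model identical to the reference scores 1.0, and the score drops as local distances deviate.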

Ensemble

Run the following command to get a .json file with the final results.

python ensemble.py --predicted_pdb_root_dir ./af2/casp15/orphan/A1T3R1.5/
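
This README does not describe what ensemble.py computes. As a rough illustration of the general idea of combining several predictions for one target, the sketch below ranks the PDB files under a directory by mean pLDDT, which AlphaFold2 stores in the B-factor column of its output PDB files, and writes the result to a .json file. The selection criterion, the function names, and the JSON layout are assumptions for illustration and are not taken from ensemble.py.

```python
import json
from pathlib import Path


def mean_plddt(pdb_path):
    """Average the B-factor column (pLDDT in AlphaFold2 output) over CA atoms."""
    values = []
    with open(pdb_path) as handle:
        for line in handle:
            # Fixed-width PDB format: atom name in columns 13-16,
            # B-factor in columns 61-66 (1-based).
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                values.append(float(line[60:66]))
    return sum(values) / len(values) if values else 0.0


def rank_predictions(root_dir, out_json):
    """Score every .pdb file under root_dir and record the most confident one."""
    root = Path(root_dir)
    scores = {
        str(path.relative_to(root)): mean_plddt(path)
        for path in sorted(root.rglob("*.pdb"))
    }
    best = max(scores, key=scores.get) if scores else None
    result = {"best": best, "scores": scores}
    Path(out_json).write_text(json.dumps(result, indent=2))
    return result
```

A call such as rank_predictions("./af2/casp15/orphan/A1T3R1.5/", "results.json") would then produce a file listing each prediction's mean pLDDT and the top-ranked one; results.json is a placeholder name.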

:paperclip: Citation

@misc{zhang2023enhancing,
      title={Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation}, 
      author={Le Zhang and Jiayang Chen and Tao Shen and Yu Li and Siqi Sun},
      year={2023},
      eprint={2306.01824},
      archivePrefix={arXiv},
      primaryClass={q-bio.QM}
}

:email: Contact

Please let us know if you have further questions or comments. You can reach us at le.zhang@mila.quebec.