PoET: A generative model of protein families as sequences-of-sequences
This repo contains inference code for "PoET: A generative model of protein families as sequences-of-sequences", a state-of-the-art protein language model for variant effect prediction and conditional sequence generation.
Environment Setup
- Have mamba (a faster alternative to conda) installed (Instructions)
- Have conda-lock installed in your base conda/mamba environment (Instructions)
- Run make create_conda_env. This will create a conda environment named poet.
- Run make download_model to download the model (~400MB). The model will be located at data/poet.ckpt. Please note the license.
Scoring variants
Use the script scripts/score.py to obtain fitness scores for a list of protein variants, given an MSA of homologs of the WT sequence.
- Be on a machine with an NVIDIA GPU. The model cannot run on CPU only.
- Activate the poet conda environment.
- Run the script, replacing the values in angle brackets with the appropriate paths.
    python scripts/score.py \
        --msa_a3m_path <path to MSA of homologs of WT sequence> \
        --variants_fasta_path <path to fasta file containing variants to score> \
        --output_npy_path <path to output file where scores for each variant will be stored as a numpy array>
You can pass a lower value for the batch size (--batch_size) if you run out of VRAM. The script was tested on an A100 GPU with 40GB VRAM.
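If you want to launch the scoring script from Python rather than the shell, the command above can be assembled as an argument list. The following is a minimal sketch, not part of this repo: build_score_command and run_scoring are hypothetical helper names, and the sketch assumes that --batch_size accepts an integer and that the script is run from the repository root.

```python
import subprocess
import sys


def build_score_command(msa_a3m_path, variants_fasta_path, output_npy_path,
                        batch_size=None):
    """Assemble the argument list for scripts/score.py (hypothetical helper)."""
    command = [
        sys.executable, "scripts/score.py",
        "--msa_a3m_path", str(msa_a3m_path),
        "--variants_fasta_path", str(variants_fasta_path),
        "--output_npy_path", str(output_npy_path),
    ]
    # Assumption: --batch_size takes an integer. Omit the flag to keep the
    # script's own default.
    if batch_size is not None:
        command += ["--batch_size", str(batch_size)]
    return command


def run_scoring(*args, **kwargs):
    """Run the scoring script and raise CalledProcessError if it fails."""
    subprocess.run(build_score_command(*args, **kwargs), check=True)
```

For example, run_scoring("my.a3m", "variants.fasta", "scores.npy", batch_size=8) would retry a run that previously ran out of VRAM with a smaller batch size; the value 8 is only an illustration.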
Example
Run the scoring script without arguments (python scripts/score.py) to score variants in the BLAT_ECOLX_Jacquier_2013 dataset from ProteinGym.
- the dataset is located at data/BLAT_ECOLX_Jacquier_2013.csv
- the variants to score, as a fasta file, are located at data/BLAT_ECOLX_Jacquier_2013_variants.fasta
- the MSA of homologs of the WT sequence, generated using ColabFold MMseqs2 with the UniRef2202 database, is located at data/BLAT_ECOLX_ColabFold_2202.a3m
- the scores will be saved as a numpy array at data/BLAT_ECOLX_Jacquier_2013_variants.npy
The scores produced by the script should achieve a Spearman correlation of >0.65 with the measured fitness (DMS_score column in the dataset file).
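One way to check this is to compare the saved scores against the DMS_score column directly. The sketch below is not part of this repo and rests on two assumptions: that the .npy file holds a one-dimensional array of scores, and that those scores are in the same order as the rows of the dataset CSV. The helpers average_ranks, spearman and load_dms_scores are hypothetical names; scipy.stats.spearmanr computes the same statistic if SciPy is installed.

```python
import csv
from pathlib import Path


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    rx, ry = average_ranks(list(x)), average_ranks(list(y))
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def load_dms_scores(csv_path):
    """Read the DMS_score column of the dataset file as floats."""
    with open(csv_path, newline="") as handle:
        return [float(row["DMS_score"]) for row in csv.DictReader(handle)]


scores_path = Path("data/BLAT_ECOLX_Jacquier_2013_variants.npy")
dataset_path = Path("data/BLAT_ECOLX_Jacquier_2013.csv")
if scores_path.exists() and dataset_path.exists():
    import numpy as np  # only needed to read the .npy file

    # Assumption: scores are a 1-D array aligned with the CSV rows.
    predicted = np.load(scores_path).tolist()
    measured = load_dms_scores(dataset_path)
    print(f"Spearman correlation: {spearman(predicted, measured):.3f}")
```

The comparison only runs when both files exist, so the block can be pasted into a session before the scoring script has finished.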
Citation
You may cite the paper as
@inproceedings{NEURIPS2023_f4366126,
author = {Truong Jr, Timothy and Bepler, Tristan},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
pages = {77379--77415},
publisher = {Curran Associates, Inc.},
title = {PoET: A generative model of protein families as sequences-of-sequences},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/f4366126eba252699b280e8f93c0ab2f-Paper-Conference.pdf},
volume = {36},
year = {2023}
}
License
This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.
The PoET model weights (DOI: 10.5281/zenodo.10061322) are available under the CC BY-NC-SA 4.0 license for academic use only. The license can also be found in the LICENSE file provided with the model weights. For commercial use, please reach out to us at contact@ne47.bio about licensing. Copyright (c) NE47 Bio, Inc. All Rights Reserved.