# Pretrained METL models
This repository contains pretrained METL models with minimal dependencies. For more information, please see the metl repository and our manuscript:
Biophysics-based protein language models for protein engineering.
Sam Gelman, Bryce Johnson, Chase Freschlin, Sameer D'Costa, Anthony Gitter<sup>+</sup>, Philip A Romero<sup>+</sup>.
bioRxiv, 2024. doi:10.1101/2024.03.15.585128
<sup>+</sup> denotes equal contribution.
## Getting started
- Create a conda environment (or use an existing one):
  ```
  conda create --name myenv python=3.9
  ```
- Activate the conda environment:
  ```
  conda activate myenv
  ```
- Clone this repository.
- Navigate to the cloned repository:
  ```
  cd metl-pretrained
  ```
- Install the package:
  ```
  pip install .
  ```
- Import the package in your script:
  ```python
  import metl
  ```
- Load a pretrained model:
  ```python
  model, data_encoder = metl.get_from_uuid(uuid)
  ```
  or use one of the other loading functions (see the examples below).
  - `model` is a PyTorch model loaded with the pretrained weights.
  - `data_encoder` is a helper object that encodes sequences and variants to be fed into the model.
## Available models
Model checkpoints are available to download from Zenodo. Once you have downloaded a checkpoint, you can load it into a PyTorch model using `metl.get_from_checkpoint()`. Alternatively, you can use `metl.get_from_uuid()` or `metl.get_from_ident()` to automatically download, cache, and load the model based on the model identifier or UUID. See the examples below.
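If you have already downloaded a checkpoint, loading it might look like the following minimal sketch. The file path here is a placeholder for wherever you saved the Zenodo checkpoint, and we assume `metl.get_from_checkpoint()` returns the same `(model, data_encoder)` pair as the other loading functions:

```python
import metl

# load a model from a local checkpoint file downloaded from Zenodo
# (placeholder path; point this at your downloaded checkpoint)
model, data_encoder = metl.get_from_checkpoint("checkpoints/METL-G-20M-1D.pt")
```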
### Source models
Source models predict Rosetta energy terms.
#### Global source models
Identifier | UUID | Params | RPE | Output | Description | Download |
---|---|---|---|---|---|---|
METL-G-20M-1D | D72M9aEp | 20M | 1D | Rosetta energies | METL-G | Download |
METL-G-20M-3D | Nr9zCKpR | 20M | 3D | Rosetta energies | METL-G | Download |
METL-G-50M-1D | auKdzzwX | 50M | 1D | Rosetta energies | METL-G | Download |
METL-G-50M-3D | 6PSAzdfv | 50M | 3D | Rosetta energies | METL-G | Download |
#### Local source models
Identifier | UUID | Protein | Params | RPE | Output | Description | Download |
---|---|---|---|---|---|---|---|
METL-L-2M-1D-GFP | 8gMPQJy4 | avGFP | 2M | 1D | Rosetta energies | METL-L | Download |
METL-L-2M-3D-GFP | Hr4GNHws | avGFP | 2M | 3D | Rosetta energies | METL-L | Download |
METL-L-2M-1D-DLG4_2022 | 8iFoiYw2 | DLG4 | 2M | 1D | Rosetta energies | METL-L | Download |
METL-L-2M-3D-DLG4_2022 | kt5DdWTa | DLG4 | 2M | 3D | Rosetta energies | METL-L | Download |
METL-L-2M-1D-GB1 | DMfkjVzT | GB1 | 2M | 1D | Rosetta energies | METL-L | Download |
METL-L-2M-3D-GB1 | epegcFiH | GB1 | 2M | 3D | Rosetta energies | METL-L | Download |
METL-L-2M-1D-GRB2 | kS3rUS7h | GRB2 | 2M | 1D | Rosetta energies | METL-L | Download |
METL-L-2M-3D-GRB2 | X7w83g6S | GRB2 | 2M | 3D | Rosetta energies | METL-L | Download |
METL-L-2M-1D-Pab1 | UKebCQGz | Pab1 | 2M | 1D | Rosetta energies | METL-L | Download |
METL-L-2M-3D-Pab1 | 2rr8V4th | Pab1 | 2M | 3D | Rosetta energies | METL-L | Download |
METL-L-2M-1D-TEM-1 | PREhfC22 | TEM-1 | 2M | 1D | Rosetta energies | METL-L | Download |
METL-L-2M-3D-TEM-1 | 9ASvszux | TEM-1 | 2M | 3D | Rosetta energies | METL-L | Download |
METL-L-2M-1D-Ube4b | HscFFkAb | Ube4b | 2M | 1D | Rosetta energies | METL-L | Download |
METL-L-2M-3D-Ube4b | H48oiNZN | Ube4b | 2M | 3D | Rosetta energies | METL-L | Download |
These models output a length-55 vector corresponding to the following energy terms (in order):

<details>
<summary>Expand to see energy terms</summary>

- total_score
- fa_atr
- fa_dun
- fa_elec
- fa_intra_rep
- fa_intra_sol_xover4
- fa_rep
- fa_sol
- hbond_bb_sc
- hbond_lr_bb
- hbond_sc
- hbond_sr_bb
- lk_ball_wtd
- omega
- p_aa_pp
- pro_close
- rama_prepro
- ref
- yhh_planarity
- buried_all
- buried_np
- contact_all
- contact_buried_core
- contact_buried_core_boundary
- degree
- degree_core
- degree_core_boundary
- exposed_hydrophobics
- exposed_np_AFIMLWVY
- exposed_polars
- exposed_total
- one_core_each
- pack
- res_count_buried_core
- res_count_buried_core_boundary
- res_count_buried_np_core
- res_count_buried_np_core_boundary
- ss_contributes_core
- ss_mis
- total_hydrophobic
- total_hydrophobic_AFILMVWY
- total_sasa
- two_core_each
- unsat_hbond
- centroid_total_score
- cbeta
- cenpack
- env
- hs_pair
- pair
- rg
- rsigma
- sheet
- ss_pair
- vdw

</details>
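If you want to work with the output by term name, here is a minimal sketch that pairs a prediction row with the 55 terms above. The `ENERGY_TERMS` list is transcribed from this README in output order; the helper function itself is hypothetical, not part of the `metl` package:

```python
import torch

# the 55 energy terms above, transcribed in output order
ENERGY_TERMS = [
    "total_score", "fa_atr", "fa_dun", "fa_elec", "fa_intra_rep",
    "fa_intra_sol_xover4", "fa_rep", "fa_sol", "hbond_bb_sc", "hbond_lr_bb",
    "hbond_sc", "hbond_sr_bb", "lk_ball_wtd", "omega", "p_aa_pp", "pro_close",
    "rama_prepro", "ref", "yhh_planarity", "buried_all", "buried_np",
    "contact_all", "contact_buried_core", "contact_buried_core_boundary",
    "degree", "degree_core", "degree_core_boundary", "exposed_hydrophobics",
    "exposed_np_AFIMLWVY", "exposed_polars", "exposed_total", "one_core_each",
    "pack", "res_count_buried_core", "res_count_buried_core_boundary",
    "res_count_buried_np_core", "res_count_buried_np_core_boundary",
    "ss_contributes_core", "ss_mis", "total_hydrophobic",
    "total_hydrophobic_AFILMVWY", "total_sasa", "two_core_each", "unsat_hbond",
    "centroid_total_score", "cbeta", "cenpack", "env", "hs_pair", "pair",
    "rg", "rsigma", "sheet", "ss_pair", "vdw",
]

def label_energies(pred_row: torch.Tensor) -> dict:
    """Map one length-55 prediction vector to a {term: value} dict."""
    return dict(zip(ENERGY_TERMS, pred_row.tolist()))
```

For example, `label_energies(predictions[0])["total_score"]` would pull the predicted `total_score` for the first sequence in the source model example below.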
#### Function-specific source models for GB1
The GB1 experimental data measured the binding interaction between GB1 variants and Immunoglobulin G (IgG). To match this experimentally characterized function, we implemented a Rosetta pipeline to model the GB1-IgG complex and compute 17 attributes related to energy changes upon binding. We pretrained a standard METL-Local model and a modified METL-Bind model, which additionally incorporates the IgG binding attributes into its pretraining tasks.
Identifier | UUID | Protein | Params | RPE | Output | Description | Download |
---|---|---|---|---|---|---|---|
METL-BIND-2M-3D-GB1-STANDARD | K6mw24Rg | GB1 | 2M | 3D | Standard Rosetta energies | Trained for the function-specific synthetic data experiment, but only on the standard energy terms, to serve as a baseline. Should perform similarly to METL-L-2M-3D-GB1. | Download |
METL-BIND-2M-3D-GB1-BINDING | Bo5wn2SG | GB1 | 2M | 3D | Standard + binding Rosetta energies | Trained on both the standard energy terms and the binding-specific energy terms. | Download |
`METL-BIND-2M-3D-GB1-BINDING` predicts the standard energy terms listed above as well as the following binding energy terms (in order):

<details>
<summary>Expand to see binding energy terms</summary>

- complex_normalized
- dG_cross
- dG_cross/dSASAx100
- dG_separated
- dG_separated/dSASAx100
- dSASA_hphobic
- dSASA_int
- dSASA_polar
- delta_unsatHbonds
- hbond_E_fraction
- hbonds_int
- nres_int
- per_residue_energy_int
- side1_normalized
- side1_score
- side2_normalized
- side2_score

</details>
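The ordering above suggests that the binding model's output places the 55 standard terms first, followed by these 17 binding terms. Under that assumption (which is not confirmed here), splitting one output row might look like this sketch:

```python
import torch

def split_binding_output(pred_row: torch.Tensor):
    """Split a METL-BIND output row into standard and binding energy terms.

    Assumes the 17 binding terms are appended after the 55 standard terms,
    matching the ordering described in this README.
    """
    standard, binding = pred_row[:55], pred_row[55:]
    return standard, binding
```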
### Target models
Target models are fine-tuned source models that predict functional scores from experimental sequence-function data.
DMS Dataset | Identifier | UUID | RPE | Output | Description | Download |
---|---|---|---|---|---|---|
avGFP | None | YoQkzoLD | 1D | Functional score | The METL-L-2M-1D-GFP model, fine-tuned on 64 examples from the avGFP DMS dataset. This model was used for the GFP design experiment described in the manuscript. | Download |
avGFP | None | PEkeRuxb | 3D | Functional score | The METL-L-2M-3D-GFP model, fine-tuned on 64 examples from the avGFP DMS dataset. This model was used for the GFP design experiment described in the manuscript. | Download |
## 3D Relative Position Embeddings
METL uses relative position embeddings (RPEs) based on 3D protein structure. Our implementation of RPEs is similar to that of Shaw et al., but instead of using the default 1D sequence-based distances, we calculate relative distances based on a graph of the 3D protein structure. These 3D RPEs enable the transformer to use 3D distances between amino acid residues as the positional signal when computing attention. A model with 3D RPEs requires a protein structure in the form of a PDB file, corresponding to the wild-type or base protein of the input variant sequence.
Our testing showed that 3D RPEs improve performance for METL-Global models but do not make a difference for METL-Local models. We provide both 1D and 3D models in this repository. The 1D models do not require the PDB structure as an additional input.
## Examples
### METL source model
METL source models are assigned identifiers that can be used to load the model with `metl.get_from_ident()`.

This example:

- Automatically downloads and caches `METL-G-20M-1D` using `metl.get_from_ident("metl-g-20m-1d")`.
- Encodes a pair of dummy amino acid sequences using `data_encoder.encode_sequences()`.
- Runs the sequences through the model and prints the predicted Rosetta energies.
Todo: show how to extract the METL representation at different layers of the network
```python
import metl
import torch

model, data_encoder = metl.get_from_ident("metl-g-20m-1d")

# these are amino acid sequences
# make sure all the sequences are the same length
dummy_sequences = ["SMART", "MAGIC"]
encoded_seqs = data_encoder.encode_sequences(dummy_sequences)

# set model to eval mode
model.eval()
# no need to compute gradients for inference
with torch.no_grad():
    predictions = model(torch.tensor(encoded_seqs))

print(predictions)
```
If you are using a model with 3D relative position embeddings, you will need to provide the PDB structure of the wild-type or base protein.
```python
predictions = model(torch.tensor(encoded_seqs), pdb_fn="../path/to/file.pdb")
```
### METL target model
METL target models can be loaded using the model's UUID and `metl.get_from_uuid()`.

This example:

- Automatically downloads and caches `YoQkzoLD` using `metl.get_from_uuid(uuid="YoQkzoLD")`.
- Encodes several variants specified in variant notation. A wild-type sequence is needed to encode variants.
- Runs the sequences through the model and prints the predicted DMS scores.
```python
import metl
import torch

model, data_encoder = metl.get_from_uuid(uuid="YoQkzoLD")

# the GFP wild-type sequence
wt = "SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLSYGVQCFSRYPDHMKQ" \
     "HDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKN" \
     "GIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"

# some example GFP variants to compute the scores for
variants = ["E3K,G102S",
            "T36P,S203T,K207R",
            "V10A,D19G,F25S,E113V"]
encoded_variants = data_encoder.encode_variants(wt, variants)

# set model to eval mode
model.eval()
# no need to compute gradients for inference
with torch.no_grad():
    predictions = model(torch.tensor(encoded_variants))

print(predictions)
```
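As a short follow-up, you can pair each variant with its predicted score. This sketch assumes the target model outputs one functional score per input variant:

```python
# pair each variant with its predicted functional score
# (assumes predictions has one score per variant)
for variant, score in zip(variants, predictions.squeeze(-1).tolist()):
    print(f"{variant}: {score:.3f}")
```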