Pretrained METL models

This repository contains pretrained METL models with minimal dependencies. For more information, please see the metl repository and our manuscript:

Biophysics-based protein language models for protein engineering.
Sam Gelman, Bryce Johnson, Chase Freschlin, Sameer D'Costa, Anthony Gitter<sup>+</sup>, Philip A Romero<sup>+</sup>.
bioRxiv, 2024. doi:10.1101/2024.03.15.585128
<sup>+</sup> denotes equal contribution.

Getting started

  1. Create a conda environment (or use an existing one): conda create --name myenv python=3.9
  2. Activate the conda environment: conda activate myenv
  3. Clone this repository
  4. Navigate to the cloned repository: cd metl-pretrained
  5. Install the package: pip install .
  6. Import the package in your script: import metl
  7. Load a pretrained model using model, data_encoder = metl.get_from_uuid(uuid) or one of the other loading functions (see examples below)
    • model is a PyTorch model loaded with the pretrained weights
    • data_encoder is a helper object that can be used to encode sequences and variants to be fed into the model

Available models

Model checkpoints are available to download from Zenodo. Once you have a checkpoint downloaded, you can load it into a PyTorch model using metl.get_from_checkpoint(). Alternatively, you can use metl.get_from_uuid() or metl.get_from_ident() to automatically download, cache, and load the model based on the model identifier or UUID. See the examples below.

Source models

Source models predict Rosetta energy terms.

Global source models

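| Identifier | UUID | Params | RPE | Output | Description | Download |
|---|---|---|---|---|---|---|
| METL-G-20M-1D | D72M9aEp | 20M | 1D | Rosetta energies | METL-G | Download |
| METL-G-20M-3D | Nr9zCKpR | 20M | 3D | Rosetta energies | METL-G | Download |
| METL-G-50M-1D | auKdzzwX | 50M | 1D | Rosetta energies | METL-G | Download |
| METL-G-50M-3D | 6PSAzdfv | 50M | 3D | Rosetta energies | METL-G | Download |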

Local source models

| Identifier | UUID | Protein | Params | RPE | Output | Description | Download |
|---|---|---|---|---|---|---|---|
| METL-L-2M-1D-GFP | 8gMPQJy4 | avGFP | 2M | 1D | Rosetta energies | METL-L | Download |
| METL-L-2M-3D-GFP | Hr4GNHws | avGFP | 2M | 3D | Rosetta energies | METL-L | Download |
| METL-L-2M-1D-DLG4_2022 | 8iFoiYw2 | DLG4 | 2M | 1D | Rosetta energies | METL-L | Download |
| METL-L-2M-3D-DLG4_2022 | kt5DdWTa | DLG4 | 2M | 3D | Rosetta energies | METL-L | Download |
| METL-L-2M-1D-GB1 | DMfkjVzT | GB1 | 2M | 1D | Rosetta energies | METL-L | Download |
| METL-L-2M-3D-GB1 | epegcFiH | GB1 | 2M | 3D | Rosetta energies | METL-L | Download |
| METL-L-2M-1D-GRB2 | kS3rUS7h | GRB2 | 2M | 1D | Rosetta energies | METL-L | Download |
| METL-L-2M-3D-GRB2 | X7w83g6S | GRB2 | 2M | 3D | Rosetta energies | METL-L | Download |
| METL-L-2M-1D-Pab1 | UKebCQGz | Pab1 | 2M | 1D | Rosetta energies | METL-L | Download |
| METL-L-2M-3D-Pab1 | 2rr8V4th | Pab1 | 2M | 3D | Rosetta energies | METL-L | Download |
| METL-L-2M-1D-TEM-1 | PREhfC22 | TEM-1 | 2M | 1D | Rosetta energies | METL-L | Download |
| METL-L-2M-3D-TEM-1 | 9ASvszux | TEM-1 | 2M | 3D | Rosetta energies | METL-L | Download |
| METL-L-2M-1D-Ube4b | HscFFkAb | Ube4b | 2M | 1D | Rosetta energies | METL-L | Download |
| METL-L-2M-3D-Ube4b | H48oiNZN | Ube4b | 2M | 3D | Rosetta energies | METL-L | Download |

These models output a length-55 vector corresponding to the following energy terms (in order):

<details> <summary> Expand to see energy terms </summary>
total_score
fa_atr
fa_dun
fa_elec
fa_intra_rep
fa_intra_sol_xover4
fa_rep
fa_sol
hbond_bb_sc
hbond_lr_bb
hbond_sc
hbond_sr_bb
lk_ball_wtd
omega
p_aa_pp
pro_close
rama_prepro
ref
yhh_planarity
buried_all
buried_np
contact_all
contact_buried_core
contact_buried_core_boundary
degree
degree_core
degree_core_boundary
exposed_hydrophobics
exposed_np_AFIMLWVY
exposed_polars
exposed_total
one_core_each
pack
res_count_buried_core
res_count_buried_core_boundary
res_count_buried_np_core
res_count_buried_np_core_boundary
ss_contributes_core
ss_mis
total_hydrophobic
total_hydrophobic_AFILMVWY
total_sasa
two_core_each
unsat_hbond
centroid_total_score
cbeta
cenpack
env
hs_pair
pair
rg
rsigma
sheet
ss_pair
vdw
</details>
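
Because the output is an ordered vector, it can help to pair each value with its term name. The following is a minimal sketch, not part of the metl package: `ENERGY_TERMS` and `label_energies` are hypothetical names, and the example row is a stand-in for one row of model output (for instance `predictions[0].tolist()`). It covers only the 55 standard terms listed above.

```python
# The 55 Rosetta energy terms, in the output order listed above.
ENERGY_TERMS = """
total_score fa_atr fa_dun fa_elec fa_intra_rep fa_intra_sol_xover4 fa_rep fa_sol
hbond_bb_sc hbond_lr_bb hbond_sc hbond_sr_bb lk_ball_wtd omega p_aa_pp pro_close
rama_prepro ref yhh_planarity buried_all buried_np contact_all contact_buried_core
contact_buried_core_boundary degree degree_core degree_core_boundary
exposed_hydrophobics exposed_np_AFIMLWVY exposed_polars exposed_total one_core_each
pack res_count_buried_core res_count_buried_core_boundary res_count_buried_np_core
res_count_buried_np_core_boundary ss_contributes_core ss_mis total_hydrophobic
total_hydrophobic_AFILMVWY total_sasa two_core_each unsat_hbond centroid_total_score
cbeta cenpack env hs_pair pair rg rsigma sheet ss_pair vdw
""".split()


def label_energies(row):
    """Pair one 55-value prediction row with its energy term names (hypothetical helper)."""
    values = list(row)
    if len(values) != len(ENERGY_TERMS):
        raise ValueError(f"expected {len(ENERGY_TERMS)} values, got {len(values)}")
    return dict(zip(ENERGY_TERMS, values))


# Stand-in for one row of model output; real values would come from the model.
example_row = [float(i) for i in range(55)]
labeled = label_energies(example_row)
print(labeled["total_score"], labeled["vdw"])
```

The length check guards against passing output from a model with a different number of outputs, such as the binding model described below.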

Function-specific source models for GB1

The GB1 experimental data measured the binding interaction between GB1 variants and Immunoglobulin G (IgG). To match this experimentally characterized function, we implemented a Rosetta pipeline to model the GB1-IgG complex and compute 17 attributes related to energy changes upon binding. We pretrained a standard METL-Local model and a modified METL-Bind model, which additionally incorporates the IgG binding attributes into its pretraining tasks.

| Identifier | UUID | Protein | Params | RPE | Output | Description | Download |
|---|---|---|---|---|---|---|---|
| METL-BIND-2M-3D-GB1-STANDARD | K6mw24Rg | GB1 | 2M | 3D | Standard Rosetta energies | Trained for the function-specific synthetic data experiment, but only trained on the standard energy terms, to use as a baseline. Should perform similarly to METL-L-2M-3D-GB1. | Download |
| METL-BIND-2M-3D-GB1-BINDING | Bo5wn2SG | GB1 | 2M | 3D | Standard + binding Rosetta energies | Trained on both the standard energy terms and the binding-specific energy terms. | Download |

METL-BIND-2M-3D-GB1-BINDING predicts the standard energy terms listed above as well as the following binding energy terms (in order):

<details> <summary> Expand to see binding energy terms </summary>
complex_normalized
dG_cross
dG_cross/dSASAx100
dG_separated
dG_separated/dSASAx100
dSASA_hphobic
dSASA_int
dSASA_polar
delta_unsatHbonds
hbond_E_fraction
hbonds_int
nres_int
per_residue_energy_int
side1_normalized
side1_score
side2_normalized
side2_score
</details>

Target models

Target models are fine-tuned source models that predict functional scores from experimental sequence-function data.

| DMS Dataset | Identifier | UUID | RPE | Output | Description | Download |
|---|---|---|---|---|---|---|
| avGFP | None | YoQkzoLD | 1D | Functional score | The METL-L-2M-1D-GFP model, fine-tuned on 64 examples from the avGFP DMS dataset. This model was used for the GFP design experiment described in the manuscript. | Download |
| avGFP | None | PEkeRuxb | 3D | Functional score | The METL-L-2M-3D-GFP model, fine-tuned on 64 examples from the avGFP DMS dataset. This model was used for the GFP design experiment described in the manuscript. | Download |

3D Relative Position Embeddings

METL uses relative position embeddings (RPEs) based on 3D protein structure. The implementation of relative position embeddings is similar to the original paper by Shaw et al. However, instead of using the default 1D sequence-based distances, we calculate relative distances based on a graph of the 3D protein structure. These 3D RPEs enable the transformer to use 3D distances between amino acid residues as the positional signal when calculating attention. When using 3D RPEs, the model requires a protein structure in the form of a PDB file, corresponding to the wild-type protein or base protein of the input variant sequence.

Our testing showed that 3D RPEs improve performance for METL-Global models but do not make a difference for METL-Local models. We provide both 1D and 3D models in this repository. The 1D models do not require the PDB structure as an additional input.
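
To illustrate the difference between the two kinds of relative distance, here is a rough sketch in plain Python. It is not METL's implementation. The function names, the contact threshold, the clipping value, and the toy coordinates are all assumptions chosen for illustration. The 1D case uses clipped sequence offsets in the style of Shaw et al., while the 3D case uses shortest-path hop counts in a residue contact graph.

```python
from collections import deque
from math import dist


def sequence_distances(n, clip):
    """1D case: signed offset j - i along the sequence, clipped to [-clip, clip]."""
    return [[max(-clip, min(clip, j - i)) for j in range(n)] for i in range(n)]


def structure_distances(coords, threshold, clip):
    """3D case: hop count between residues in a contact graph, capped at clip."""
    n = len(coords)
    # Connect residues whose representative atoms lie within `threshold` (assumed cutoff).
    neighbors = [
        [j for j in range(n) if j != i and dist(coords[i], coords[j]) <= threshold]
        for i in range(n)
    ]
    table = []
    for start in range(n):
        hops = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search yields shortest hop counts
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in hops:
                    hops[nxt] = hops[node] + 1
                    queue.append(nxt)
        # Unreachable and distant residues share the largest bucket.
        table.append([min(hops.get(j, clip), clip) for j in range(n)])
    return table


# Toy hairpin: six residues on a U-shaped path, 4 units between sequence neighbors.
coords = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (8, 4, 0), (4, 4, 0), (0, 4, 0)]

print(sequence_distances(len(coords), clip=3)[0])             # [0, 1, 2, 3, 3, 3]
print(structure_distances(coords, threshold=4.5, clip=3)[0])  # [0, 1, 2, 3, 2, 1]
```

In this toy hairpin, the first and last residues are far apart in sequence but adjacent in space, so the graph-based distance places them one hop apart.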

Examples

METL source model

METL source models are assigned identifiers that can be used to load the model with metl.get_from_ident().

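This example loads METL-G-20M-1D with metl.get_from_ident(), encodes two dummy amino acid sequences with data_encoder.encode_sequences(), and prints the predicted Rosetta energies.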

Todo: show how to extract the METL representation at different layers of the network

import metl
import torch

model, data_encoder = metl.get_from_ident("metl-g-20m-1d")

# these are amino acid sequences
# make sure all the sequences are the same length
dummy_sequences = ["SMART", "MAGIC"]
encoded_seqs = data_encoder.encode_sequences(dummy_sequences)

# set model to eval mode
model.eval()
# no need to compute gradients for inference
with torch.no_grad():
    predictions = model(torch.tensor(encoded_seqs))
    
print(predictions)

If you are using a model with 3D relative position embeddings, you will need to provide the PDB structure of the wild-type or base protein.

predictions = model(torch.tensor(encoded_seqs), pdb_fn="../path/to/file.pdb")
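
The example above notes that all sequences must be the same length. A small pre-check like the following can catch mismatches before encoding. This is a hypothetical helper, not part of the metl package, and it assumes inputs are restricted to the 20 standard amino acids; whether the data encoder accepts other characters is not covered here.

```python
# The 20 standard amino acids (assumed to be the accepted alphabet).
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def check_sequences(sequences):
    """Return the shared sequence length, or raise ValueError if the batch is invalid."""
    if not sequences:
        raise ValueError("no sequences given")
    lengths = {len(seq) for seq in sequences}
    if len(lengths) != 1:
        raise ValueError(f"sequences have differing lengths: {sorted(lengths)}")
    for seq in sequences:
        unknown = set(seq) - STANDARD_AMINO_ACIDS
        if unknown:
            raise ValueError(f"{seq!r} contains non-standard characters: {sorted(unknown)}")
    return lengths.pop()


print(check_sequences(["SMART", "MAGIC"]))  # 5
```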

METL target model

METL target models can be loaded using the model's UUID and metl.get_from_uuid().

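This example loads the 1D GFP target model (UUID YoQkzoLD) with metl.get_from_uuid(), encodes three GFP variants relative to the wild-type sequence with data_encoder.encode_variants(), and prints the predicted functional scores.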

import metl
import torch

model, data_encoder = metl.get_from_uuid(uuid="YoQkzoLD")

# the GFP wild-type sequence
wt = "SKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLSYGVQCFSRYPDHMKQ" \
     "HDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKN" \
     "GIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK"

# some example GFP variants to compute the scores for
variants = ["E3K,G102S",
            "T36P,S203T,K207R",
            "V10A,D19G,F25S,E113V"]

encoded_variants = data_encoder.encode_variants(wt, variants)

# set model to eval mode
model.eval()
# no need to compute gradients for inference
with torch.no_grad():
    predictions = model(torch.tensor(encoded_variants))

print(predictions)
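
Each variant string above is a comma-separated list of substitutions in the form wild-type residue, position, new residue. In the example, E3K matches the E at 0-based index 3 of wt, which suggests 0-based positions; that reading is inferred from the example rather than from documented behavior. The sketch below uses a hypothetical helper, `apply_variant`, to show how such a string maps to a full mutated sequence. It is independent of the metl package and uses a short toy sequence.

```python
def apply_variant(wt, variant):
    """Apply substitutions such as "E3K,G8A" to wt, assuming 0-based positions."""
    seq = list(wt)
    for mutation in variant.split(","):
        wt_aa, position, new_aa = mutation[0], int(mutation[1:-1]), mutation[-1]
        if not 0 <= position < len(seq):
            raise ValueError(f"{mutation}: position {position} is outside the sequence")
        if wt[position] != wt_aa:
            raise ValueError(
                f"{mutation}: expected {wt_aa} at position {position}, found {wt[position]}"
            )
        seq[position] = new_aa
    return "".join(seq)


# Toy wild type: the first ten residues of the GFP sequence above.
toy_wt = "SKGEELFTGV"
print(apply_variant(toy_wt, "E3K"))      # SKGKELFTGV
print(apply_variant(toy_wt, "K1R,G8A"))  # SRGEELFTAV
```

Checking the stated wild-type residue against the sequence is a cheap way to catch off-by-one indexing mistakes before running predictions.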