Awesome

Learned ancestral sequence embeddings

Learned ancestral sequence embeddings (LASE) are protein embeddings derived from training a light-weight language model on ancestral protein sequence data.

This repository contains the script used for training a transformer for LASE. It contains raw script for training the language model, training scikit learn regressors on saved representations and evaluating the ruggedness of landscapes resulting from these representations.

Python environment

All scripts (with exception of LoRA fine-tuning) were run using Python 3.8.17 and packages listed in requirements.txt.

Code for LoRA fine-tuning of ESM was run using Python 3.11.4 and packages listed in lora_requirements.txt.

File overview

Data - `data`

Contains the mutation dictionary used for evolution as well as the regressor dataset. Data used in ancestral sequence reconstruction,for PTE and His3p, including extant sequences, ancestral sequences and phylogenetic trees are stored in the directory phylogenetic_data.

Evolution - `evolution`

Contains Python modules for timed in silico evolution using LASE.

Protein Language modelling - `lase_tx`

Contains code required for processing of data for model training/representation extraction, the transformer model used, and for training the model.

Regressor top models - `sklearn_regressors`

Contains modules for training the regressors (SklearnRegressorOptimisation.py), a script for executing a full run-through of the training process (sklearn_Regressor_optimisation.py) as well as bash script used to run this workflow in parallel for different top models.

Ruggedness - `spectral_decomposition`

Contains code for determine the global and local dirichlet energy as well as eigenvalues, eigenvectors and Graph Fourier Transform.

Awesome

Learned ancestral sequence embeddings

Python environment

File overview

Data - data

Evolution - evolution

Protein Language modelling - lase_tx

Regressor top models - sklearn_regressors

Ruggedness - spectral_decomposition

Data - `data`

Evolution - `evolution`

Protein Language modelling - `lase_tx`

Regressor top models - `sklearn_regressors`

Ruggedness - `spectral_decomposition`