Learned ancestral sequence embeddings

Learned ancestral sequence embeddings (LASE) are protein embeddings derived by training a lightweight language model on ancestral protein sequence data.

This repository contains the scripts used to train a transformer for LASE: scripts for training the language model, training scikit-learn regressors on saved representations, and evaluating the ruggedness of the landscapes resulting from these representations.

Python environment

All scripts were run using Python 3.8.17 and packages listed in requirements.txt.

File overview

Data - data

Contains the mutation dictionary used for evolution as well as the regressor dataset. Data used in ancestral sequence reconstruction for PTE and His3p, including extant sequences, ancestral sequences and phylogenetic trees, are stored in the directory phylogenetic_data.

Evolution - evolution

Contains Python modules for timed in silico evolution using LASE.

Protein language modelling - lase_tx

Contains the code for processing data for model training and representation extraction, the transformer model itself, and the training code.

Regressor top models - sklearn_regressors

Contains modules for training the regressors (SklearnRegressorOptimisation.py), a script for executing a full run-through of the training process (sklearn_Regressor_optimisation.py), and a bash script used to run this workflow in parallel for different top models.
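
As a rough illustration of what fitting a top model on saved representations can look like, the sketch below tunes a scikit-learn ridge regressor with cross-validated grid search. It is not the code in SklearnRegressorOptimisation.py: the choice of Ridge, the alpha grid, and the randomly generated X and y (stand-ins for saved embeddings and measured fitness values) are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Placeholder data (assumption): 200 sequences, each with a 16-dimensional
# representation, and a noisy linear target. Real runs would load saved
# representations and measured labels instead.
X = rng.normal(size=(200, 16))
true_weights = rng.normal(size=16)
y = X @ true_weights + 0.1 * rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical hyperparameter grid; 5-fold cross-validation scored by R^2.
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)

# With refit=True (the default), score() evaluates the best estimator.
test_r2 = search.score(X_test, y_test)
print("best alpha:", search.best_params_["alpha"])
print("held-out R^2:", round(test_r2, 3))
```

Swapping Ridge for another scikit-learn regressor only requires changing the estimator and its parameter grid, which is one way a workflow like this could be run in parallel over different top models.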

Ruggedness - spectral_decomposition

Contains code for determining the global and local Dirichlet energy, as well as the eigenvalues, eigenvectors and graph Fourier transform.
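
For readers unfamiliar with these quantities, here is a minimal NumPy sketch of how they can be computed for a signal on a graph. It assumes an unnormalised Laplacian L = D - A and a per-node definition of local energy; the function names are hypothetical and the repository's own definitions in spectral_decomposition may differ (for example in normalisation).

```python
import numpy as np

def graph_laplacian(adjacency):
    """Combinatorial Laplacian L = D - A for a symmetric adjacency matrix."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency

def dirichlet_energy(laplacian, signal):
    """Global Dirichlet energy f^T L f (sum of squared differences over edges)."""
    signal = np.asarray(signal, dtype=float)
    return float(signal @ laplacian @ signal)

def local_dirichlet_energy(adjacency, signal):
    """Per-node energy: sum over neighbours j of A_ij * (f_i - f_j)^2."""
    adjacency = np.asarray(adjacency, dtype=float)
    signal = np.asarray(signal, dtype=float)
    diff = signal[:, None] - signal[None, :]
    return (adjacency * diff**2).sum(axis=1)

def graph_fourier_transform(laplacian, signal):
    """Project the signal onto the Laplacian eigenvectors."""
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    coefficients = eigenvectors.T @ np.asarray(signal, dtype=float)
    return eigenvalues, eigenvectors, coefficients

# Toy example (assumption): a 4-node path graph 0 - 1 - 2 - 3 carrying an
# alternating signal, standing in for fitness values on a sequence graph.
adjacency = np.array(
    [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=float,
)
fitness = np.array([0.0, 1.0, 0.0, 1.0])

laplacian = graph_laplacian(adjacency)
energy = dirichlet_energy(laplacian, fitness)
local = local_dirichlet_energy(adjacency, fitness)
eigenvalues, eigenvectors, coefficients = graph_fourier_transform(laplacian, fitness)

print("global Dirichlet energy:", energy)
print("local Dirichlet energy:", local)
```

A signal that changes sharply between neighbouring nodes has a higher Dirichlet energy, and in the spectral view more of its weight sits on eigenvectors with large eigenvalues. Under these definitions the global energy also equals the sum of each eigenvalue times its squared Fourier coefficient.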