# Learning protein fitness models from evolutionary and assay-labelled data
This repo is a collection of code and scripts for evaluating methods that combine evolutionary and assay-labelled data for protein fitness prediction.
For more details, please see our pre-print *Combining evolutionary and assay-labelled data for protein fitness prediction*.
## Contents
- Repo contents
- System requirements
- Installation
- Demo
- Jackhmmer search
- Fitness data
- Density models
- Predictors
## Repo contents
There are several main components of the repo:
- `data`: Processed protein fitness data. (Only one example data set is provided here due to GitHub repo size constraints. Please download the full data from Dryad doi:10.6078/D1K71B.)
- `alignments`: Processed multiple sequence alignments. (Only one example alignment is provided here due to GitHub repo size constraints. Please download all alignments from Dryad doi:10.6078/D1K71B.)
- `scripts`: Bash and Python scripts for data collection and data analysis.
- `src`: Python code for training and evaluating the methods assessed in the paper. Also includes the evaluation and comparison framework for the different predictors.
- `environment.yml`: Software dependencies for the conda environment.
When running the provided scripts, the outputs will be written to the following directories:
- `inference`: Directory for intermediate files, such as inferred sequence log-likelihoods.
- `results`: Directory for results as CSV files.
## System requirements
### Hardware requirements
Some of the methods, in particular the DeepSequence VAE, UniRep mLSTM, and ESM Transformer, require a GPU for training and inference. The GPU code in this repo has been tested on an NVIDIA Quadro RTX 8000 GPU.
Evaluating all of the methods, each with 20 random seeds across 19 data sets and 10 training setups, would take a long time on a single core. Our evaluation code supports multiprocessing and has been tested on 32 cores.
Storing all intermediate files for all methods and all data sets requires approximately 100 GB of disk space.
### Software requirements
The code has been tested on Ubuntu 18.04.5 LTS (Bionic Beaver) with conda 4.10.0 and Python 3.8.5. The (optional) Slurm scripts have been tested on Slurm 17.11.12. The list of software dependencies is provided in the `environment.yml` file.
## Installation
- Create the conda environment from the `environment.yml` file:

  ```bash
  conda env create -f environment.yml
  ```
- Activate the new conda environment:

  ```bash
  conda activate protein_fitness_prediction
  ```
- Install the plmc package:

  ```bash
  cd $HOME  # or use another directory <directory_to_install_plmc>, and modify scripts/plmc.sh accordingly
  git clone https://github.com/debbiemarkslab/plmc.git
  cd plmc
  make all-openmp
  ```

  The installation should finish in a few minutes.
## Demo
The one-hot linear model is the simplest example, as it only requires assay-labelled data. To evaluate the one-hot linear model on the β-lactamase (BLAT_ECOLX) data with 240 training examples and 20 seeds on a single core:

```bash
python src/evaluate.py BLAT_ECOLX_Ranganathan2015-2500 onehot --n_seeds=20 --n_threads=1 --n_train=240
```
When the program finishes, the results from the 20 runs will be available in the file `results/BLAT_ECOLX_Ranganathan2015-2500/results.csv`.
As another example, involving both evolutionary and assay-labelled data, here is how to evaluate the augmented Potts model on the same protein.
The multiple sequence alignments (MSAs) for all proteins used in our assessment are available in the `alignments` directory. For other proteins, MSAs can be retrieved by jackhmmer search (see the Jackhmmer search section below).
From the MSA, first run plmc to estimate the couplings model:

```bash
bash scripts/plmc.sh BLAT_ECOLX BLAT_ECOLX_Ranganathan2015-2500
```

The resulting models are saved at `inference/BLAT_ECOLX_Ranganathan2015-2500/plmc`.
Then, similar to the one-hot linear model evaluation, run:

```bash
python src/evaluate.py BLAT_ECOLX_Ranganathan2015-2500 ev+onehot --n_seeds=20 --n_threads=1 --n_train=240
```
The evaluation should finish in a few minutes, and all results will be saved to `results/BLAT_ECOLX_Ranganathan2015-2500/results.csv`.
Here, `ev+onehot` refers to the augmented Potts model. Other models and data sets can be evaluated in the same way, as long as the corresponding prerequisite files are present in the inference directory.
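To summarize the runs programmatically, a short pandas sketch such as the one below can help. The column names used here (`predictor` and a metric column called `spearman`) are assumptions about the `results.csv` schema, so inspect the file first.

```python
# Hedged sketch: summarize results across seeds with pandas.
# The column names "predictor" and "spearman" are assumptions about the
# results.csv schema; print df.columns first to see the real one.
import pandas as pd

df = pd.read_csv("results/BLAT_ECOLX_Ranganathan2015-2500/results.csv")
print(df.columns.tolist())  # inspect the actual schema

metric = "spearman"  # hypothetical metric column name
if {"predictor", metric} <= set(df.columns):
    print(df.groupby("predictor")[metric].agg(["mean", "std"]))
```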
## Jackhmmer search
- Download UniRef100 in FASTA format from UniProt.
- Index the UniRef100 FASTA file into SSI format with `esl-sfetch --index <seqdb>.fasta`.
- Set the file location of the FASTA file in `scripts/jackhmmer.sh`.
- To run jackhmmer, use `scripts/jackhmmer.sh` to search the local FASTA file. In addition to running the jackhmmer search, the script also implicitly calls the other file conversion scripts: it extracts target ids from the jackhmmer tabular output by calling `scripts/tblout2ids.py`, converts the FASTA output to a list of sequences with `scripts/fasta2txt.py` (a standalone sketch of this step follows the list), and splits the sequences into train and validation sets with `scripts/randsplit.py`.
- The outputs of the jackhmmer script will be in `jackhmmer/<dataset>/<run_id>`, where each iteration's alignment is saved as `iter-<N>.a2m` and the final alignment is saved as `alignment.a2m`. The list of full-length target sequences is at `target_seqs.fasta` and `target_seqs.txt`.
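For reference, the FASTA-to-text conversion step is conceptually simple. Below is a minimal standalone sketch of what `scripts/fasta2txt.py` plausibly does (one ungapped sequence per output line); it is an assumption about that script's behaviour, not a copy of it.

```python
# Minimal sketch of a FASTA -> one-sequence-per-line conversion, assuming
# the behaviour of scripts/fasta2txt.py (not the actual script).
def fasta2txt(fasta_path, txt_path):
    seqs, current = [], []
    with open(fasta_path) as fin:
        for line in fin:
            line = line.strip()
            if line.startswith(">"):  # a header starts a new record
                if current:
                    seqs.append("".join(current))
                current = []
            elif line:
                # drop alignment gap characters and normalize case
                current.append(line.replace("-", "").replace(".", "").upper())
        if current:
            seqs.append("".join(current))
    with open(txt_path, "w") as fout:
        fout.write("\n".join(seqs) + "\n")
```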
## Fitness data
In the example data set in the `data` directory (and also for all other data sets available on Dryad), each subdirectory (e.g. `data/BLAT_ECOLX_Ranganathan2015-2500`) represents a data set of interest. Each subdirectory contains two key files:

- `wt.fasta` documents the wild-type (WT) sequence.
- `data.csv` contains three columns: `seq`, `log_fitness`, and `n_mut`. `seq` is the mutated sequence and should be the same length as the WT sequence. `log_fitness` is the log enrichment ratio or another log-scale fitness value, where higher is better. (Although referred to as `log_fitness` here, this corresponds to `fitness` in the paper.) `n_mut` is the number of mutations separating the sequence from WT, where 0 indicates WT.
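A quick way to sanity-check a data set against this layout is the short pandas sketch below (pandas is included in the conda environment); the paths follow the example data set.

```python
# Sanity-check a fitness data set against the layout described above.
import pandas as pd

dataset_dir = "data/BLAT_ECOLX_Ranganathan2015-2500"

# Read the WT sequence from wt.fasta (a single record is assumed).
with open(f"{dataset_dir}/wt.fasta") as f:
    wt = "".join(line.strip() for line in f if not line.startswith(">"))

df = pd.read_csv(f"{dataset_dir}/data.csv")
assert {"seq", "log_fitness", "n_mut"} <= set(df.columns)
assert (df["seq"].str.len() == len(wt)).all()  # same length as WT
print(df[df["n_mut"] == 0])  # the WT row(s), if present
```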
## Density models
### Potts model
For learning a Potts model (EVmutation / plmc) from an MSA, see `scripts/plmc.sh`. The resulting couplings model files (saved to the inference directory) can be directly parsed by our corresponding `ev` and `ev+onehot` predictors.
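To make the "augmented" idea concrete: an augmented predictor such as `ev+onehot` combines one-hot sequence features with a single evolutionary density score per sequence in a regularized linear model. The sketch below is a toy illustration of that feature construction using scikit-learn's `Ridge`; the feature layout, regularization, and made-up data are assumptions, not the repo's implementation.

```python
# Toy sketch of the "augmented" feature idea: one-hot sequence features
# plus one density-model score column, fit with ridge regression.
# This is an illustration, not the repo's ev+onehot implementation.
import numpy as np
from sklearn.linear_model import Ridge

AA = "ACDEFGHIKLMNPQRSTVWY"
AA_IDX = {a: i for i, a in enumerate(AA)}

def onehot(seq):
    x = np.zeros(len(seq) * len(AA))
    for i, a in enumerate(seq):
        if a in AA_IDX:
            x[i * len(AA) + AA_IDX[a]] = 1.0
    return x

def augmented_features(seqs, density_scores):
    # density_scores: e.g. plmc delta Hamiltonians, one per sequence
    X = np.stack([onehot(s) for s in seqs])
    return np.hstack([X, np.asarray(density_scores)[:, None]])

# Toy usage with made-up sequences, scores, and fitness labels.
seqs = ["ACDE", "ACDF", "AGDE"]
scores = [0.0, -1.2, -0.3]
y = [1.0, 0.2, 0.6]
model = Ridge(alpha=1.0).fit(augmented_features(seqs, scores), y)
print(model.predict(augmented_features(["ACDE"], [0.0])))
```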
### DeepSequence VAE
- Install the DeepSequence package.
- Set the DeepSequence package directory as `WORKING_DIR` in both `src/train_vae.py` and `src/inference_vae.py`.
- Use `scripts/train_vae.sh` to train a VAE model from an MSA.
- For retrieving ELBOs from trained VAEs, see `scripts/inference_vae.sh`.
- The saved ELBO files in the inference directory can be parsed by the corresponding `vae` and `vae+onehot` predictors.
### ESM
- Follow the instructions in the ESM repo to download the pre-trained model weights.
- Put the location of the downloaded pre-trained weights into `scripts/inference_esm.sh`.
- To retrieve ESM Transformer approximate pseudo-log-likelihoods for the sequences in a FASTA file, see `scripts/inference_esm.sh`. The results will be in the inference directory and can be used by the `esm` and `esm+onehot` predictors. (A hedged sketch of this kind of scoring follows the list.)
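For intuition, a masked-marginal style pseudo-log-likelihood can be computed directly with the `fair-esm` package. The sketch below is one plausible way to do it and is not guaranteed to match the scoring in `scripts/inference_esm.sh`; the choice of checkpoint is an assumption.

```python
# Hedged sketch: masked-marginal pseudo-log-likelihood with fair-esm.
# Not necessarily the exact scoring used by scripts/inference_esm.sh.
import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()  # assumed checkpoint
model.eval()
batch_converter = alphabet.get_batch_converter()

def pseudo_log_likelihood(seq):
    _, _, tokens = batch_converter([("protein", seq)])
    total = 0.0
    with torch.no_grad():
        for i in range(1, len(seq) + 1):  # position 0 is the BOS token
            masked = tokens.clone()
            masked[0, i] = alphabet.mask_idx  # mask one position at a time
            logits = model(masked)["logits"]
            log_probs = torch.log_softmax(logits[0, i], dim=-1)
            total += log_probs[tokens[0, i]].item()  # log p of true residue
    return total

print(pseudo_log_likelihood("MKTAYIAKQR"))
```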
### UniRep
- Download the pre-trained UniRep weights (1900-unit) from the UniRep repo.
- Put the location of the downloaded weights into `scripts/evotune_unirep.sh`.
- Use the `scripts/evotune_unirep.sh` script to evotune the UniRep model with an MSA file as `seqspath`.
- Use `scripts/inference_unirep.sh` to calculate log-likelihoods from an evotuned UniRep model.
## Predictors
Each type of predictor is represented by a Python class in `src/predictors`. A predictor class represents a prediction strategy for protein fitness that depends on evolutionary data, assay-labelled data, or both.

The base predictor class, `BasePredictor`, is defined in `src/predictors/base_predictors.py`. All predictor classes inherit from it and implement the `train` and `predict` methods. The `JointPredictor` class is a meta-predictor that combines the features from multiple existing predictor classes, and can be specified simply by the sub-predictor names. See `src/predictors/__init__.py` for a full list of implemented predictors.
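As a hedged illustration of this interface, a toy predictor following the `train`/`predict` convention might look like the sketch below. The real `BasePredictor` constructor and method signatures may differ; see `src/predictors/base_predictors.py` for the actual interface.

```python
# Toy predictor following the train/predict convention described above.
# The actual BasePredictor signatures may differ from this sketch.
import numpy as np

class MeanPredictor:
    """Baseline that predicts the mean training fitness for any sequence."""

    def train(self, train_seqs, train_labels):
        self.mean_ = float(np.mean(train_labels))

    def predict(self, predict_seqs):
        return np.full(len(predict_seqs), self.mean_)

# Usage:
p = MeanPredictor()
p.train(["ACDE", "ACDF"], [0.5, 1.5])
print(p.predict(["AGDE"]))  # -> [1.0]
```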