VespaG: Expert-Guided Protein Language Models enable Accurate and Blazingly Fast Fitness Prediction
<img align="right" src="images/vespag.png" alt="image" height="20%" width="20%" />VespaG is a blazingly fast single amino acid variant effect predictor, leveraging embeddings of the protein language model ESM-2 (Lin et al. 2022) as input to a minimal deep learning model.
To overcome the sparsity of experimental training data, we created a dataset of 39 million single amino acid variants from a subset of the human proteome, which we then annotated using predictions from the multiple sequence alignment-based effect predictor GEMME (Laine et al. 2019) as a proxy for experimental scores.
Assessed on the ProteinGym (Notin et al. 2023) benchmark, VespaG matches state-of-the-art methods while being several orders of magnitude faster, predicting the entire single-site mutational landscape of a human proteome in under half an hour on a consumer-grade laptop.
More details on VespaG can be found in the corresponding preprint.
Installation
- `conda create -n vespag python==3.10 poetry==1.8.3` (exchange `conda` for `mamba`, `miniconda`, or `micromamba` as you like)
- `conda activate vespag`
- `export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring`
- `poetry install`
Quick Start: Running Inference with VespaG
Run `python -m vespag predict` with the following options:
Required:
- `--input/-i`: Path to FASTA-formatted file containing protein sequence(s).
Optional:
- `--output/-o`: Path for saving created CSV and/or H5 files. Defaults to `./output`.
- `--embeddings/-e`: Path to pre-computed ESM-2 (`esm2_t36_3B_UR50D`) input embeddings. If no path is provided, embeddings are generated from scratch and saved in `./output`. Please note that embedding generation on CPU can be slow.
- `--mutation-file`: CSV file specifying the mutations to score. If not provided, the whole single-site mutational landscape of all input proteins is scored.
- `--id-map`: CSV file mapping embedding IDs (first column) to FASTA IDs (second column) if they differ. It does not have to cover cases with identical IDs.
- `--single-csv`: Whether to return one CSV file for all proteins instead of a single file for each protein.
- `--no-csv`: Whether to suppress CSV output.
- `--h5-output`: Whether to create a file containing predictions in HDF5 format.
- `--zero-idx`: Whether to enumerate protein sequences (both input and output) starting at 0.
- `--transform`: Whether to transform predicted scores to the same distribution as GEMME substitution scores, which fall into a narrower range than VespaG scores, to ease comparability.
- `--normalize`: Whether to transform predicted scores to the [0, 1] interval by applying a sigmoid.
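To illustrate what a sigmoid mapping onto the [0, 1] interval does, here is a minimal sketch of the logistic function. It is not taken from the VespaG code base: the function name `sigmoid` and the example scores are hypothetical, and VespaG may apply additional scaling that is not described here.

```python
import math


def sigmoid(score: float) -> float:
    """Logistic function: maps any real number into the [0, 1] interval."""
    # Branch on the sign so that exp() is only ever called on a
    # non-positive argument, which avoids overflow for extreme scores.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    exp_score = math.exp(score)
    return exp_score / (1.0 + exp_score)


# Hypothetical raw scores, for illustration only (not real VespaG output).
raw_scores = [-4.0, -1.0, 0.0, 2.5]
normalized = [sigmoid(s) for s in raw_scores]
```

The mapping is monotonic, so the ranking of variants is unchanged; only the scale of the scores differs.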
Examples
After installing the dependencies above and cloning the VespaG repo, you can try out the following examples:
- Run VespaG without precomputed embeddings for the example FASTA file with 3 sequences in `data/example/example.fasta`: `python -m vespag predict -i data/example/example.fasta`. This will save a CSV file for each sequence in the folder `./output`.
- Run VespaG with precomputed embeddings for the example FASTA file with 3 sequences in `data/example/example.fasta`: `python -m vespag predict -i data/example/example.fasta -e output/esm2_embeddings.h5 --single-csv`. This will save a single CSV file for all sequences in the folder `./output`.
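If you would rather launch these runs from a Python script, the commands above can be assembled programmatically. The following is a sketch only: `build_predict_command` and `run_predict` are hypothetical helper names, not part of VespaG, and they use only the command-line flags listed in this README.

```python
import subprocess
import sys
from typing import List, Optional


def build_predict_command(
    fasta: str,
    embeddings: Optional[str] = None,
    output: Optional[str] = None,
    single_csv: bool = False,
) -> List[str]:
    """Assemble the argument list for `python -m vespag predict`."""
    # sys.executable ensures the interpreter of the active environment is used.
    cmd = [sys.executable, "-m", "vespag", "predict", "-i", fasta]
    if embeddings is not None:
        cmd += ["-e", embeddings]
    if output is not None:
        cmd += ["-o", output]
    if single_csv:
        cmd.append("--single-csv")
    return cmd


def run_predict(**kwargs) -> None:
    """Run VespaG; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_predict_command(**kwargs), check=True)
```

For instance, `run_predict(fasta="data/example/example.fasta", embeddings="output/esm2_embeddings.h5", single_csv=True)` corresponds to the second example above.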
Re-training VespaG
VespaG uses DVC for pipeline orchestration and WandB for experiment tracking.
Using WandB is optional; a username and project for WandB can be specified in `params.yaml`.
DVC is required. A `dvc.yaml` file is in place that contains stages for generating pLM embeddings from FASTA files, but you can also download pre-computed embeddings and GEMME scores from our Zenodo repository. Adjust the paths in `params.yaml` to your setup, and feel free to play around with model parameters. Start a training run with `dvc repro -s train@<model_type>-{esm2|prott5}-<dataset>`, where `<model_type>` and `<dataset>` each correspond to a named block in `params.yaml`.
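The stage naming scheme can be sketched as a small helper that composes the `dvc repro` target from its three parts. The helper names `train_stage` and `repro_command` are hypothetical, and the `model_type` and `dataset` values you pass must match blocks that actually exist in your `params.yaml`; the placeholder values in the usage note below are illustrative only.

```python
from typing import List

# The two pLM choices named in the stage pattern above.
SUPPORTED_PLMS = ("esm2", "prott5")


def train_stage(model_type: str, plm: str, dataset: str) -> str:
    """Compose a stage name of the form train@<model_type>-<plm>-<dataset>."""
    if plm not in SUPPORTED_PLMS:
        raise ValueError(f"plm must be one of {SUPPORTED_PLMS}, got {plm!r}")
    return f"train@{model_type}-{plm}-{dataset}"


def repro_command(model_type: str, plm: str, dataset: str) -> List[str]:
    """Argument list for running a single training stage with DVC."""
    return ["dvc", "repro", "-s", train_stage(model_type, plm, dataset)]
```

Passing the result of `repro_command("<model_type>", "esm2", "<dataset>")` to `subprocess.run(..., check=True)` would then start the run, with both placeholders replaced by block names from `params.yaml`.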
Evaluation
You can reproduce our evaluation using the `eval` subcommand, which pre-processes data into a format usable by VespaG, runs `predict`, and computes performance metrics.
ProteinGym217
This evaluation is based on the ProteinGym (Notin et al. 2023) DMS substitutions benchmark, which we dub ProteinGym217. Run it with `python -m vespag eval proteingym`, with the following options:
Optional:
- `--reference-file`: Path to the ProteinGym reference file. If not provided, it is downloaded to `data/test/proteingym217/reference.csv` or `data/test/proteingym87/reference.csv`.
- `--dms-directory`: Path to the directory containing per-DMS score files in CSV format. If not provided, the files are downloaded to `data/test/proteingym217/raw_dms_files/` or `data/test/proteingym87/raw_dms_files/`.
- `--output/-o`: Path for saving a CSV with scores for all assays and variants, as well as a CSV with Spearman correlation coefficients for each DMS. Defaults to `./output/proteingym217` or `./output/proteingym87`.
- `--embeddings/-e`, `--id-map`, `--normalize-scores`: Identical to `predict`; used for the internal call to it.
- `--v1`: Evaluate on the first iteration of ProteinGym with 87 assays.
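For readers unfamiliar with the per-DMS metric, the sketch below computes a Spearman correlation coefficient in plain Python as the Pearson correlation of average ranks. It is a textbook formulation for illustration, not VespaG's own evaluation code (which may rely on a library); the function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all entries tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

Because only ranks enter the computation, any monotonic rescaling of the predicted scores leaves the coefficient unchanged.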
Preprint Citation
If you find VespaG helpful in your work, please be so kind as to cite our pre-print:
@article{vespag,
author = {Celine Marquet and Julius Schlensok and Marina Abakarova and Burkhard Rost and Elodie Laine},
title = {VespaG: Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction},
year = {2024},
doi = {10.1101/2024.04.24.590982},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2024/04/28/2024.04.24.590982},
journal = {bioRxiv}}