EVOLVEpro
EVOLVEpro interprets PLM embeddings through a top-layer regression model, learning the relationship between sequence and experimentally determined activity through an iterative active learning process. The lightweight random forest regression model can optimize multiple protein properties simultaneously across iterative rounds of testing, with as few as 10 experimental data points per round, enabling complex multi-objective evolution campaigns with minimal experimental setup.
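The core idea of a single round can be sketched as follows. This is an illustrative toy example using scikit-learn and synthetic data, not the actual EvolvePro API; all variable names and dimensions here are hypothetical stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins: 500 variants with 1280-dim embeddings (the per-residue
# width of ESM-2 650M is 1280; the values here are random placeholders).
embeddings = rng.normal(size=(500, 1280))
true_activity = embeddings[:, 0] + 0.1 * rng.normal(size=500)  # hidden ground truth

# Only a small labeled set (~10 measured variants) is available in a round.
measured_idx = rng.choice(500, size=10, replace=False)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(embeddings[measured_idx], true_activity[measured_idx])

# Predict activity for all unmeasured variants and nominate the top 10
# for experimental testing in the next round.
unmeasured = np.setdiff1d(np.arange(500), measured_idx)
preds = model.predict(embeddings[unmeasured])
next_round = unmeasured[np.argsort(preds)[::-1][:10]]
```

The nominated variants are then measured in the lab, and the new labels are folded back into the training set for the next round.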
We employed an optimized version of EVOLVEpro to evolve a number of proteins:
- EVOLVEpro-optimized PsaCas12f
- EVOLVEpro-optimized C143 antibody
- EVOLVEpro-optimized T7 RNAP
Overview
The EVOLVEpro workflow consists of four main steps:
- Process: Generate and clean FASTA and CSV files
- PLM: Extract protein language model (PLM) embeddings for all variants
- Run EVOLVEpro: Apply the model to either DMS or experimental data
- Plot: Prepare outputs and visualizations
Step-by-Step Description
1. Process
Generate and clean FASTA and CSV files containing protein variant sequences and their corresponding activity data.
For detailed instructions, see the Process README.
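The exact inputs are specified in the Process README; as a rough illustration of the kind of FASTA the workflow consumes, the sketch below enumerates every single amino-acid substitution of a toy wild-type sequence and writes one FASTA record per variant. The sequence, file name, and naming scheme (e.g. `M1A`) are hypothetical:

```python
# Enumerate single substitutions at each position of a toy wild-type sequence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
wt = "MKTAYIAKQR"  # hypothetical 10-residue wild type

records = []
for pos, wt_aa in enumerate(wt, start=1):
    for aa in AMINO_ACIDS:
        if aa == wt_aa:
            continue  # skip the wild-type residue itself
        variant = wt[:pos - 1] + aa + wt[pos:]
        records.append((f"{wt_aa}{pos}{aa}", variant))

# Write the variants as FASTA records (one header + one sequence line each).
with open("variants.fasta", "w") as fh:
    for name, seq in records:
        fh.write(f">{name}\n{seq}\n")
```

For a 10-residue sequence this yields 10 positions x 19 substitutions = 190 variants.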
2. PLM
Extract protein language model embeddings for all variants using various PLM models.
For detailed instructions, see the PLM README.
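A PLM emits one embedding per residue, while the downstream regressor needs one fixed-length vector per variant. A common way to bridge the two is mean-pooling over the sequence axis; whether a given extractor in this repo pools this way is documented in the PLM README, so treat the sketch below (with random numbers standing in for real PLM outputs) as an assumption-laden illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a PLM's per-residue output: (sequence_length, embedding_dim).
# ESM-2 650M, for example, produces 1280-dim vectors per residue.
per_residue = rng.normal(size=(120, 1280))

# Mean-pool over the sequence axis to get one fixed-length vector per variant,
# so variants of different lengths share a common feature space.
variant_embedding = per_residue.mean(axis=0)
```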
3. Run EVOLVEpro
Apply the EVOLVEpro model to optimize protein activity. There are two main workflows:
DMS Workflow
Use this workflow to optimize a few-shot model on a deep mutational scanning (DMS) dataset, where activity values are known for a large number of variants.
For detailed instructions, see the DMS README.
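Because a DMS dataset labels essentially every variant, it lets you simulate few-shot rounds and score how well a model trained on a handful of measurements recovers the true top variants. The sketch below shows the idea with synthetic data and scikit-learn; the metric (top-50 recall) and all names are illustrative, not the repo's actual evaluation scripts:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic DMS stand-in: activities "known" for all 1000 variants.
X = rng.normal(size=(1000, 64))
y = X[:, 0] + 0.2 * rng.normal(size=1000)

# Few-shot setting: train on a random handful, as in one EVOLVEpro round.
train_idx = rng.choice(1000, size=10, replace=False)
model = RandomForestRegressor(n_estimators=200, random_state=1)
model.fit(X[train_idx], y[train_idx])

# Since the full labels are known, score how many of the true top-50
# unseen variants the model ranks in its own top 50.
test_idx = np.setdiff1d(np.arange(1000), train_idx)
preds = model.predict(X[test_idx])
predicted_top = set(test_idx[np.argsort(preds)[::-1][:50]])
actual_top = set(test_idx[np.argsort(y[test_idx])[::-1][:50]])
recall = len(predicted_top & actual_top) / 50
```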
Experimental Workflow
Use this workflow for iterative experimental optimization of protein activity.
For detailed instructions, see the Experimental README.
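The experimental workflow repeats the fit-rank-measure cycle, folding each round's new measurements back into the training set. A minimal simulation of that loop, again with synthetic data rather than the actual EvolvePro entry points:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Synthetic library: in a real campaign only measured variants have labels.
X = rng.normal(size=(500, 64))
y = X[:, 0] + 0.2 * rng.normal(size=500)

measured = list(rng.choice(500, size=10, replace=False))
best_per_round = []

for round_no in range(4):
    model = RandomForestRegressor(n_estimators=100, random_state=round_no)
    model.fit(X[measured], y[measured])

    # Nominate the top 10 unmeasured variants, "measure" them, and add the
    # new data points to the training set for the next round.
    unmeasured = np.setdiff1d(np.arange(500), measured)
    preds = model.predict(X[unmeasured])
    picks = unmeasured[np.argsort(preds)[::-1][:10]]
    measured.extend(picks)
    best_per_round.append(y[measured].max())
```

Because the measured set only grows, the best observed activity is non-decreasing across rounds.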
4. Plot
Prepare outputs and create visualizations to interpret the results of the EVOLVEpro process.
For detailed instructions, see the Plot README.
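The canonical figures are produced by the plotting utilities described in the Plot README; as a generic example of the kind of visualization involved, the sketch below plots a hypothetical best-activity trajectory across rounds with matplotlib (the numbers are made up):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Hypothetical per-round results: best measured activity after each round.
rounds = [1, 2, 3, 4]
best_activity = [1.0, 1.8, 2.4, 2.6]

fig, ax = plt.subplots()
ax.plot(rounds, best_activity, marker="o")
ax.set_xlabel("Evolution round")
ax.set_ylabel("Best measured activity (fold over WT)")
ax.set_title("Activity trajectory across rounds")
fig.savefig("trajectory.png")
```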
Getting Started
Install
git clone https://github.com/mat10d/EvolvePro.git
cd EvolvePro
EVOLVEpro Environment
First, create and activate a conda environment with all necessary dependencies for EVOLVEpro:
conda env create -f environment.yml
conda activate evolvepro
Protein Language Models Environment
To install the underlying protein language models, we use a separate environment:
sh setup_plm.sh
conda activate plm
This environment includes:
- Deep learning frameworks (PyTorch)
- Protein language models that are installable via pip (ESM, ProtT5, UniRep, ankh)
- Protein language models that are only installable from their GitHub repositories (proteinbert, efficient-evolution)
These environments are kept separate to maintain clean dependencies and avoid conflicts between the core EVOLVEpro functionality and the various protein language models.
Colab Tutorial
For a step-by-step guide on using EVOLVEpro to improve a protein's activity, simulated on a small dataset that we used as part of the DMS work, see our Google Colab tutorial here.
Issues
If you encounter any bugs, have feature requests, or need assistance, please open an issue on our GitHub Issues page. When opening an issue, please:
- Check if a similar issue already exists
- Include a clear description of the problem
- Add steps to reproduce the issue if applicable
- Specify your environment details (OS, Python version, etc.)
- Include any relevant error messages or screenshots
We welcome contributions and feedback from the community.
Citation
If you use this code in your research, please cite our paper:
@article{jiang2024rapid,
  author={Jiang, Kaiyi and Yan, Zhaoqing and Di Bernardo, Matteo and Sgrizzi, Samantha R. and Villiger, Lukas and Kayabölen, Alişan and Kim, Byungji and Carscadden, Josephine K. and Hiraizumi, Masahiro and Nishimasu, Hiroshi and Gootenberg, Jonathan S. and Abudayyeh, Omar O.},
  title={Rapid protein evolution by few-shot learning with a protein language model},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.07.17.604015}
}