CARBonAra: Context-aware geometric deep learning for protein sequence design
Overview
CARBonAra is a deep learning framework for protein sequence design that works directly from atomic coordinates, which makes sequence generation context-aware. It is particularly useful when a design must account for its molecular environment, including non-protein entities, and it gives protein engineers more control over the result.
Features
- Geometric Transformer: The framework uses a geometric transformer model based only on atomic coordinates and atomic elements, allowing it to handle any protein backbone scaffold and a variety of molecular environments.
- Context Awareness: CARBonAra's design accounts for molecular environments, including non-protein entities, providing context-aware sequence generation.
- Imprint Sequence Sampling: CARBonAra's imprint sampling method provides diverse sequences, balancing design flexibility with high-confidence predictions.
Install
You can install CARBonAra in a few minutes using pip and conda:
- Clone the repository:
git clone https://github.com/LBM-EPFL/CARBonAra
cd CARBonAra
- Create and activate a new conda environment:
conda create -n carbonara
conda activate carbonara
- Install the package and dependencies:
pip install .
Usage
Command line tool
To generate sequences using a specific protein structure:
carbonara --num_sequences 100 --imprint_ratio 0.5 examples/pdbs/2oob.pdb outputs
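To run the same command over several structures, a small wrapper can help. The following is a minimal sketch and not part of CARBonAra: build_command and design_all are hypothetical helper names, and the sketch assumes that the carbonara executable is on the PATH and accepts exactly the arguments shown above. Writing each structure's results to its own output directory is also an assumption made for the example.

```python
import subprocess
from pathlib import Path


def build_command(pdb_path, output_dir, num_sequences=100, imprint_ratio=0.5):
    """Assemble the command line shown above (hypothetical helper)."""
    return [
        "carbonara",
        "--num_sequences", str(num_sequences),
        "--imprint_ratio", str(imprint_ratio),
        str(pdb_path),
        str(output_dir),
    ]


def design_all(pdb_dir, output_root, **options):
    """Run the CLI once per .pdb file, each with its own output directory."""
    for pdb_path in sorted(Path(pdb_dir).glob("*.pdb")):
        output_dir = Path(output_root) / pdb_path.stem
        output_dir.mkdir(parents=True, exist_ok=True)
        # check=True raises CalledProcessError if the command fails
        subprocess.run(build_command(pdb_path, output_dir, **options), check=True)


# Example (assumes the repository's examples folder):
# design_all("examples/pdbs", "outputs", num_sequences=10)
```

Passing the arguments as a list avoids shell quoting problems with file paths that contain spaces.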
Python package
To use CARBonAra directly in a Python script:
from carbonara import CARBonAra, imprint_sampling
# load model
carbonara = CARBonAra(device_name="cuda")
# sample sequences
sequences, scores, pssm, structure_scaffold = imprint_sampling(
    carbonara=carbonara,
    pdb_filepath="examples/pdbs/1zns.pdb",  # input structure
    num_sample=100,  # number of sequences to sample
    imprint_ratio=0.5,  # control sampling diversity with prior
)
For more detailed examples and use cases, see quickstart.ipynb.
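The script above hard-codes device_name="cuda", which fails on machines without a GPU. Below is a minimal sketch of a fallback. It assumes that CARBonAra runs on PyTorch, which this document does not state explicitly. pick_device_name is a hypothetical helper and not part of the package.

```python
def pick_device_name():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed, so there is nothing to query
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# Intended use with the script above:
# carbonara = CARBonAra(device_name=pick_device_name())
```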
Functionalities
required arguments
- pdb_filepath (file path): Input scaffold structure file path. The program will automatically remove hydrogens and side chains from the scaffold. The default is to keep the 'hetatm' and water, but this behavior can be toggled.
- num_sample (number): Number of sequences to sample using CARBonAra.
- imprint_ratio (between 0.0 and 1.0): Ratio of the prediction to use as prior information for sampling. We generate diversity in the sequence space by imprinting predicted prior sequence information at randomly selected positions. The imprinting of predicted prior information creates variability in the final position-specific scoring matrix (PSSM) used to sample sequences. The sequences are sampled from the maximum confidence of the final PSSM to guarantee high-confidence sequences. The imprinting selects how much prior should be added: 0.0 means no prior information, and 1.0 means all the positions will contain prior information to bias the prediction.
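To make the role of imprint_ratio concrete, here is a toy sketch of picking random positions and imprinting prior residues on them. It is an illustration under stated assumptions, not CARBonAra's implementation: the function names, the rounding of the number of imprinted positions, and the "X" placeholder for positions without prior are all hypothetical.

```python
import random


def choose_imprint_positions(sequence_length, imprint_ratio, seed=None):
    """Randomly pick the positions that receive prior information (toy version)."""
    if not 0.0 <= imprint_ratio <= 1.0:
        raise ValueError("imprint_ratio must be between 0.0 and 1.0")
    rng = random.Random(seed)
    # assumption: the ratio is rounded to a whole number of positions
    num_imprinted = round(imprint_ratio * sequence_length)
    return sorted(rng.sample(range(sequence_length), num_imprinted))


def imprint_prior(predicted_sequence, positions, unknown="X"):
    """Keep the predicted residue at imprinted positions and mask the rest."""
    keep = set(positions)
    return "".join(
        aa if i in keep else unknown for i, aa in enumerate(predicted_sequence)
    )


# Example: imprint half of a 10-residue prediction
# positions = choose_imprint_positions(10, 0.5, seed=0)
# prior = imprint_prior("MKTAYIAKQR", positions)
```

With a ratio of 0.0 every position is masked, and with 1.0 the full predicted sequence is used as prior, matching the two extremes described above.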
optional arguments
- b_sampled (default: True): If true, the prior is sampled from probabilities for more diversity; if false, the prior is taken from the maximum-confidence prediction. This flag controls the type of prior information imprinted to generate variability in the final position-specific scoring matrix (PSSM). Sampling from probabilities (True) converts the prediction confidence into probabilities and then samples amino acids from them, which gives higher diversity while maintaining reasonable sequence confidence. Maximum confidence (False) uses the maximum-confidence prediction as prior information, which gives high-confidence predictions but low diversity.
- known_chains (default: []): List of known chains (e.g. ['A', 'B']) for partial sequence prediction. The software will not predict new sequences for the selected chains. Moreover, the software will use the sequence information of the selected chains as prior information for the prediction.
- known_positions (default: []): List of known sequence positions (e.g. [37, 38, 39, 40]) for partial sequence prediction. The software will not predict new sequences for the selected sequence positions. Moreover, the software will use the sequence information of the chosen positions as prior information for the prediction.
- unknown_positions (default: []): List of unknown sequence positions (e.g. [37, 38, 39, 40]). This option overrides the other known flags (known_chains and known_positions) and should be used when only a relatively small part of the structure should be redesigned. The software will only predict new sequences for the selected sequence positions and will use the rest of the sequence as prior information for the prediction.
- ignored_amino_acids (default: []): List of amino acids to completely ignore for the sequence sampling (e.g. ['C']). The prior information and generated sequences will not contain the selected amino acids.
- ignore_hetatm (default: False): Flag to ignore hetatm in the structure. By default, the software uses all atoms in the input file for the prediction, such as ligands, lipids, ions, and water.
- ignore_wat (default: False): Flag to ignore water in the structure. By default, the software uses the water molecules in the input file for the prediction.
- device (default: cuda): Device used to run the model: cpu or cuda.
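Two of the options above are easier to follow with a toy example: how b_sampled and ignored_amino_acids affect the choice of a prior residue, and how unknown_positions takes precedence over known_chains and known_positions. The sketch below is illustrative only and does not reproduce CARBonAra's code. The function names, the representation of residues as (chain, position) pairs, and the normalization of confidences into probabilities are assumptions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pick_prior_residue(confidences, b_sampled=True, ignored_amino_acids=(), rng=None):
    """Choose one prior residue from 20 non-negative confidence values."""
    rng = rng or random.Random()
    allowed = [
        (aa, conf)
        for aa, conf in zip(AMINO_ACIDS, confidences)
        if aa not in ignored_amino_acids
    ]
    if not allowed:
        raise ValueError("every amino acid is ignored")
    if not b_sampled:
        # maximum confidence: deterministic, low diversity
        return max(allowed, key=lambda pair: pair[1])[0]
    letters = [aa for aa, _ in allowed]
    weights = [conf for _, conf in allowed]
    if sum(weights) <= 0:
        # no usable signal: fall back to a uniform draw
        return rng.choice(letters)
    # random.choices treats the weights as relative probabilities
    return rng.choices(letters, weights=weights, k=1)[0]


def designable_mask(residues, known_chains=(), known_positions=(), unknown_positions=()):
    """Return True for each (chain, position) residue that would be redesigned."""
    if unknown_positions:
        # unknown_positions overrides both known_* options
        redesign = set(unknown_positions)
        return [position in redesign for _, position in residues]
    fixed_chains = set(known_chains)
    fixed_positions = set(known_positions)
    return [
        chain not in fixed_chains and position not in fixed_positions
        for chain, position in residues
    ]
```

Residues where the mask is False keep their original sequence, which then serves as prior information for the prediction.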
Reproducibility
Repository structure
- model: Code for the training of the model and the trained model.
- src: Functions used throughout the repository.
- examples: A few examples of inputs and outputs to get started.
- results: Code to reproduce the results presented in the article.
Results
- model_analysis: Analysis and benchmarking of the model, including its context awareness.
- model_comparison: Comparison with ProteinMPNN and ESM-IF1.
- md_analysis: Analysis of the predictions of CARBonAra when applied to molecular dynamics simulations.
- cole7: Case study of Colicin E7 (PDB ID: 1ZNS).
- tem1: Case study of TEM-1 (PDB ID: 1BT5 / 1JTG).
Anaconda environment
To replicate the specific environment used for development, create and activate it using:
conda env create -f carbonara.yml
conda activate carbonara
ESM-IF1 integration
For additional benchmarking with ESM-IF1, install it as follows:
conda create -n inverse python=3.9
conda activate inverse
conda install pytorch cudatoolkit=11.3 -c pytorch
conda install pyg -c pyg -c conda-forge
conda install pip
pip install ipykernel biotite
pip install git+https://github.com/facebookresearch/esm.git
License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.