Protein sequence design with a learned potential
Code for the algorithm in our paper
Namrata Anand-Achim, Raphael R. Eguchi, Alexander Derry, Russ B. Altman, and Possu Huang. "Protein sequence design with a learned potential." bioRxiv (2020). [biorxiv] [cite]
Entirely AI designed four-fold symmetric TIM-barrel
Requirements
- Python 3
- PyTorch
- PyRosetta4
- Python packages in requirements.txt
- Download pretrained models here
See here for set-up instructions on Ubuntu 18.04 with Miniconda, Python 3.7, PyTorch 1.1.0, CUDA 9.0.
Design
If you'd like to use the pretrained models to run design, jump to the Running design section.
Generating data
Data is available here. See the README in the drive for more information about the uploaded files. The domain IDs used to generate the coordinates are listed in .txt files (see data/train_domains_s95.txt and data/test_domain_s95.txt); these files are the inputs for regenerating the dataset. If you don't have the PDB files downloaded, the script will download them and save them to pdb_dir.
If you'd like to regenerate the dataset or change the underlying data, run the following commands.
To load and save coordinates for the backbone (BB) only model:
python load_and_save_bb_coords.py --save_dir PATH_TO_SAVE_DATA --pdb_dir PATH_TO_PDB_FILES --log_dir PATH_TO_LOG_DIR --txt PATH_TO_DOMAIN_TXT_FILE
To load and save coordinates for the main model:
python load_and_save_coords.py --save_dir PATH_TO_SAVE_DATA --pdb_dir PATH_TO_PDB_FILES --log_dir PATH_TO_LOG_DIR --txt PATH_TO_DOMAIN_TXT_FILE
Training the models
Pretrained models are available here, but you can also use the available scripts to train from scratch.
To train the baseline model (residue and autoregressive rotamer prediction conditioned on backbone (BB) atoms only, with no side-chains):
python train_autoreg_chi_baseline.py --batchSize 4096 --workers 12 --lr 1.5e-4 --validation_frequency 100 --save_frequency 1000 --log_dir PATH_TO_LOG_DIR --data_dir PATH_TO_DATA
To train the main model (residue and autoregressive rotamer prediction conditioned on neighboring side-chains):
python train_autoreg_chi.py --batchSize 2048 --workers 12 --lr 7.5e-5 --validation_frequency 200 --save_frequency 2000 --log_dir PATH_TO_LOG_DIR --data_dir PATH_TO_DATA
Note that training was originally done across 8 V100 GPUs with DataParallel mode.
Running design
To run a design trajectory, specify the starting backbone with an input PDB:
python run.py --pdb pdbs/3mx7_gt.pdb
To run a rotamer repacking trajectory with the model, set the --repack_only option:
python run.py --pdb pdbs/3mx7_gt.pdb --repack_only 1
To enforce k-fold symmetry in design or packing, set the symmetry options:
python run.py --pdb pdbs/tim10.pdb --symmetry 1 --k 4 [--repack_only 1]
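As a rough illustration of what k-fold symmetry means for a single-chain input: the pose length must be divisible by k, so each position in the first repeat unit has k-1 equivalent positions elsewhere in the chain. The sketch below assumes that equivalent positions are spaced exactly n/k residues apart, which this README does not confirm. `symmetric_groups` is a hypothetical helper, not part of the repository.

```python
def symmetric_groups(n_residues, k):
    """Group 0-indexed positions that are equivalent under k-fold symmetry.

    Assumption (not taken from run.py): repeat units are contiguous and of
    equal length, so position i pairs with i + j * (n_residues // k).
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if n_residues % k != 0:
        raise ValueError("pose length must be divisible by k")
    unit = n_residues // k
    return [[i + j * unit for j in range(k)] for i in range(unit)]


groups = symmetric_groups(8, 4)
print(groups)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Under this assumption, every position in a group would receive the same residue identity during symmetric design.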
To constrain a subset of positions to remain fixed, point to a txt file with fixed residue indices, for example:
python run.py --pdb pdbs/tim10.pdb --fixed_idx txt/test_idx.txt
To instead constrain a subset of positions to be designed, keeping all others fixed, point to a txt file with variable residue indices, for example:
python run.py --pdb pdbs/tim10.pdb --var_idx txt/test_idx.txt
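Because --fixed_idx and --var_idx are mutually exclusive and both use 0-indexed pose positions, it can help to derive one set from the other. The following is a minimal sketch that assumes the index file holds whitespace-separated integers (one per line when written). The exact layout run.py expects is not stated in this README, and the helper names are hypothetical.

```python
import os
import tempfile


def write_indices(path, indices):
    """Write sorted, de-duplicated 0-indexed positions, one per line (assumed layout)."""
    with open(path, "w") as handle:
        for idx in sorted(set(indices)):
            handle.write(f"{idx}\n")


def read_indices(path):
    """Read whitespace-separated integer indices from a text file."""
    with open(path) as handle:
        return [int(token) for token in handle.read().split()]


def complement(indices, n_residues):
    """Return the positions in range(n_residues) that are NOT in `indices`."""
    chosen = set(indices)
    bad = sorted(i for i in chosen if not 0 <= i < n_residues)
    if bad:
        raise ValueError(f"indices out of range for a {n_residues}-residue pose: {bad}")
    return [i for i in range(n_residues) if i not in chosen]


# Example: fix positions 0-2 of a 6-residue pose, then derive the variable set.
with tempfile.TemporaryDirectory() as tmp:
    fixed_path = os.path.join(tmp, "fixed_idx.txt")
    write_indices(fixed_path, [2, 0, 1, 1])
    fixed = read_indices(fixed_path)
    variable = complement(fixed, n_residues=6)

print(fixed)     # [0, 1, 2]
print(variable)  # [3, 4, 5]
```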
See below for additional design parameters.
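Since design runs are non-deterministic, it is common to launch several trajectories per backbone. The sketch below only assembles command lines from flags documented in this README (--pdb, --seed, --log_dir). The `build_commands` helper and the per-run log folder naming are illustrative assumptions, not part of the repository.

```python
import os
import shlex


def build_commands(pdb_paths, seeds, log_root="logs"):
    """Return one `python run.py ...` argument list per (PDB, seed) pair."""
    commands = []
    for pdb in pdb_paths:
        # Use the file name without extension to label the log folder.
        name = os.path.splitext(os.path.basename(pdb))[0]
        for seed in seeds:
            commands.append([
                "python", "run.py",
                "--pdb", pdb,
                "--seed", str(seed),
                "--log_dir", f"{log_root}/{name}_seed{seed}",
            ])
    return commands


commands = build_commands(["pdbs/3mx7_gt.pdb"], seeds=[2, 3])
for cmd in commands:
    # Print a shell-safe version of each command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` from the repository root to actually start the run.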
Monitoring metrics
Design metrics can be monitored using TensorBoard:
tensorboard --logdir='./logs'
Note that the input PDB sequence and rotamers are considered 'ground-truth' for sequence and rotamer recovery metrics.
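For intuition, sequence recovery can be thought of as the fraction of positions where the designed residue matches the residue in the input PDB. The sketch below assumes that definition. This README does not spell out how the metric is computed, and `sequence_recovery` is a hypothetical helper rather than the repository's code.

```python
def sequence_recovery(designed, reference):
    """Fraction of positions where the designed residue matches the reference."""
    if len(designed) != len(reference):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must be non-empty")
    matches = sum(d == r for d, r in zip(designed, reference))
    return matches / len(reference)


native = "MKVLAAGIVG"  # made-up sequence standing in for the input PDB
design = "MKILAAGLVG"  # made-up designed sequence (differs at two positions)
print(sequence_recovery(design, native))  # 0.8
```

If the input PDB is itself a designed or poly-alanine structure, recovery against it says little about native-likeness, so interpret the metric relative to the chosen input.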
Design parameters
- Design inputs
--pdb Path to input PDB
--model_list Paths to conditional models. (Default: ['models/conditional_model_0.pt',
'models/conditional_model_1.pt', 'models/conditional_model_2.pt',
'models/conditional_model_3.pt'])
--init_model Path to baseline model for sequence initialization.
(Default: 'models/baseline_model.pt')
- Saving / logging
--log_dir Path to desired output log folder for designed
structures. (Default: ./logs)
--seed Random seed. Design runs are non-deterministic.
(Default: 2)
--save_rate How often to save intermediate designed structures
(Default: 10)
- Sequence initialization
--randomize {0,1} Randomize starting sequence/rotamers for design.
Toggle to 0 to keep starting sequence and rotamers.
(Default: 1)
--no_init_model {0,1} Do not use baseline model to predict initial sequence/rotamers.
(Default: 0)
--ala {0,1} Initialize sequence with poly-alanine. (Default: 0)
--val {0,1} Initialize sequence with poly-valine. (Default: 0)
- Rotamer repacking parameters
--repack_only {0,1} Only run rotamer repacking. (Default: 0)
--use_rosetta_packer {0,1}
Use the Rosetta packer instead of the model for
rotamer repacking during design. If in symmetry
mode, rotamers are not packed symmetrically. (Default: 0)
--pack_radius Radius in angstroms for Rosetta rotamer packing after
residue mutation. Must set --use_rosetta_packer 1
(Default: 0)
- Design parameters
--symmetry {0,1} Enforce symmetry during design (Default: 0)
--k Enforce k-fold symmetry. Input pose length must be
divisible by k. Requires --symmetry 1 (Default: 4)
--restrict_gly {0,1} Enforce no glycines for non-loop backbone positions
based on DSSP assignment. (Default: 1)
--no_cys {0,1} Enforce no cysteines in design (Default: 0)
--no_met {0,1} Enforce no methionines in design (Default: 0)
--var_idx Path to txt file listing pose indices that should be
designed/packed, all other side-chains will remain
fixed. Cannot be specified if fixed_idx file given.
Not supported with symmetry mode. 0-indexed
--fixed_idx Path to txt file listing pose indices that should NOT
be designed/packed, all other side-chains will be
designed/packed. Cannot be specified if var_idx file given.
Not supported with symmetry mode. 0-indexed
--resfile Enforce resfile on particular residues. 0-indexed
learn more about resfile
- Sampling / optimization parameters
--anneal {0,1} Option to do simulated annealing of average negative
model pseudo-log-likelihood. Toggle to 0 to do vanilla
blocked sampling (Default: 1)
--step_rate Multiplicative step rate for simulated annealing (Default: 0.995)
--anneal_start_temp Starting temperature for simulated annealing (Default: 1)
--anneal_final_temp Final temperature for simulated annealing (Default: 0)
--n_iters Total number of iterations (Default: 2500)
--threshold Threshold in angstroms for defining conditionally
independent residues for blocked sampling (should be
greater than ~17.3) (Default: 20)
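To make the annealing parameters concrete, here is a minimal sketch of a multiplicative cooling schedule with a Metropolis acceptance test. It assumes the temperature is multiplied by --step_rate once per iteration and floored at --anneal_final_temp, and that moves are accepted based on the change in energy (the average negative pseudo-log-likelihood). Both are assumptions about how run.py uses these options, and the function names are hypothetical.

```python
import math
import random


def temperature(iteration, start_temp=1.0, final_temp=0.0, step_rate=0.995):
    """Geometric cooling: start_temp * step_rate**iteration, floored at final_temp."""
    return max(start_temp * step_rate ** iteration, final_temp)


def accept(delta_energy, temp, rng=random):
    """Metropolis criterion: always accept improvements, otherwise accept with
    probability exp(-delta_energy / temp)."""
    if delta_energy <= 0:
        return True
    if temp <= 0:
        # At zero temperature only improvements are accepted.
        return False
    return rng.random() < math.exp(-delta_energy / temp)


# Inspect the assumed schedule at a few iterations of a 2500-step run.
for it in (0, 100, 1000, 2499):
    print(it, temperature(it))
```

Under these assumptions and the default values, the temperature decays to the order of 1e-6 by the end of 2500 iterations, so late-stage sampling would be close to greedy.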
Additional information
- Code expects single chain PDB input.
- Specifying fixed/variable indices not currently supported in symmetry mode.
- Model rotamer packing in symmetry mode does symmetric rotamer packing, but using the Rosetta packer does not.
Citation
If you find our work relevant to your research, please cite:
@article{anand2020protein,
title={Protein sequence design with a learned potential},
author={Anand, Namrata and Eguchi, Raphael Ryuichi and Derry, Alexander and Altman, Russ B and Huang, Possu},
journal={bioRxiv},
year={2020},
publisher={Cold Spring Harbor Laboratory}
}