TamGent
Tailoring Molecules for Protein Pockets: a Transformer-based Generative Solution for Structured-based Drug Design
Introduction
Code base: fairseq-v0.8.0
Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.
Installation
git clone https://github.com/HankerWu/TamGent.git
cd TamGent
git checkout main
conda create -n TamGent python=3.7 -y
conda activate TamGent
conda install rdkit -c conda-forge -y
python -m pip install -e .[chem]
Dataset
The dataset is available at data.
Build customized dataset
You can build your customized dataset through the following methods:
- Build a customized dataset from PDB IDs. The script automatically finds the binding sites according to the ligands in the structure file.
python scripts/build_data/prepare_pdb_ids.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} -t ${threshold}
PDB_ID_LIST format: CSV with columns ([] means optional): pdb_id,[ligand_inchi,uniprot_id]
- Build a customized dataset from PDB IDs, using the center coordinates of the binding site of each PDB entry.
python scripts/build_data/prepare_pdb_ids_center.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} -t ${threshold}
PDB_ID_LIST format: CSV with columns ([] means optional): pdb_id, center_x, center_y, center_z, [uniprot_id]
- Build a customized dataset from a PDB ID list, using the residue IDs (indexes) of the binding site of each PDB entry.
python scripts/build_data/prepare_pdb_ids_res_ids.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} --res-ids-fn ${RES_IDS_FN}
PDB_ID_LIST format: CSV with columns ([] means optional): pdb_id,[uniprot_id]
RES_IDS_FN format: filename of the residue IDs, a dict stored as a pickle file, like: { 0: { chain_id_A: Array[res_id_A1, res_id_A2, ...], chain_id_B: Array[res_id_B1, res_id_B2, ...], ... }, 1: { ... }, ... }. The order of the entries is the same as in PDB_ID_LIST.
For customized PDB structure files, put the structure files in the --pdb-path folder and put their filenames in the pdb_id column of the PDB_ID_LIST CSV file.
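The input files described above can be prepared with a few lines of Python. The following is a minimal sketch, not part of the repository: the helper names, the PDB IDs, the chain IDs, and the residue numbers are all hypothetical. It also assumes that the CSV starts with a header row naming the columns and that plain Python lists are accepted where the format description says Array; check both against the scripts before relying on it.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def write_pdb_id_list(path, pdb_ids):
    """Write a PDB_ID_LIST CSV containing only the required pdb_id column.

    Assumption: the build scripts expect a header row with the column names.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["pdb_id"])
        for pdb_id in pdb_ids:
            writer.writerow([pdb_id])


def write_res_ids(path, res_ids):
    """Pickle the residue-ID dict used as RES_IDS_FN."""
    with open(path, "wb") as f:
        pickle.dump(res_ids, f)


# Hypothetical entries; replace them with your own PDB IDs.
pdb_ids = ["1abc", "2xyz"]

# Keys 0, 1, ... follow the row order of PDB_ID_LIST.
# Each value maps a chain ID to the residue IDs of the binding site.
# Assumption: plain lists are accepted in place of arrays.
res_ids = {
    0: {"A": [45, 46, 89], "B": [12, 13]},
    1: {"A": [101, 102, 150]},
}

if __name__ == "__main__":
    out_dir = Path(tempfile.mkdtemp())
    write_pdb_id_list(out_dir / "pdb_ids.csv", pdb_ids)
    write_res_ids(out_dir / "res_ids.pkl", res_ids)
    print("Wrote example inputs to", out_dir)
```

The two resulting paths would then be passed as ${PDB_ID_LIST} and ${RES_IDS_FN} to prepare_pdb_ids_res_ids.py.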
Model
The pretrained model is available at model.
Run scripts
# train a new model
bash scripts/train.sh -D ${DATA_PATH} --savedir ${SAVED_MODEL_PATH}
# generate molecules
bash scripts/generate.sh -b ${BEAM_SIZE} -s ${SEED} -D ${DATA_PATH} --dataset ${TESTSET_NAME} --ckpt ${MODEL_PATH} --savedir ${OUTPUT_PATH}
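To generate with several seeds, the generate command above can be assembled programmatically. This is a rough sketch that uses only the flags shown above; the helper name, the beam size, and all paths and dataset names are hypothetical placeholders.

```python
import shlex


def build_generate_command(beam_size, seed, data_path, testset_name, ckpt, savedir):
    """Return the scripts/generate.sh invocation as an argument list."""
    return [
        "bash", "scripts/generate.sh",
        "-b", str(beam_size),
        "-s", str(seed),
        "-D", data_path,
        "--dataset", testset_name,
        "--ckpt", ckpt,
        "--savedir", savedir,
    ]


# Hypothetical values; substitute your own data, checkpoint, and output paths.
commands = [
    build_generate_command(
        beam_size=20,
        seed=seed,
        data_path="data/my_dataset",
        testset_name="my_testset",
        ckpt="checkpoints/model.pt",
        savedir="output/seed{}".format(seed),
    )
    for seed in (1, 2, 3)
]

for cmd in commands:
    # Print a shell-safe form of each command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each argument list can be passed to subprocess.run(cmd, check=True) from the repository root to launch the run.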
Citation
Please cite as:
@inproceedings{TamGent,
title = {Tailoring Molecules for Protein Pockets: A Transformer-based Generative Solution for Structured-based Drug Design},
author = {Kehan Wu and Yingce Xia and Yang Fan and Pan Deng and Lijun Wu and Shufang Xie and Tong Wang and Haiguang Liu and Tao Qin and Tie-Yan Liu},
year = {2022},
}