
TamGent

Tailoring Molecules for Protein Pockets: a Transformer-based Generative Solution for Structured-based Drug Design

Introduction

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.

Installation

git clone https://github.com/HankerWu/TamGent.git
cd TamGent
git checkout main

conda create -n TamGent python=3.7 -y
conda activate TamGent
conda install rdkit -c conda-forge -y
python -m pip install -e .[chem]

Dataset

The dataset is available at data.

Build customized dataset

You can build your customized dataset through the following methods:

Build a customized dataset from PDB IDs. The script automatically finds the binding sites based on the ligands in the structure file.
```
python scripts/build_data/prepare_pdb_ids.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} -t ${threshold}
```
PDB_ID_LIST format: a CSV file with the following columns (columns in [] are optional):

pdb_id,[ligand_inchi,uniprot_id]
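
The column layout above can be produced with a short Python script. This is a minimal sketch, not part of the TamGent scripts: it assumes the CSV starts with a header row and that optional fields may be left empty, neither of which is stated above. The helper name `write_pdb_id_list`, the output filename, and the example entries are hypothetical placeholders.

```python
import csv
from pathlib import Path


def write_pdb_id_list(rows, path):
    """Write rows to a CSV with the columns pdb_id, ligand_inchi, uniprot_id.

    Assumption: a header row is expected and optional fields may stay empty.
    """
    fieldnames = ["pdb_id", "ligand_inchi", "uniprot_id"]
    with open(path, "w", newline="") as handle:
        # restval fills in columns that a row does not provide.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
    return Path(path)


# Placeholder entries for illustration only.
example_rows = [
    {"pdb_id": "1abc"},
    {"pdb_id": "2xyz", "uniprot_id": "P00000"},
]
csv_path = write_pdb_id_list(example_rows, "pdb_id_list.csv")
```

The resulting file can then be passed as `${PDB_ID_LIST}` to the command above. If the script rejects the header row, drop the `writer.writeheader()` call.
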

Build a customized dataset from PDB IDs, using the center coordinates of each structure's binding site.
```
python scripts/build_data/prepare_pdb_ids_center.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} -t ${threshold}
```
PDB_ID_LIST format: a CSV file with the following columns (columns in [] are optional):

pdb_id, center_x, center_y, center_z, [uniprot_id]
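
The text above does not say how the center coordinates are obtained. One plausible option, shown here purely as an assumption, is to take the geometric center of a known ligand's atom coordinates. The sketch below computes that centroid and writes one CSV row. The coordinates, the PDB ID, the output filename, and the header row are all hypothetical.

```python
import csv


def centroid(coords):
    """Return the arithmetic mean of a list of (x, y, z) tuples."""
    n = len(coords)
    if n == 0:
        raise ValueError("need at least one coordinate")
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


# Hypothetical ligand atom coordinates in angstroms.
ligand_coords = [(1.0, 2.0, 3.0), (3.0, 4.0, 5.0), (2.0, 0.0, 4.0)]
center = centroid(ligand_coords)

with open("pdb_id_list_center.csv", "w", newline="") as handle:
    writer = csv.writer(handle)
    # Assumption: the script expects a header row with these column names.
    writer.writerow(["pdb_id", "center_x", "center_y", "center_z", "uniprot_id"])
    # The optional uniprot_id column is left empty for this placeholder entry.
    writer.writerow(["1abc", *center, ""])
```
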

Build a customized dataset from PDB IDs, using the residue IDs (indexes) of each structure's binding site.
```
python scripts/build_data/prepare_pdb_ids_res_ids.py ${PDB_ID_LIST} ${DATASET_NAME} -o ${OUTPUT_PATH} --res-ids-fn ${RES_IDS_FN}
```
PDB_ID_LIST format: a CSV file with the following columns (columns in [] are optional):

pdb_id,[uniprot_id]

RES_IDS_FN format: the residue IDs filename. The file holds a dict like:
```
{
  0:
    {
      chain_id_A: Array[res_id_A1, res_id_A2, ...],
      chain_id_B: Array[res_id_B1, res_id_B2, ...],
      ...
    },
  1:
    {
      ...
    },
  ...
}  
```
The dict is stored as a pickle file, and its entries follow the same order as the rows of PDB_ID_LIST.
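
A residue IDs file with this layout could be written as follows. This is a minimal sketch under stated assumptions: the outer keys are 0-based row positions in PDB_ID_LIST, and plain Python lists are accepted where the layout says `Array` (switch to NumPy arrays if the script requires them). The chain IDs, residue IDs, and the filename `res_ids.pkl` are placeholders.

```python
import pickle

# Outer keys: position of the entry in PDB_ID_LIST (assumed 0-based).
# Inner keys: chain IDs; values: residue IDs of the binding site (placeholders).
res_ids = {
    0: {"A": [45, 46, 89], "B": [12, 13]},
    1: {"A": [101, 102, 150]},
}

with open("res_ids.pkl", "wb") as handle:
    pickle.dump(res_ids, handle)

# Read the file back to confirm that the structure survives the round trip.
with open("res_ids.pkl", "rb") as handle:
    loaded = pickle.load(handle)
```

The path `res_ids.pkl` would then be passed to `--res-ids-fn`.
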

For customized PDB structure files, put the files in the --pdb-path folder and enter their filenames in the pdb_id column of the PDB_ID_LIST CSV file.

Model

The pretrained model is available at model.

Run scripts

# train a new model
bash scripts/train.sh -D ${DATA_PATH} --savedir ${SAVED_MODEL_PATH}

# generate molecules
bash scripts/generate.sh -b ${BEAM_SIZE} -s ${SEED} -D ${DATA_PATH} --dataset ${TESTSET_NAME} --ckpt ${MODEL_PATH} --savedir ${OUTPUT_PATH}
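
To run generation with several seeds, the command line can be assembled in Python. The sketch below only builds and prints the commands, using the flags shown above. The function name, beam size, paths, and dataset name are hypothetical placeholders, and nothing is executed.

```python
import shlex


def build_generate_command(beam_size, seed, data_path, dataset, ckpt, savedir):
    """Return the argument list for scripts/generate.sh with the flags shown above."""
    return [
        "bash", "scripts/generate.sh",
        "-b", str(beam_size),
        "-s", str(seed),
        "-D", data_path,
        "--dataset", dataset,
        "--ckpt", ckpt,
        "--savedir", savedir,
    ]


# Placeholder values; replace them with your own paths and settings.
commands = [
    build_generate_command(
        20, seed, "data", "my_testset", "checkpoints/model.pt", f"outputs/seed_{seed}"
    )
    for seed in (1, 2, 3)
]

for command in commands:
    # Print a shell-safe command line (works on Python 3.7, which lacks shlex.join).
    print(" ".join(shlex.quote(part) for part in command))
```

To launch the runs, pass each `command` list to `subprocess.run(command, check=True)` from the repository root.
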

Citation

Please cite as:

@inproceedings{TamGent,
  title = {Tailoring Molecules for Protein Pockets: A Transformer-based Generative Solution for Structured-based Drug Design},
  author = {Kehan Wu and Yingce Xia and Yang Fan and Pan Deng and Lijun Wu and Shufang Xie and Tong Wang and Haiguang Liu and Tao Qin and Tie-Yan Liu},
  year = {2022},
}