OntoProtein

This is the implementation of the ICLR 2022 paper "OntoProtein: Protein Pretraining With Gene Ontology Embedding". OntoProtein is an effective method that incorporates the structure of GO (Gene Ontology) into a text-enhanced protein pre-training model.

<div align=center><img src="resources/img/model.png" width="80%" height="80%" /></div>

Quick links

Overview

<span id="overview"></span>

In this work we present OntoProtein, a knowledge-enhanced protein language model that jointly optimizes the knowledge embedding (KE) and masked language modeling (MLM) objectives, which brings excellent improvements to a wide range of protein tasks. We also introduce ProteinKG25, a new large-scale KG dataset, to promote research on protein language pre-training.
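Roughly, pre-training minimizes the sum of the two objectives; the unweighted sum below is our simplification, see the paper for the exact formulation:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{KE}}$$

where $\mathcal{L}_{\mathrm{MLM}}$ is the masked language modeling loss over protein sequences and $\mathcal{L}_{\mathrm{KE}}$ is the knowledge embedding loss over (protein, relation, GO term) triples from ProteinKG25.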

<div align=center><img src="resources/img/main.jpg" width="60%" height="60%" /></div>

Requirements

<span id="requirements"></span> To run our code, please install dependency packages for related steps.

Environment for pre-training data generation

<span id="environment-for-pre-training-data-generation"></span> python3.8 / biopython 1.37 / goatools

For extracting the definitions of GO terms, we modified the code in the goatools library. The changes in goatools.obo_parser are as follows:

# line 132: parse the "def: " field and keep the raw definition text
elif line[:5] == "def: ":
    rec_curr.definition = line[5:]

# line 169: give each term record a default (empty) definition attribute
self.definition = ""
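With this change, the definition text becomes available on parsed GO terms. A minimal check (assuming a local go.obo file; the GO id below is only an example):

```python
from goatools.obo_parser import GODag

# Parse the ontology; after the modification above, each GO term record
# also carries the raw text of its "def:" field.
godag = GODag("go.obo")

term = godag["GO:0008150"]  # biological_process, used here only as an example
print(term.name)
print(term.definition)
```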

Environment for OntoProtein pre-training

<span id="environment-for-ontoprotein-pre-training"></span> python3.8 / pytorch 1.9 / transformer 4.5.1+ / deepspeed 0.5.1/ lmdb /

Environment for protein-related tasks

<span id="environment-for-protein-related-tasks"></span> python3.8 / pytorch 1.9 / transformer 4.5.1+ / lmdb / tape_proteins

Note that the tape_proteins library only implements the P@L metric for the contact prediction task. To report P@K for different values of K, where P@K is the precision over the top K predicted contacts, we made some changes to the library. The detailed changes can be seen in [issue #8].
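For reference, a minimal NumPy sketch of such a P@K computation (the function below is ours, not part of tape_proteins, and the minimum sequence separation of 6 residues is an assumption):

```python
import numpy as np

def precision_at_k(pred_probs, true_contacts, k, min_separation=6):
    """Precision over the k highest-scoring predicted contacts."""
    # Only score residue pairs (i, j) with j - i >= min_separation.
    i_idx, j_idx = np.triu_indices(pred_probs.shape[0], k=min_separation)
    scores = pred_probs[i_idx, j_idx]
    labels = true_contacts[i_idx, j_idx]
    top_k = np.argsort(-scores)[:k]  # indices of the k highest-scoring pairs
    return float(labels[top_k].mean())

# Toy example: P@L, P@L/2 and P@L/5 for a random length-100 protein.
L = 100
probs = np.random.rand(L, L)
contacts = (np.random.rand(L, L) < 0.05).astype(int)
for k in (L, L // 2, L // 5):
    print(f"P@{k}: {precision_at_k(probs, contacts, k):.3f}")
```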

Note: for the environment configuration of some baseline models and methods used in our experiments, e.g. BLAST and DeepGraphGO, we provide the related links below:

BLAST / Interproscan / DeepGraphGO / GNN-PPI

Data preparation

<span id="data-preparation"></span> For pretraining OntoProtein, fine-tuning on protein-related tasks and inference, we provide acquirement approach of related data.

Pre-training data

<span id="pre-training-data"></span> To incorporate Gene Ontology knowledge into language models and train OntoProtein, we construct ProteinKG25, a large-scale KG dataset with aligned descriptions and protein sequences respectively to GO terms and protein entities. There have two approach to acquire the pre-training data: 1) download our prepared data ProteinKG25, 2) generate your own pre-training data.

<div align=center><img src="resources/img/times.png" width="50%" height="50%" /></div>

Download released data

We have released our prepared data ProteinKG25 in Google Drive.

The compressed package includes the following files:

Generate your own pre-training data

To generate your own pre-training data, you need to download the following raw data:

Once these raw data are downloaded, you can execute the following script to generate the pre-training data:

python tools/gen_onto_protein_data.py

Downstream task data

<span id="downstream-task-data"></span> Our experiments involved with several protein-related downstream tasks. [Download datasets]

Protein pre-training model

<span id="protein-pre-training-model"></span> You can pre-training your own OntoProtein based above pretraining dataset. Before pretraining OntoProtein, you need to download two pretrained model, respectively ProtBERT and PubMedBERT and save them in data/model_data/ProtBERT and data/model_data/PubMedBERT. We provide the script bash script/run_pretrain.sh to run pre-training. And the detailed arguments are all listed in src/training_args.py, you can set pre-training hyperparameters to your need.

Usage for protein-related tasks

<span id="usage-for-protein-related-tasks"></span>

We have released the checkpoint of the pre-trained model on the Hugging Face model hub. [Download model].
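A minimal sketch of loading the released checkpoint with transformers (the hub id zjunlp/OntoProtein is our assumption; use the id from the download link above if it differs). As with ProtBert, the amino acids of the input sequence are separated by spaces:

```python
import re
from transformers import BertModel, BertTokenizer

model_id = "zjunlp/OntoProtein"  # assumed hub id, see the download link above
tokenizer = BertTokenizer.from_pretrained(model_id)
model = BertModel.from_pretrained(model_id)

# ProtBert-style input: space-separated residues, rare residues mapped to X.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
sequence = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(sequence, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, number of tokens, hidden size)
```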

Running examples

The shell scripts for training and evaluation on every task are provided in script/ and can be run directly. Alternatively, you can use run_downstream.py and write your own shell scripts as needed:

Training models

Run the shell scripts with bash script/run_{task}.sh; their contents are as follows:

bash run_main.sh \
    --model model_data/ProtBertModel \
    --output_file ss3-ProtBert \
    --task_name ss3 \
    --do_train True \
    --epoch 5 \
    --optimizer AdamW \
    --per_device_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --eval_step 100 \
    --eval_batchsize 4 \
    --warmup_ratio 0.08 \
    --frozen_bert False

Arguments for the training and evaluation script are as follows:

Additionally, you can set more detailed parameters in run_main.sh.

Notice: the best checkpoint is saved in OUTPUT_DIR/.

How to Cite

@inproceedings{zhang2022ontoprotein,
  title={OntoProtein: Protein Pretraining With Gene Ontology Embedding},
  author={Ningyu Zhang and Zhen Bi and Xiaozhuan Liang and Siyuan Cheng and Haosen Hong and Shumin Deng and Qiang Zhang and Jiazhang Lian and Huajun Chen},
  booktitle={International Conference on Learning Representations},
  year={2022},
  url={https://openreview.net/forum?id=yfe1VMYAXa4}
}