UDSMProt, universal deep sequence models for protein classification

UDSMProt is an algorithm for classifying proteins based on their amino acid sequence alone. Its key component is a self-supervised pretraining step based on a language modeling task. The pretrained model is then finetuned on specific classification tasks. In our paper we considered enzyme class classification, gene ontology prediction, and remote homology detection, showcasing the excellent performance of UDSMProt.
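
To make the language modeling task more concrete, here is a minimal sketch of how next-token prediction targets can be derived from a raw protein sequence without any labels. It assumes one token per amino acid and uses a fixed-size context window purely for illustration; `next_token_pairs` is a hypothetical helper and not part of this repository, and a recurrent model would condition on the full preceding sequence instead.

```python
def next_token_pairs(sequence, context=3):
    """Return (context window, next residue) pairs for a toy language modeling task.

    Assumption: one token per amino acid; the window size is illustrative only.
    """
    pairs = []
    for i in range(context, len(sequence)):
        # The model sees sequence[i-context:i] and is asked to predict sequence[i].
        pairs.append((sequence[i - context:i], sequence[i]))
    return pairs


# Toy sequence, not taken from any dataset.
pairs = next_token_pairs("MKTAYIAK", context=3)
for window, target in pairs:
    print(window, "->", target)
```

Because the targets come from the sequence itself, pretraining of this kind can use large unlabeled sequence collections before any task-specific labels are involved.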

For a detailed description of the method and the experimental results, please refer to our paper:

Nils Strodthoff, Patrick Wagner, Markus Wenzel, and Wojciech Samek, UDSMProt: universal deep sequence models for protein classification, Bioinformatics 36, no. 8, 2401-2409, 2020.

@article{Strodthoff:2019universal,
author = {Strodthoff, Nils and Wagner, Patrick and Wenzel, Markus and Samek, Wojciech},
title = "{UDSMProt: universal deep sequence models for protein classification}",
journal = {Bioinformatics},
volume = {36},
number = {8},
pages = {2401-2409},
year = {2020},
month = {01},
issn = {1367-4803},
doi = {10.1093/bioinformatics/btaa003},
}

An earlier preprint of this work is also available at bioRxiv. This is the accompanying code repository, where we also provide links to pretrained language models.

Also have a look at USMPep: Universal Sequence Models for Major Histocompatibility Complex Binding Affinity Prediction, which builds on the same framework.

Dependencies

For training/evaluation: pytorch, fastai, fire

For dataset creation: numpy, pandas, scikit-learn, biopython, sentencepiece, lxml

Installation

We recommend using conda as Python package and environment manager. Either install the environment using the provided proteomics.yml by running conda env create -f proteomics.yml or follow the steps below:

  1. Create conda environment: conda create -n proteomics and conda activate proteomics
  2. Install pytorch: conda install pytorch -c pytorch
  3. Install fastai: conda install -c fastai fastai=1.0.52
  4. Install fire: conda install fire -c conda-forge
  5. Install scikit-learn: conda install scikit-learn
  6. Install Biopython: conda install biopython -c conda-forge
  7. Install sentencepiece: pip install sentencepiece
  8. Install lxml: conda install lxml

Optionally, to support clusters at threshold 0.4, install cd-hit and add it to the default search path.
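
After installation, a quick check like the following can confirm that the environment is complete. This is a minimal sketch, not a script shipped with the repository: the module list simply mirrors the Dependencies section using each package's import name, and the helper names `missing_modules` and `has_cd_hit` are hypothetical.

```python
import importlib.util
import shutil

# Import names of the packages listed under Dependencies
# (e.g. pytorch is imported as "torch", biopython as "Bio").
REQUIRED_MODULES = [
    "torch", "fastai", "fire", "numpy", "pandas",
    "sklearn", "Bio", "sentencepiece", "lxml",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the subset of modules that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def has_cd_hit():
    """Return True if a cd-hit executable is on the search path (optional)."""
    return shutil.which("cd-hit") is not None


print("missing Python packages:", missing_modules())
print("cd-hit on search path:", has_cd_hit())
```

Run it inside the activated proteomics environment; an empty list of missing packages means all listed dependencies can be imported.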

Data

Swiss-Prot and UniRef

EC prediction

GO prediction

Remote Homology Detection

Data Preprocessing

cd code 
./create_datasets.sh

Basic Usage

We provide some basic usage information for the most common tasks:

Language model pretraining

cd code
python modelv1.py language_model --epochs=60 --lr=0.01 --working_folder=datasets/lm/lm_sprot_dirty/ --export_preds=False --eval_on_val_test=True

EC prediction

cd code
python modelv1.py classification --from_scratch=False --pretrained_folder=datasets/lm/lm_sprot_uniref_fwd --epochs=30 --metrics=["accuracy","macro_f1"] --lr=0.001 --lr_fixed=True --bs=32 --lr_slice_exponent=2.0 --working_folder=datasets/clas_ec/clas_ec_ec50_level1 --export_preds=True --eval_on_val_test=True

GO prediction

cd code
python modelv1.py classification --from_scratch=False --pretrained_folder=datasets/lm/lm_sprot_uniref_fwd --epochs=30 --lr=0.001 --lr_fixed=True --bs=32 --lin_ftrs=[1024] --lr_slice_exponent=2.0 --metrics=[] --working_folder=datasets/clas_go/clas_go_deepgoplus_2016 --export_preds=True --eval_on_val_test=True

Remote Homology Detection

cd code
python modelv1.py classification --from_scratch=False --pretrained_folder=datasets/lm/lm_sprot_uniref_fwd --epochs=10 --bs=128 --metrics=["binary_auc","binary_auc50","accuracy"] --early_stopping=binary_auc --bs=64 --lr=0.05 --fit_one_cycle=False --working_folder=datasets/clas_scop/clas_scop0 --export_preds=True --eval_on_val_test=True
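
When launching several runs from a script, the long command lines above can be assembled programmatically. The following is a minimal sketch under stated assumptions: `build_command` and `format_value` are hypothetical helpers that are not part of the repository, the flag names and values are copied from the commands above, and list values are written as `[a,b]`, which is what a shell typically passes on after stripping the double quotes from `["a","b"]`.

```python
import shlex


def format_value(value):
    """Render a flag value the way it appears on the command lines above."""
    if isinstance(value, (list, tuple)):
        # Lists are written as [a,b] without quotes or spaces.
        return "[" + ",".join(str(v) for v in value) + "]"
    return str(value)


def build_command(task, **flags):
    """Assemble an argument list for modelv1.py (hypothetical helper)."""
    cmd = ["python", "modelv1.py", task]
    cmd += [f"--{name}={format_value(value)}" for name, value in flags.items()]
    return cmd


cmd = build_command(
    "classification",
    from_scratch=False,
    pretrained_folder="datasets/lm/lm_sprot_uniref_fwd",
    epochs=30,
    metrics=["accuracy", "macro_f1"],
    lr=0.001,
    working_folder="datasets/clas_ec/clas_ec_ec50_level1",
    export_preds=True,
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))
# To actually launch it from the repository root, something like
# subprocess.run(cmd, cwd="code", check=True) could be used.
```

Keeping the flags in keyword form makes it easier to vary a single setting, such as the working folder, across runs.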

The output is logged in logfile.log in the working directory. For convenience, the final results are exported as result.npy. If export_preds is set to true, individual predictions are also exported as preds_valid.npz and preds_test.npz; these can be used, for example, for ensembling forward and backward models.
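
As an illustration of the ensembling use case, the sketch below averages the predictions of a forward and a backward model. It is not the repository's own ensembling code: `ensemble_mean` is a hypothetical helper, plain averaging is just one simple choice, and it assumes both models produce per-class scores for the same samples in the same order. The array names stored inside the exported .npz files are not documented here, so inspect them first (see the comment in the code) rather than relying on a guessed key.

```python
import numpy as np


def ensemble_mean(preds_fwd, preds_bwd):
    """Average two prediction arrays of identical shape (samples x classes)."""
    preds_fwd = np.asarray(preds_fwd, dtype=float)
    preds_bwd = np.asarray(preds_bwd, dtype=float)
    if preds_fwd.shape != preds_bwd.shape:
        raise ValueError("forward and backward predictions must have the same shape")
    return (preds_fwd + preds_bwd) / 2.0


# To work with exported files, first list the arrays they contain, e.g.
#   with np.load("path/to/exported_predictions.npz") as data:
#       print(data.files)
# and then pass the relevant arrays to ensemble_mean.

# Toy stand-ins for forward and backward model predictions (made-up numbers).
fwd = np.array([[0.8, 0.2], [0.4, 0.6]])
bwd = np.array([[0.6, 0.4], [0.2, 0.8]])

ensembled = ensemble_mean(fwd, bwd)
predicted_classes = ensembled.argmax(axis=1)
print(ensembled)
print(predicted_classes)
```

The shape check guards against accidentally combining predictions from different splits or datasets.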