<br/> <h1 align="center">ProtTrans</h1> <br/> <br/>

ProtTrans provides state-of-the-art pre-trained models for proteins. The models were trained on thousands of GPUs from Summit and hundreds of Google TPUs using various Transformer models.

Have a look at our paper ProtTrans: cracking the language of life’s code through self-supervised deep learning and high performance computing for more information about our work.

<br/> <p align="center"> <img width="70%" src="https://github.com/agemagician/ProtTrans/raw/master/images/transformers_attention.png" alt="ProtTrans Attention Visualization"> </p> <br/>

This repository will be updated regularly with new pre-trained models for proteins to support the bioinformatics community in general, and Covid-19 research specifically, through our Accelerate SARS-CoV-2 research with transfer learning using pre-trained language modeling models project.

Table of Contents

<a name="news"></a>

⌛️ News

<a name="install"></a>

🚀 Installation

All our models are available via huggingface/transformers:

pip install torch
pip install transformers
pip install sentencepiece

For more details, please follow the instructions for transformers installations.

A recently introduced change in the T5 tokenizer results in UnboundLocalError: cannot access local variable 'sentencepiece_model_pb2'. It can be fixed either by installing this PR or by manually installing:

pip install protobuf
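
Before loading a model, it can help to confirm that the packages listed above are importable. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` tuple are illustrative and not part of ProtTrans. Note that the `protobuf` package is imported as `google.protobuf`.

```python
from importlib.util import find_spec

# Import names of the packages installed above (protobuf is imported as google.protobuf).
REQUIRED = ("torch", "transformers", "sentencepiece", "google.protobuf")


def missing_packages(names=REQUIRED):
    """Return the import names from `names` that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised for dotted names when the parent package itself is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing packages:", missing_packages() or "none")
```

An empty result means all four packages can be imported; anything listed still needs a `pip install`.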

If you are using a transformers version after this PR, you will see this warning. Explicitly setting legacy=True results in the expected behavior and avoids the warning. You can also safely ignore the warning, as legacy=True is the default.
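
Setting the flag explicitly might look like the sketch below. It assumes a transformers version whose `T5Tokenizer` accepts the `legacy` keyword; the wrapper function `load_prott5_tokenizer` is a hypothetical helper, and the checkpoint name is the one used in the Quick Start.

```python
def load_prott5_tokenizer(model_name="Rostlab/prot_t5_xl_half_uniref50-enc"):
    """Load the ProtT5 tokenizer with the legacy behaviour requested explicitly."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import T5Tokenizer

    # legacy=True is already the default; passing it only silences the warning.
    return T5Tokenizer.from_pretrained(model_name, do_lower_case=False, legacy=True)
```

Calling `load_prott5_tokenizer()` downloads the tokenizer files on first use, so it needs network access or a populated local cache.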

<a name="quick"></a>

🚀 Quick Start

Example for how to derive embeddings from our best-performing protein language model, ProtT5-XL-U50 (aka ProtT5); also available as colab:

from transformers import T5Tokenizer, T5EncoderModel
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(device)

# half precision is currently only supported on GPUs; on CPU, cast to full precision (not recommended, much slower)
if device == torch.device("cpu"):
    model.to(torch.float32)

# prepare your protein sequences as a list
sequence_examples = ["PRTEINO", "SEQWENCE"]

# replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in sequence_examples]

# tokenize sequences and pad up to the longest sequence in the batch
ids = tokenizer(sequence_examples, add_special_tokens=True, padding="longest")

input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract residue embeddings for the first ([0,:]) sequence in the batch and remove padded & special tokens ([0,:7]) 
emb_0 = embedding_repr.last_hidden_state[0,:7] # shape (7 x 1024)
# same for the second ([1,:]) sequence but taking into account different sequence lengths ([1,:8])
emb_1 = embedding_repr.last_hidden_state[1,:8] # shape (8 x 1024)

# if you want to derive a single representation (per-protein embedding) for the whole protein
emb_0_per_protein = emb_0.mean(dim=0) # shape (1024)
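
The slice bounds `:7` and `:8` above are simply the residue counts of the two example sequences. A small sketch of how the preprocessing and those bounds could be computed for arbitrary inputs follows; the helper names `preprocess` and `residue_lengths` are illustrative and not part of ProtTrans.

```python
import re


def preprocess(sequences):
    """Map rare/ambiguous residues (U, Z, O, B) to X and space-separate all residues."""
    return [" ".join(re.sub(r"[UZOB]", "X", seq)) for seq in sequences]


def residue_lengths(sequences):
    """Number of residues per raw sequence, i.e. how many leading positions to keep."""
    return [len(seq) for seq in sequences]


raw = ["PRTEINO", "SEQWENCE"]
spaced = preprocess(raw)        # input for the tokenizer
lengths = residue_lengths(raw)  # slice bounds for the model output

# With the model output from the example above, the per-residue embeddings
# of sequence i would then be:
#   embedding_repr.last_hidden_state[i, :lengths[i]]
```

Computing the bounds from the raw sequences avoids hard-coding them and keeps the special and padding tokens out of every slice, whatever the batch composition.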

We also have a script which simplifies deriving per-residue and per-protein embeddings from ProtT5 for a given FASTA file:

python prott5_embedder.py --input sequences/some.fasta --output embeddings/residue_embeddings.h5
python prott5_embedder.py --input sequences/some.fasta --output embeddings/protein_embeddings.h5 --per_protein 1
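
For reference, the input is a plain FASTA file: a header line starting with `>` followed by one or more sequence lines. The sketch below illustrates that format with a small standard-library reader. It is not the parsing logic of `prott5_embedder.py`; the function `read_fasta` and the choice to use the first word of the header as the identifier are assumptions made for illustration.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            # Assumption: the identifier is the first word of the header line.
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = StringIO(">seq1 demo protein\nPRTEINO\n>seq2\nSEQ\nWENCE\n")
records = dict(read_fasta(example))
```

Sequences wrapped over several lines are joined back into a single string, which is the form the Quick Start example expects.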

<a name="models"></a>

⌛️ Models Availability

| Model | Hugging Face | Zenodo | Colab |
| --- | --- | --- | --- |
| ProtT5-XL-UniRef50 (also ProtT5-XL-U50) | Download | Download | Colab |
| ProtT5-XL-BFD | Download | Download | |
| ProtT5-XXL-UniRef50 | Download | Download | |
| ProtT5-XXL-BFD | Download | Download | |
| ProtBert-BFD | Download | Download | |
| ProtBert | Download | Download | |
| ProtAlbert | Download | Download | |
| ProtXLNet | Download | Download | |
| ProtElectra-Generator-BFD | Download | Download | |
| ProtElectra-Discriminator-BFD | Download | Download | |

<a name="datasets"></a>

⌛️ Datasets Availability

| Dataset | Dropbox |
| --- | --- |
| NEW364 | Download |
| Netsurfp2 | Download |
| CASP12 | Download |
| CB513 | Download |
| TS115 | Download |
| DeepLoc Train | Download |
| DeepLoc Test | Download |

<a name="usage"></a>

🚀 Usage

How to use ProtTrans:

<a name="feature-extraction"></a>

<a name="logits-extraction"></a>

<a name="fine-tuning"></a>

<a name="prediction"></a>

<a name="protein-generation"></a>

<a name="visualization"></a>

<a name="benchmark"></a>

<a name="results"></a>

📊 Original downstream Predictions

<a name="q3"></a>

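| Model | CASP12 | TS115 | CB513 |
| --- | --- | --- | --- |
| ProtT5-XL-UniRef50 | 81 | 87 | 86 |
| ProtT5-XL-BFD | 77 | 85 | 84 |
| ProtT5-XXL-UniRef50 | 79 | 86 | 85 |
| ProtT5-XXL-BFD | 78 | 85 | 83 |
| ProtBert-BFD | 76 | 84 | 83 |
| ProtBert | 75 | 83 | 81 |
| ProtAlbert | 74 | 82 | 79 |
| ProtXLNet | 73 | 81 | 78 |
| ProtElectra-Generator | 73 | 78 | 76 |
| ProtElectra-Discriminator | 74 | 81 | 79 |
| ProtTXL | 71 | 76 | 74 |
| ProtTXL-BFD | 72 | 75 | 77 |

🆕 Predict your sequence live on predictprotein.org.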

<a name="q8"></a>

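| Model | CASP12 | TS115 | CB513 |
| --- | --- | --- | --- |
| ProtT5-XL-UniRef50 | 70 | 77 | 74 |
| ProtT5-XL-BFD | 66 | 74 | 71 |
| ProtT5-XXL-UniRef50 | 68 | 75 | 72 |
| ProtT5-XXL-BFD | 66 | 73 | 70 |
| ProtBert-BFD | 65 | 73 | 70 |
| ProtBert | 63 | 72 | 66 |
| ProtAlbert | 62 | 70 | 65 |
| ProtXLNet | 62 | 69 | 63 |
| ProtElectra-Generator | 60 | 66 | 61 |
| ProtElectra-Discriminator | 62 | 69 | 65 |
| ProtTXL | 59 | 64 | 59 |
| ProtTXL-BFD | 60 | 65 | 60 |

🆕 Predict your sequence live on predictprotein.org.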

<a name="q2"></a>

| Model | DeepLoc |
| --- | --- |
| ProtT5-XL-UniRef50 | 91 |
| ProtT5-XL-BFD | 91 |
| ProtT5-XXL-UniRef50 | 89 |
| ProtT5-XXL-BFD | 90 |
| ProtBert-BFD | 89 |
| ProtBert | 89 |
| ProtAlbert | 88 |
| ProtXLNet | 87 |
| ProtElectra-Generator | 85 |
| ProtElectra-Discriminator | 86 |
| ProtTXL | 85 |
| ProtTXL-BFD | 86 |

<a name="q10"></a>

| Model | DeepLoc |
| --- | --- |
| ProtT5-XL-UniRef50 | 81 |
| ProtT5-XL-BFD | 77 |
| ProtT5-XXL-UniRef50 | 79 |
| ProtT5-XXL-BFD | 77 |
| ProtBert-BFD | 74 |
| ProtBert | 74 |
| ProtAlbert | 74 |
| ProtXLNet | 68 |
| ProtElectra-Generator | 59 |
| ProtElectra-Discriminator | 70 |
| ProtTXL | 66 |
| ProtTXL-BFD | 65 |

<a name="inaction"></a>

📊 Use-cases

| Level | Type | Tool | Task | Manuscript | Webserver |
| --- | --- | --- | --- | --- | --- |
| Protein | Function | Light Attention | Subcellular localization | Light attention predicts protein location from the language of life | (Web-server) |
| Residue | Function | bindEmbed21 | Binding Residues | Protein embeddings and deep learning predict binding residues for various ligand classes | (Coming soon) |
| Residue | Function | VESPA | Conservation & effect of Single Amino Acid Variants (SAVs) | Embeddings from protein language models predict conservation and variant effects | (Coming soon) |
| Protein | Structure | ProtTucker | Protein 3D structure similarity prediction | Contrastive learning on protein embeddings enlightens midnight zone at lightning speed | |
| Residue | Structure | ProtT5dst | Protein 3D structure prediction | Protein language model embeddings for fast, accurate, alignment-free protein structure prediction | |

<a name="comparison"></a>

📊 Comparison to other protein language models (pLMs)

While developing the use-cases, we compared ProtTrans models to other protein language models, for instance the ESM models. To focus on the effect of changing input representations, the following comparisons use the same architectures on top of different embedding inputs.

| Task/Model | ProtBERT-BFD | ProtT5-XL-U50 | ESM-1b | ESM-1v | Metric | Reference |
| --- | --- | --- | --- | --- | --- | --- |
| Subcell. loc. (setDeepLoc) | 80 | <b>86</b> | 83 | - | Accuracy | Light-attention |
| Subcell. loc. (setHard) | 58 | <b>65</b> | 62 | - | Accuracy | Light-attention |
| Conservation (ConSurf-DB) | 0.540 | <b>0.596</b> | 0.563 | - | MCC | ConsEmb |
| Variant effect (DMS-data) | - | <b>0.53</b> | - | 0.49 | Spearman (Mean) | VESPA |
| Variant effect (DMS-data) | - | <b>0.53</b> | - | <b>0.53</b> | Spearman (Median) | VESPA |
| CATH superfamily (unsup.) | 18 | <b>64</b> | 57 | - | Accuracy | ProtTucker |
| CATH superfamily (sup.) | 39 | <b>76</b> | 70 | - | Accuracy | ProtTucker |
| Binding residues | - | <b>39</b> | 32 | - | F1 | bindEmbed21 |

Important note on ProtT5-XL-UniRef50 (dubbed ProtT5-XL-U50): all performances were measured using only embeddings extracted from the encoder side of the underlying T5 model, as described here. Also, experiments were run in half-precision mode (model.half()) to speed up embedding generation. No performance degradation was observed in any of the experiments when running in half precision.

<a name="community"></a>

โค๏ธย  Community and Contributions

The ProtTrans project is a open source project supported by various partner companies and research institutions. We are committed to share all our pre-trained models and knowledge. We are more than happy if you could help us on sharing new ptrained models, fixing bugs, proposing new feature, improving our documentation, spreading the word, or support our project.

<a name="question"></a>

📫 Have a question?

We are happy to hear your questions on our ProtTrans issues page! If you have a private question or want to cooperate with us, you can always reach out to us directly via our RostLab email.

<a name="bug"></a>

🤝 Found a bug?

Feel free to file a new issue with a suitable title and description on the ProtTrans repository. If you have already found a solution to your problem, we would love to review your pull request!

<a name="requirements"></a>

✅ Requirements

For protein feature extraction or fine-tuning our pre-trained models, PyTorch and the Transformers library from Hugging Face are needed. For model visualization, you need to install the BertViz library.

<a name="team"></a>

🤵 Team

| Ahmed Elnaggar | Michael Heinzinger | Christian Dallago | Ghalia Rehawi | Burkhard Rost |
| :-: | :-: | :-: | :-: | :-: |
| <img width=120/ src="https://github.com/agemagician/ProtTrans/blob/master/images/ElnaggarAhmend.jpg?raw=true"> | <img width=120/ src="https://github.com/agemagician/ProtTrans/blob/master/images/MichaelHeinzinger-2.jpg?raw=true"> | <img width=120/ src="https://github.com/agemagician/ProtTrans/blob/master/images/christiandallago.png?raw=true"> | <img width=120/ src="https://github.com/agemagician/ProtTrans/blob/master/images/female.png?raw=true"> | <img width=120/ src="https://github.com/agemagician/ProtTrans/blob/master/images/B.Rost.jpg?raw=true"> |

| Yu Wang |
| :-: |
| <img width=120/ src="https://github.com/agemagician/ProtTrans/blob/master/images/yu-wang.jpeg?raw=true"> |

| Llion Jones |
| :-: |
| <img width=120/ src="https://github.com/agemagician/ProtTrans/blob/master/images/Llion-Jones.jpg?raw=true"> |

| Tom Gibbs | Tamas Feher | Christoph Angerer |
| :-: | :-: | :-: |
| <img width=120/ src="https://github.com/agemagician/ProtTrans/blob/master/images/Tom-Gibbs.png?raw=true"> | <img width=120/ src="https://github.com/agemagician/ProtTrans/blob/master/images/Tamas-Feher.jpeg?raw=true"> | <img width=120/ src="https://github.com/agemagician/ProtTrans/blob/master/images/Christoph-Angerer.jpg?raw=true"> |

| Martin Steinegger |
| :-: |
| <img width=120/ src="https://github.com/agemagician/ProtTrans/raw/master/images/Martin-Steinegger.png"> |

| Debsindhu Bhowmik |
| :-: |
| <img width=120/ src="https://github.com/agemagician/ProtTrans/blob/master/images/Debsindhu-Bhowmik.jpg?raw=true"> |

<a name="sponsors"></a>

💰 Sponsors

<!-- <div id="banner" style="overflow: hidden;justify-content:space-around;display:table-cell; vertical-align:middle; text-align:center"> <div class="" style="max-width: 20%;max-height: 20%;display: inline-block;"> <img width="14%" src="https://github.com/agemagician/ProtTrans/blob/master/images/1200px-Nvidia_image_logo.svg.png?raw=true" alt="nvidia logo"> </div> <div class="" style="max-width: 20%;max-height: 20%;display: inline-block;"> <img width="22%" src="https://github.com/agemagician/ProtTrans/blob/master/images/Google-Logo.jpg?raw=true" alt="google cloud logo"> </div> <div class="" style="max-width: 20%;max-height: 20%;display: inline-block;"> <img width="20%" src="https://github.com/agemagician/ProtTrans/blob/master/images/Oak_Ridge_National_Laboratory_logo.svg.png?raw=true" alt="ornl logo"> </div> <div class="" style="max-width: 20%;max-height: 20%;display: inline-block;"> <img width="12%" src="https://github.com/agemagician/ProtTrans/blob/master/images/SOFTWARE_CAMPUS_logo_cmyk.jpg?raw=true" alt="software campus logo"> </div> </div> -->
| Nvidia | Google | Google | ORNL | Software Campus |
| :-: | :-: | :-: | :-: | :-: |

<a name="license"></a>

📘 License

The ProtTrans pretrained models are released under the terms of the Academic Free License v3.0.

<a name="citation"></a>

✏️ Citation

If you use this code or our pretrained models for your publication, please cite the original paper:

@ARTICLE{9477085,
author={Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Yu, Wang and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and Bhowmik, Debsindhu and Rost, Burkhard},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing},
year={2021},
volume={},
number={},
pages={1-1},
doi={10.1109/TPAMI.2021.3095381}}