PPIHP

Protein Prediction for Interpretation of Hallucinated Proteins

This script allows ultra-fast (around 30 minutes per proteome) prediction of various protein properties using the protein language model ProtT5. For an overview of currently supported predictors, see the Outputs section.

Additionally, the script can generate completely novel protein sequences using ProtGPT2, which can then be filtered using the aforementioned predictions.

This tool, which ties together many existing predictors, was first introduced in "From sequence to function through structure: Deep learning for protein design".

Installation

Create a new virtual environment, e.g. using conda:

conda create -n PPIHP python=3.8
conda activate PPIHP

Without 3D structure prediction:

If you do not need 3D structure predictions, install only the minimal requirements (this avoids many dependencies):

pip install -r requirements_minimal.txt

With 3D structure prediction:

If you use CUDA 11, you can use the provided requirements.txt to install dependencies for all predictors:

pip install -r requirements.txt

If you use a different CUDA version, please use pip or conda to install the following packages:

torch (1.11)
dgl
pyg (aka torch-geometric)
e3nn
psutil
transformers
sentencepiece
biopython
matplotlib
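
After a manual installation, it can help to confirm that every package in the list above is importable. The following is a minimal sketch, not part of this repository: the helper name find_missing is hypothetical, the mapping uses the usual import names of these packages (for example, biopython is imported as Bio and pyg as torch_geometric), and it does not check versions such as torch 1.11.

```python
import importlib.util

# Package name as listed above -> module name used in an import statement.
REQUIRED_MODULES = {
    "torch": "torch",
    "dgl": "dgl",
    "pyg (torch-geometric)": "torch_geometric",
    "e3nn": "e3nn",
    "psutil": "psutil",
    "transformers": "transformers",
    "sentencepiece": "sentencepiece",
    "biopython": "Bio",
    "matplotlib": "matplotlib",
}


def find_missing(modules=REQUIRED_MODULES):
    """Return the package names whose top-level module cannot be located."""
    return [
        package
        for package, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = find_missing()
    print("missing packages:", ", ".join(missing) if missing else "none")
```

The check only locates the modules without importing them, so it is fast and does not initialize CUDA.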

Usage

If you pass an input FASTA, predictions for the given sequences will be generated and written to the directory defined by the output parameter (this will download around 3GB of model weights in total):

python prott5_batch_predictor.py --input example_output/pp_examples.fasta --output example_output

If you pass only an output directory and no input FASTA, the script will generate new random proteins using ProtGPT2 and produce predictions for those hallucinated proteins (this will download an additional 2.5GB of ProtGPT2 model weights):

python prott5_batch_predictor.py --output halluzination_analysis --n_gen 50

The parameter n_gen allows you to control the number of sequences to generate.

If you only need a subset of predictors, you can adjust which predictors to run using the --fmt parameter:

python prott5_batch_predictor.py --output halluzination_analysis --n_gen 50 --fmt ss,cons,dis,mem,bind,go,subcell,tucker,emb,ember3D

This allows you, for example, to activate or deactivate 3D structure prediction. See --help for more information on the output format. By default, all predictors except 3D structure prediction and per-protein embeddings are written; those two quickly generate large amounts of data when applied to millions of proteins and should be used with caution (default: --fmt ss,cons,dis,mem,bind,go,subcell,tucker).
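
When the script is driven from another Python program, the command lines above can be assembled programmatically. The sketch below is an illustration only: build_command is a hypothetical helper, and it uses just the flags shown in this README (--input, --output, --n_gen, --fmt). It assumes it is run from the repository root so that prott5_batch_predictor.py is found.

```python
import subprocess
import sys


def build_command(output_dir, input_fasta=None, n_gen=None, fmt=None):
    """Assemble the argument list for prott5_batch_predictor.py.

    fmt is a list of predictor keys, e.g. ["ss", "cons", "dis"]; it is
    joined with commas as expected by the --fmt parameter.
    """
    cmd = [sys.executable, "prott5_batch_predictor.py", "--output", output_dir]
    if input_fasta is not None:
        cmd += ["--input", input_fasta]
    if n_gen is not None:
        cmd += ["--n_gen", str(n_gen)]
    if fmt:
        cmd += ["--fmt", ",".join(fmt)]
    return cmd


if __name__ == "__main__":
    # Generate 50 sequences and run only four of the predictors.
    cmd = build_command(
        "halluzination_analysis", n_gen=50, fmt=["ss", "cons", "dis", "mem"]
    )
    print(" ".join(cmd))
    # Uncomment to actually launch the predictor (downloads model weights):
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.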

Reproducibility

The datasets used for the analysis in the manuscript are available at: http://data.bioembeddings.com/public/design/. Place them into a folder called private inside this repo to run the Jupyter Notebooks. Predictions were generated on a server with an Intel Xeon Gold 6248 CPU, a Quadro RTX 8000 (48GB vRAM) GPU and 400GB RAM DDR4 ECC (OS=Ubuntu).

Outputs

The current script generates a wealth of information for each input protein sequence. Every predictor generates one output file. In general, all files are written such that each line holds information on a single protein, and the order of proteins is identical across files. The ids.txt file explained below lets you trace which prediction refers to which protein. The 3D structure prediction is an exception to this scheme, as it writes one PDB file per input protein. The following files are currently generated:

General:

3D Structure prediction:

Predictions available for each residue in a protein:

Predictions available for each protein:
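
Because the line-based output files share the same protein order as ids.txt, a prediction can be matched to its protein by line position. The following is a minimal sketch under two assumptions that are not confirmed here: ids.txt holds exactly one identifier per line, and the prediction file (called predictions.txt below, a placeholder name) holds exactly one line per protein. The helper pair_ids_with_predictions is hypothetical.

```python
import tempfile
from pathlib import Path


def pair_ids_with_predictions(ids_path, pred_path):
    """Return {protein_id: prediction_line}, relying on identical line order."""
    ids = Path(ids_path).read_text().splitlines()
    preds = Path(pred_path).read_text().splitlines()
    if len(ids) != len(preds):
        # A mismatch means the two files do not describe the same proteins.
        raise ValueError(f"{len(ids)} ids but {len(preds)} prediction lines")
    return dict(zip(ids, preds))


if __name__ == "__main__":
    # Self-contained demo on made-up data; real files live in the output directory.
    with tempfile.TemporaryDirectory() as tmp:
        ids_file = Path(tmp) / "ids.txt"
        pred_file = Path(tmp) / "predictions.txt"  # placeholder file name
        ids_file.write_text("protein_A\nprotein_B\n")
        pred_file.write_text("HHHEEELLL\nLLLLHHHH\n")
        for protein_id, line in pair_ids_with_predictions(ids_file, pred_file).items():
            print(protein_id, line)
```

The length check fails early if a file was truncated or produced by a different run, which would otherwise silently shift predictions to the wrong proteins.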

Citations

@article{9477085,
author={Elnaggar, Ahmed and Heinzinger, Michael and Dallago, Christian and Rehawi, Ghalia and Yu, Wang and Jones, Llion and Gibbs, Tom and Feher, Tamas and Angerer, Christoph and Steinegger, Martin and Bhowmik, Debsindhu and Rost, Burkhard},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing},
year={2021},
volume={},
number={},
pages={1-1},
doi={10.1109/TPAMI.2021.3095381}
}
@article{Ferruz2022.03.09.483666,
	author = {Ferruz, Noelia and Schmidt, Steffen and H{\"o}cker, Birte},
	title = {A deep unsupervised language model for protein design},
	elocation-id = {2022.03.09.483666},
	year = {2022},
	doi = {10.1101/2022.03.09.483666},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2022/03/12/2022.03.09.483666},
	eprint = {https://www.biorxiv.org/content/early/2022/03/12/2022.03.09.483666.full.pdf},
	journal = {bioRxiv}
}
@article {Bernhofer2022.06.12.495804,
	author = {Bernhofer, Michael and Rost, Burkhard},
	title = {TMbed {\textendash} Transmembrane proteins predicted through Language Model embeddings},
	elocation-id = {2022.06.12.495804},
	year = {2022},
	doi = {10.1101/2022.06.12.495804},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2022/06/15/2022.06.12.495804},
	eprint = {https://www.biorxiv.org/content/early/2022/06/15/2022.06.12.495804.full.pdf},
	journal = {bioRxiv}
}
@article{Marquet2021,
  doi = {10.1007/s00439-021-02411-y},
  url = {https://doi.org/10.1007/s00439-021-02411-y},
  year = {2021},
  month = dec,
  publisher = {Springer Science and Business Media {LLC}},
  author = {C{\'{e}}line Marquet and Michael Heinzinger and Tobias Olenyi and Christian Dallago and Kyra Erckert and Michael Bernhofer and Dmitrii Nechaev and Burkhard Rost},
  title = {Embeddings from protein language models predict conservation and variant effects},
  journal = {Human Genetics}
}
@article {Ilzhoefer2022.06.23.497276,
	author = {Ilzhoefer, Dagmar and Heinzinger, Michael and Rost, Burkhard},
	title = {SETH predicts nuances of residue disorder from protein embeddings},
	elocation-id = {2022.06.23.497276},
	year = {2022},
	doi = {10.1101/2022.06.23.497276},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2022/06/26/2022.06.23.497276},
	eprint = {https://www.biorxiv.org/content/early/2022/06/26/2022.06.23.497276.full.pdf},
	journal = {bioRxiv}
}
@article{littmann2021embeddings,
  title={Embeddings from deep learning transfer GO annotations beyond homology},
  author={Littmann, Maria and Heinzinger, Michael and Dallago, Christian and Olenyi, Tobias and Rost, Burkhard},
  journal={Scientific reports},
  volume={11},
  number={1},
  pages={1--14},
  year={2021},
  publisher={Nature Publishing Group}
}
@article{10.1093/nargab/lqac043,
    author = {Heinzinger, Michael and Littmann, Maria and Sillitoe, Ian and Bordin, Nicola and Orengo, Christine and Rost, Burkhard},
    title = "{Contrastive learning on protein embeddings enlightens midnight zone}",
    journal = {NAR Genomics and Bioinformatics},
    volume = {4},
    number = {2},
    year = {2022},
    month = {06},
    issn = {2631-9268},
    doi = {10.1093/nargab/lqac043},
    url = {https://doi.org/10.1093/nargab/lqac043},
    note = {lqac043},
    eprint = {https://academic.oup.com/nargab/article-pdf/4/2/lqac043/44245898/lqac043.pdf},
}
@article{10.1093/bioadv/vbab035,
    author = {Stärk, Hannes and Dallago, Christian and Heinzinger, Michael and Rost, Burkhard},
    title = "{Light attention predicts protein location from the language of life}",
    journal = {Bioinformatics Advances},
    volume = {1},
    number = {1},
    year = {2021},
    month = {11},
    issn = {2635-0041},
    doi = {10.1093/bioadv/vbab035},
    url = {https://doi.org/10.1093/bioadv/vbab035},
    note = {vbab035},
    eprint = {https://academic.oup.com/bioinformaticsadvances/article-pdf/1/1/vbab035/41640353/vbab035.pdf},
}

@article{littmann2021protein,
  title={Protein embeddings and deep learning predict binding residues for various ligand classes},
  author={Littmann, Maria and Heinzinger, Michael and Dallago, Christian and Weissenow, Konstantin and Rost, Burkhard},
  journal={Scientific Reports},
  volume={11},
  number={1},
  pages={1--15},
  year={2021},
  publisher={Nature Publishing Group}
}
@software{Weissenow_EMBER3D_2022,
  author = {Weissenow, Konstantin and Heinzinger, Michael and Rost, Burkhard},
  doi = {10.5281/zenodo.6837687},
  month = {7},
  title = {{EMBER3D}},
  url = {https://github.com/kWeissenow/EMBER3D},
  year = {2022}
}