DeNovoFoldDesign

Motivation: Facing data quickly accumulating on protein sequence and structure, this study is addressing the following question: to what extent could current data alone reveal deep insights into the sequence-structure relationship, such that new sequences can be designed accordingly for novel structure folds?

Results: We have developed novel deep generative models, constructed low-dimensional and generalizable representation of fold space, exploited sequence data with and without paired structures, and developed ultra-fast fold predictor as an oracle providing feedback. The resulting semi-supervised gcWGAN is assessed with the oracle over 100 novel folds not in the training set and found to generate more yields and cover 3.6 times more target folds compared to a competing data-driven method (cVAE). Assessed with structure predictor over representative novel folds (including one not even part of basis folds), gcWGAN designs are found to have comparable or better fold accuracy yet much more sequence diversity and novelty than cVAE. gcWGAN explores uncharted sequence space to design proteins by learning from current sequence-structure data. The ultra-fast data-driven model can be a powerful addition to principle-driven design methods through generating seed designs or tailoring sequence space.

Training-Process


Pre-requisite

* Anaconda 2.

* Environments:

To build the environments for this project, go to the Environments folder and run:

conda env create -f tensorflow_training.yml
conda env create -f DeepDesign_acc.yml

For the oracle (modified DeepSF), add the following function in the file <path where keras was installed>/keras/activations.py.

def leakyrelu(x, alpha=0.1, max_value=None):
    return K.relu(x, alpha=alpha, max_value=max_value)

* Backend of Keras:

This project uses two Keras backends, Theano and TensorFlow. The active backend is set in the file ~/.keras/keras.json.

TensorFlow backend:

{
    "epsilon": 1e-07,
    "floatx": "float32",
    "image_dim_ordering": "tf",
    "image_data_format": "channels_last",
    "backend": "tensorflow"
}

Theano backend:

{
    "epsilon": 1e-07,
    "floatx": "float32",
    "image_dim_ordering": "tf",
    "image_data_format": "channels_last",
    "backend": "theano"
}
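Switching between the two configurations by hand is easy to get wrong, so a small helper can do it. The following is a minimal sketch, not part of this repository: the function name `set_keras_backend` is hypothetical, and it assumes the config file sits at `~/.keras/keras.json` unless another path is given. It only rewrites the `"backend"` field and leaves the other keys as they are.

```python
import json
import os


def set_keras_backend(backend, config_path=None):
    """Set the "backend" field of a keras.json file, keeping all other keys."""
    if backend not in ("tensorflow", "theano"):
        raise ValueError("backend must be 'tensorflow' or 'theano'")
    if config_path is None:
        # Assumption: the config file lives in the user's home directory.
        config_path = os.path.join(os.path.expanduser("~"), ".keras", "keras.json")

    # Load the existing settings if the file is already there.
    config = {}
    if os.path.exists(config_path):
        with open(config_path) as handle:
            config = json.load(handle)

    config["backend"] = backend

    # Create the parent folder if needed, then write the file back.
    parent = os.path.dirname(config_path)
    if parent and not os.path.isdir(parent):
        os.makedirs(parent)
    with open(config_path, "w") as handle:
        json.dump(config, handle, indent=4)
    return config
```

For example, `set_keras_backend("theano")` would be called before running a Theano-based script and `set_keras_backend("tensorflow")` before a TensorFlow-based one. The change takes effect the next time Keras is imported.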

* Check Points:

Our check points were obtained after 100 epochs of training. If you have already downloaded our check points and then retrain the model with the same hyper-parameters, the downloaded ones may be overwritten once the training process reaches the 100th epoch.
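To avoid losing the downloaded check points when retraining, they can be copied aside first. The following is a minimal sketch using only the standard library. The function name `backup_checkpoints` and the `_backup_<tag>` naming scheme are hypothetical, and the location of the check point folder is left to the caller because it depends on which model is being trained.

```python
import os
import shutil
import time


def backup_checkpoints(checkpoint_dir, tag=None):
    """Copy a check point folder to '<folder>_backup_<tag>' and return the new path."""
    checkpoint_dir = os.path.normpath(checkpoint_dir)
    if not os.path.isdir(checkpoint_dir):
        raise IOError("no such check point directory: " + checkpoint_dir)
    # Default tag is a timestamp so repeated backups do not collide.
    tag = tag or time.strftime("%Y%m%d_%H%M%S")
    backup_dir = "{}_backup_{}".format(checkpoint_dir, tag)
    # copytree raises an error if backup_dir already exists, so nothing is overwritten.
    shutil.copytree(checkpoint_dir, backup_dir)
    return backup_dir
```

Calling this once on the downloaded check point folder before starting a 100-epoch run keeps the original files available for comparison afterwards.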


Table of contents:


Model Application

In this part you can apply our models to generate protein sequences for a given protein fold (a *.pdb file). The scripts represent the given fold as a 20-dimensional vector and pass it to the generator for sequence generation. Go to the Model_Apply folder for more details.

Some examples of the generated sequences (10 sequences based on gcWGAN that pass the oracle):

>1
MIAPDQTIEKYVKFMAPVFTTTEYLKIVEMEEKGITTIAHGPVIHTARNPYAEVRLVSVTHELLIELQASGFLNISKTICLFETGIDENKEVLIDKDDYKEEPLLVDLFLEMEGPMDGQEIMTKLVRVPVMGQSLKPYAVKKAGVIKSAKHVG
>2
PCYALTVEAVENLLQAPAVRTLQKDEGLTPRLQPGIAAYASFIAGGAGCGLTRGSSDNMAKALIQEIEKTLRAVELTPATVQILVNNNEVKLPEKEKPNAIAKGILTVNLISKMDEFTKLVLVGENYTAILIDHIAKHKVGPV
>3
MCYDIAQSYLNFMMINGTVLIQTATRTLCPAVHSACRYDYIKVTAAKGNIVTDIGLMYFVRNMELVGPLMTATVAISKSIYTVQKATKETVNEMRTLQVAGTRTMFCRIYHVDMTKMMMQTGISIVGEKKPTRHDAEITYDQLAGHLVPLAHLKKL
>4
CTKAQRGVHKIYEVEKNYMPNRTLGDPNSLRIDSIGIRPVNERKDNTRYVAKKAKAILAKKDIMYCLPINIDVVKVTSTLDNYLDGDPYSKRPRFDDNLIKAVIPTDVALKPSPRYDVQAGRETPPAYTAVVQRFFSVKLNRL
>5
CPNVYQKLLYSMTEGPMDIGPVEVGQLLAVIPSAIGKVVSEITTSVHPAAPFEEAARVTAMAQRAALQYSTQTYLVGKESIALMYGKYRALHQDLARMVLADGQTADVQEVVPIIADIQRMHPAGQVAPRLIESGVVTASVLMTAA
>6
LLHGKLEVFHKCVAKADEASGLTFFHCGCSAYVTSEAAKGRYRPRACSTVHYFEKGATIPGLQYTNMYENAMVCTSKIRIYLEAMNMAPNVPLHRAAKYDNVSAALTANNNKVALIAEYYVTALLEGEVTQHLEEYKKNPPPELYEEIC
>7
MNKINIKYCPFNFNKVFRKEAFITQMAGENMAVLKELSEQIDHCSCFHKNTARQLLHRAEDGPVTEVETLLELRAAMICCFRRRAPRLVLGSSMSTTVITKCIAICTGQPYPGNGPPTTLGQPACSGVEVINNQAAIVIQTVEQRFILMTPGK
>8
CTVTAVQEFTENYGGLPLYVTRNQTLAPADKRLTPRYAGNFPEGAEVPAPNLAQTSPGVTYGKNIGRYLKNGLPDVAICTSPNLNLSGAYPDIVKYNYQQPEVFIRQYHPGNEMDVVKALEQFSSELLPGKTMSIVVNSYNNLADK
>9
CETTIDIEASVISQVIAVIVALTPIHKYAHASSKALASGASDVNVGPKLVAYIGKIAYSDPPIDLIPPVKVVVALLAPELAGVTAADYISYNEGKPATGESAGNAAFADGTTTIAPQRTIYEGEHKARINIITIADGAPLGSHEIP
>10
PEPDLVLTCTNLSFSAMVSCLRETSAFAGVEYAYNGIHPAGSCCLAAMKKGFFPHTEGMNALVIEPTPPVPCAPTKDLVQNKIQKAKLLPPAATTADEYSETLGQEDFLKLLTNPKITEKKKSPTTLILVTVNSELMISPVYFTGPLMKELLYHCNGEN
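The sequences above are listed in FASTA-style form (a `>` header line followed by the sequence). For downstream analysis they can be read with a few lines of Python. The sketch below is illustrative only and is not one of the repository scripts: `read_fasta` and `is_valid_protein` are hypothetical helper names, and validity is assumed here to mean "contains only the 20 standard amino acid letters".

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def is_valid_protein(sequence):
    """Return True if the sequence is non-empty and uses only standard residues."""
    return bool(sequence) and set(sequence) <= AMINO_ACIDS
```

With the ten examples saved to a text file, `read_fasta(open(path).read())` would return ten `(name, sequence)` pairs whose lengths can then be inspected with `len(sequence)`.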

Training Process:

The cWGAN and gcWGAN folders contain the scripts for training our two models. The cWGAN folder also contains scripts for validation (hyper-parameter tuning), and the gcWGAN folder also contains scripts for the Warmstart. Go to the cWGAN folder or gcWGAN folder for more details.

Some examples of the generated sequences during the training process:

fold a.39: vvaitfdnvhfpcshapltkaltvkklqvsannvsllvfddakmtkkidiekaikgfymmknnpqaqleiierftpttrgkpvikpiasftltspeilgkegykk!!!!!!!!!!!!!!!!!!!itkmlidavks!!!!!!!!!!!!!!!!!!!!!!!!!
fold d.78: leemskvgntpaltyreardvavigifnngkqmksrddvtdeaddyqceidpisnllelgallpplhvaetkmllyykneakmhlfegag!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
fold d.240: tlaedippklpleveqcneiivdaqnkryvavgealllitcpmlqnnsmsttcgyrfeakskdgvicespeeglqndtthyachkraaavqiptekkttvyrlhacttklegcaeadnrvladvgldgivqravcdivttfsaevnp!!!!!!!!!!!!!
fold d.227: sckpglplvcagkkstyleklltgylvyslladyispkaleeavisekkpniampafatmpslvaddvtaliakkglqnaakcpndhmeiyeaeedpaiigqgynkhqgvgcnivvmagaipdeqkvenlrsliei!!!!!!!!!!!!!!!!!!!!!!!!
fold d.301: mtakstvqlpaeykgqniaeilnnvafnlaaivysattivayramacfpcgeknykeilgkvltlfidkhpiqnnr!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
fold d.223: mqtyeeavtlgltneqqtgknvtpiniaeekllvtnglvcqapalpvneevliklsentdnikpllciigkkseaispcsfraeeafdrsadymankatimcrkgnyaiilhsdgeellaihqtsgviirlghvpgkknrymppgaliplcngp!!!!!!
fold a.216: eelakrmiqrapdveligknkiatelkrlcllirgqtaanimnvillcataisvipkkskpasqyeetvnpadlakeiilqekkeaftriltteylvtsllkmypvhkvpkp!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
fold b.60: qpifvtykrlnrlallkshplhkdpkyltavlvmeldpsslpvavqpqrvvtiqsccpiiepsappeecdiqapnklkallendkptsqn!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
fold c.9: ayfereelilltpthggnepktldiptnlpakilgtplrkvklasqkellgeahpnnavstlideayylgdeqrevvvlteqekkagpidithyvngtegsckkpnisdsptphakafkqilkemqariqhhkelittalerlkn!!!!!!!!!!!!!!!
fold a.180: mpeecvctglepgevrrqngvipllnqgfhavltpagktylccttatknqvivhmfcqtaaeniyaeitvsylrtaatstylefmkhccqnvssihygiymslmdllkeyvveklv!e!!!!!!!!!!!!!!!!!!!iaeqipearkyaaalvg!!!!!!
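The training samples above are lower-case and followed by runs of `!`. The sketch below turns one such line into a fold label and a clean upper-case sequence. It is not taken from the repository scripts and rests on two labeled assumptions: that `!` is a padding symbol, and that the sequence ends at the first `!` (so any residues after it, as in the a.39 and a.180 samples, are discarded). The function name `clean_training_sample` is hypothetical.

```python
def clean_training_sample(line, pad="!"):
    """Split 'fold X: seq!!!' into (fold, SEQUENCE) with padding removed."""
    label, _, raw = line.partition(":")
    # Drop the leading word 'fold' to keep only the fold identifier, e.g. 'a.39'.
    fold = label.replace("fold", "", 1).strip()
    raw = raw.strip()
    # Assumption: everything from the first padding symbol onward is ignored.
    sequence = raw.split(pad, 1)[0].upper()
    return fold, sequence
```

If the other convention is preferred (dropping the padding symbols but keeping every residue), the truncation line could be replaced with `raw.replace(pad, "").upper()`.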

Evaluate Model Performance:

This part contains the scripts we used to evaluate the performance of our models. We also generated sequences with the previous state-of-the-art model, cVAE, and applied the same evaluation methods for comparison. Model evaluation consists of three parts: model accuracy, sequence generation rate, and sequence diversity and novelty. For model accuracy we calculated the yield ratio for all the training, validation and test folds. Go to the Model_Evaluation folder for more details.
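As a rough illustration of the accuracy metric, the sketch below computes a yield ratio from a list of oracle decisions. It assumes that the yield ratio for a fold is the fraction of generated sequences that pass the oracle for that fold; the exact definition used in the evaluation is in the Model_Evaluation scripts. The function name `yield_ratio` and the boolean input format are hypothetical.

```python
def yield_ratio(oracle_passes):
    """Fraction of generated sequences that passed the oracle.

    oracle_passes: iterable of booleans, one per generated sequence,
    True if the oracle accepted the sequence for the target fold.
    """
    passes = list(oracle_passes)
    if not passes:
        # No sequences generated: report zero rather than dividing by zero.
        return 0.0
    return sum(1 for passed in passes if passed) / float(len(passes))
```

For example, three accepted sequences out of four generated gives `yield_ratio([True, False, True, True])`, which evaluates to 0.75.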


Citation:

@article{gcWGAN,
author = {Karimi, Mostafa and Zhu, Shaowen and Cao, Yue and Shen, Yang},
title = {De Novo Protein Design for Novel Folds Using Guided Conditional Wasserstein Generative Adversarial Networks},
journal = {Journal of Chemical Information and Modeling},
volume = {60},
number = {12},
pages = {5667-5681},
year = {2020},
doi = {10.1021/acs.jcim.0c00593},
note ={PMID: 32945673},
URL = {https://doi.org/10.1021/acs.jcim.0c00593},
eprint = {https://doi.org/10.1021/acs.jcim.0c00593}
}

Contacts:

Yang Shen: yshen@tamu.edu

Mostafa Karimi: mostafa_karimi@tamu.edu

Shaowen Zhu: shaowen1994@tamu.edu

Yue Cao: cyppsp@tamu.edu