
dom2vec: Protein domain embeddings

Please note: this repository is a work in progress; each folder marked WIP will be updated soon.

All protein domain analyses use the data from InterPro version 75.0. All associated data can be found on the FTP site for version 75.0, accessible from the general download site. The full analysis is depicted in the following image:

Summary of approach

Summary of the approach and code, divided into four parts: building two forms of domain architectures, training domain embeddings, and performing intrinsic and extrinsic evaluation of the embeddings.

Main dependencies

The code was run in a conda environment; the full list of dependencies is in conda_env_dependencies.txt.

The main dependencies are listed below:

Python 3.7.6
BioPython 1.74
Gensim 3.8.0
PyTorch 1.2.0
Torchtext 0.4.0
NumPy 1.18.1
Pandas 1.0.1
Scikit-learn 0.22.1
Matplotlib 3.1.1
Intervaltree 3.0.2
Treelib 1.5.5

Build protein domain architectures

Data acquisition:

For InterPro version 75.0, download the files:

match_complete.xml.gz
protein2ipr.dat.gz

Get protein lengths by parsing match_complete.xml:

change the folder/file paths appropriately in proteinXMLHandler_run.py
run proteinXMLHandler_run.py
prot_id_len tabular file will be created; a sample of the first 100 lines of the full file is saved at sample file
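
As an illustration of this step, the sketch below streams through an XML document and emits one tab-separated `protein id, length` row per protein. It is not the repository's proteinXMLHandler_run.py; the element name `protein` and the attribute names `id` and `length` are assumptions about the layout of match_complete.xml, and the sample identifiers are made up.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for match_complete.xml. The element and attribute names
# ("protein", "id", "length") are assumptions about the real file layout,
# and the identifiers are invented for illustration.
SAMPLE_XML = b"""<interpromatch>
  <protein id="P00001" name="EXAMPLE_1" length="120">
    <match id="PF00001" name="hypothetical"/>
  </protein>
  <protein id="P00002" name="EXAMPLE_2" length="87"/>
</interpromatch>"""


def iter_protein_lengths(xml_stream):
    """Yield (protein_id, length) pairs while streaming through the XML."""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag == "protein":
            yield elem.get("id"), int(elem.get("length"))
            elem.clear()  # drop the finished subtree's children to limit memory use


rows = list(iter_protein_lengths(io.BytesIO(SAMPLE_XML)))
for prot_id, length in rows:
    print(f"{prot_id}\t{length}")
```

For the full file, the same function could be given a handle from `gzip.open(path, "rb")` instead of the in-memory sample.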

Get domains and evidence db id per protein:

select the output domain annotation type: overlapping, non-overlapping or non-redundant. Then set whether the GAP domain is also added to the annotations. Change the folder/file paths appropriately and uncomment the first section in main.py
parse the domain hits per protein by running main.py
id_domains_type.tab file will be created; a sample of the first 100 lines of the full file, for non-overlapping with GAP, is saved at sample file
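
To make the "non-overlapping with GAP" annotation type more concrete, here is a minimal sketch of one possible way to derive it from domain hits. The overlap rule (keep the earliest-starting hit, skip any hit that overlaps it), the `min_gap` threshold of 30 residues, and the literal token `GAP` are all assumptions for illustration; the rules implemented in main.py may differ.

```python
GAP = "GAP"  # placeholder token; the repository's actual token may differ


def non_overlapping_with_gap(hits, min_gap=30):
    """Return an ordered domain list for one protein.

    hits: list of (start, end, domain_id) with 1-based inclusive coordinates.
    min_gap: assumed minimum number of uncovered residues that earns a GAP token.
    """
    # Sort by start position; for equal starts, prefer the longer hit.
    ordered = sorted(hits, key=lambda h: (h[0], -(h[1] - h[0])))

    # Greedy pass: keep a hit only if it starts after the last kept hit ends.
    kept = []
    for start, end, dom in ordered:
        if not kept or start > kept[-1][1]:
            kept.append((start, end, dom))

    # Insert a GAP token wherever enough residues lie between kept hits.
    annotated = []
    prev_end = 0
    for start, end, dom in kept:
        if start - prev_end - 1 >= min_gap:
            annotated.append(GAP)
        annotated.append(dom)
        prev_end = end
    return annotated


# Hypothetical hits for a single protein.
hits = [(5, 60, "IPR000001"), (40, 90, "IPR000002"), (150, 200, "IPR000003")]
architecture = non_overlapping_with_gap(hits)
print(architecture)
```

This sketch ignores a possible gap between the last domain and the end of the protein, which would need the protein length from the prot_id_len file.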

Get domain architecture corpus:

change the folder/file paths appropriately and uncomment the first section in main.py
run main.py
domains_corpus_type.txt file will be created; a sample of the first 100 lines of the full file, for non-overlapping with GAP, is saved at sample file
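
A corpus for word2vec-style training is commonly stored as one "sentence" per line with whitespace-separated tokens. Assuming the domain architecture corpus follows that convention (one protein per line, domains in order of position), writing it could be sketched as follows; the row layout and identifiers are hypothetical.

```python
import io

# Hypothetical rows: protein id, then its ordered domain architecture.
rows = [
    ("P00001", ["IPR000001", "GAP", "IPR000003"]),
    ("P00002", ["IPR000002"]),
]


def write_corpus(rows, handle):
    """Write one space-separated domain architecture per line."""
    for _protein_id, domains in rows:
        handle.write(" ".join(domains) + "\n")


buffer = io.StringIO()  # stands in for an open domains_corpus_type.txt file
write_corpus(rows, buffer)
corpus_text = buffer.getvalue()
print(corpus_text, end="")
```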

Train protein domain embeddings

Needed data:

the domains_corpus_type.txt file from the previous step

Train a word2vec model on the domain architecture corpus:

change the folder/file paths appropriately in word2vec_run.py
change the paths and the training parameters in the provided bash script run_embs.sh
run run_embs.sh
word2vec embedding standard txt file(s) will be created
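
The standard word2vec text format starts with a header line holding the vocabulary size and the vector dimension, followed by one line per token with its vector components. A dependency-free reader for such a file might look like the sketch below; the sample tokens and values are invented.

```python
import io


def read_word2vec_txt(handle):
    """Parse the word2vec text format into a {token: [floats]} dictionary."""
    count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        if not line.strip():
            continue  # tolerate blank lines at the end of the file
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values, got {len(parts) - 1}")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    if len(vectors) != count:
        raise ValueError(f"header announced {count} vectors, found {len(vectors)}")
    return vectors


# Invented two-token example with three-dimensional vectors.
sample = io.StringIO("2 3\nIPR000001 0.1 0.2 0.3\nGAP -0.5 0.0 0.25\n")
embeddings = read_word2vec_txt(sample)
print(sorted(embeddings))
```

With Gensim installed, `KeyedVectors.load_word2vec_format(path, binary=False)` reads the same format and additionally provides nearest-neighbor queries.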

Intrinsic evaluation

Data and example experiments are provided for:

Domain hierarchy

Data acquisition:
- For InterPro version 75.0, download the ParentChildTreeFile.txt file
Parse the parent-child relation:
- uncomment the domain hierarchy section in intrinsic_eval_run.py
- parse the parent-child relations using parse_parent_child_file()
- interpro_parsed_tree.txt will be created; the first 3 InterPro parents of the full parsed tree are saved at sample file
Run evaluation
- run the evaluation with the rest of the section, which calls get_nn_calculate_precision_recall_atN() in a loop
- the outputs will be: the average recall value, a recall histogram png, and a diagnostic histogram for parents with recall 0 (if the parameter is selected)
- example outputs can be found in Table 1 and Figures S1 and S2, respectively, of the bioRxiv manuscript referenced below
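
The idea behind a nearest-neighbor recall evaluation can be sketched as follows: for a parent domain, retrieve its N nearest neighbors in embedding space and measure which fraction of its known children appears among them. This is an illustrative sketch, not the repository's get_nn_calculate_precision_recall_atN(); cosine similarity, the choice of N, and all identifiers and vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_n(parent, children, embeddings, n):
    """Fraction of the parent's children found among its n nearest neighbors."""
    candidates = [d for d in embeddings if d != parent]
    ranked = sorted(
        candidates,
        key=lambda d: cosine(embeddings[parent], embeddings[d]),
        reverse=True,
    )
    neighbors = set(ranked[:n])
    return len(neighbors & set(children)) / len(children)


# Invented two-dimensional embeddings for one parent and three other domains.
embeddings = {
    "IPR_parent": [1.0, 0.0],
    "IPR_child_a": [0.9, 0.1],
    "IPR_child_b": [0.0, 1.0],
    "IPR_other": [0.8, 0.3],
}
recall = recall_at_n("IPR_parent", ["IPR_child_a", "IPR_child_b"], embeddings, n=2)
print(recall)
```

In this toy example only one of the two children ranks among the two nearest neighbors, so the recall is 0.5; averaging this value over all parents gives an average recall.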

SCOPe and EC

Data acquisition:
- For InterPro version 75.0, download and decompress the interpro.xml.gz file
Parse interpro.xml:
- uncomment the EC & SCOPe section in intrinsic_eval_run.py
- parse the XML to get the available SCOPe and EC labels per domain using parse_and_save_EC_SCOP()
- interpro2EC_SCOPe.tab will be created; a sample of the first 100 lines of the full file is saved at sample file
Run evaluation
- initialize the EC_SCOP_Evaluate() class for evaluation using EC or SCOPe
- run the evaluation with the rest of the section, which calls run_classification() in a loop
- the average test accuracy over 5-fold cross-validation will be printed; example values can be found in Tables 2 and 3 of the bioRxiv manuscript referenced below

GO molecular function

Data acquisition:
- For InterPro version 75.0, download the interpro2go file and add the suffix .txt

For each organism (malaria, ecolik12, yeast, human), follow these steps:

Parse interpro2go.txt:
- uncomment the GOEvaluate section in intrinsic_eval_run.py
- parse the txt file using convert_go_labels(), which produces:
  - interpro2go_organism_MF.tab, containing the unprocessed available GO MF labels per domain; a sample of the first 100 lines of the full file for yeast is saved at sample file
  - interpro2go_yeast_MF_labels.csv, containing the GO MF labels after abstracting them; a sample of the first 100 lines of the full file for yeast is saved at sample file
Run evaluation
- initialize the GOEvaluate() class for evaluation in the selected organism
- run the evaluation with the rest of the section, which calls run_classification() in a loop
- the average test accuracy over 5-fold cross-validation will be printed; an example can be found in Table 4 of the bioRxiv manuscript referenced below
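
For orientation on the parsing step, the sketch below reads interpro2go-style lines into a mapping from InterPro id to GO ids. It assumes the usual layout of that file, with comment lines starting with `!` and mapping lines of the form `InterPro:<id> <name> > GO:<term> ; GO:<id>`. It is not the repository's convert_go_labels(): it does not restrict labels to molecular function, filter by organism, or abstract the labels.

```python
import io
import re
from collections import defaultdict

# Captures the InterPro id at the start and the GO id after the last ';'.
LINE_RE = re.compile(r"^InterPro:(IPR\d+)\s.*;\s*(GO:\d+)\s*$")


def parse_interpro2go(handle):
    """Return {interpro_id: [go_id, ...]} from interpro2go-style lines."""
    mapping = defaultdict(list)
    for line in handle:
        if line.startswith("!"):
            continue  # header/comment line
        match = LINE_RE.match(line)
        if match:
            mapping[match.group(1)].append(match.group(2))
    return dict(mapping)


# Illustrative lines in the assumed layout.
sample = io.StringIO(
    "!example header line\n"
    "InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:DNA binding ; GO:0003677\n"
    "InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:nucleus ; GO:0005634\n"
)
domain2go = parse_interpro2go(sample)
print(domain2go)
```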

Downstream evaluation

Extract non-redundant domains from the proteins in the data set:
- match each protein id to the tabular file with domain architectures (the result of step 2 of building domain architectures) to get the domains for each protein, as shown in fasta2csv
For the remaining proteins, run InterProScan and convert the annotation to the selected type of domain annotation:
- install InterProScan as described in the InterProScan wiki
- run InterProScan with the data set proteins as input; the output is a tsv file
- gzip the tsv and parse it as in parse_prot2in; the output is a tabular file (same columns as the tabular file of the previous step)
- run fasta2csv again, as shown in fasta2csv after interproscan
For the rest of the proteins without identified domains, create a default domain per protein as shown in fasta2default
- update the data set protein domains by running fasta2csv for the last time, as shown in fasta2csv after default domains
Preprocess data sets for learning:
- split train and test
- create inner cross-validation from the training set as shown in create data set splits
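
As a rough picture of what an inner cross-validation split over the training set involves, the following sketch shuffles the training indices once and yields k train/validation partitions. The fold count, the seed, and the plain (non-stratified) split are assumptions; the repository's create data set splits code may work differently.

```python
import random


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed keeps splits reproducible
    folds = [indices[i::k] for i in range(k)]  # k near-equal disjoint folds
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Ten hypothetical training examples split into five inner folds.
splits = list(k_fold_indices(10, k=5))
for train, validation in splits:
    print(len(train), len(validation))
```

Each example lands in exactly one validation fold, so every training example is used for validation once across the k rounds.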

Data and example code to run cross-validation and performance experiments for three data sets are found in the downstream evaluation folder:

TargetP
Toxin
NEW

Pretrained dom2vec

Pretrained dom2vec embeddings can be downloaded from the Research Data portal of Leibniz University Hannover at dom2vec_pretrained.

Citation

This repository is the implementation of the research work: "Capturing Protein Domain Structure and Function Using Self-Supervision on Domain Architectures" (link).

Please cite as:

@article{melidis2021capturing,
  title={Capturing Protein Domain Structure and Function Using Self-Supervision on Domain Architectures},
  author={Melidis, Damianos P and Nejdl, Wolfgang},
  journal={Algorithms},
  volume={14},
  number={1},
  pages={28},
  year={2021},
  publisher={Multidisciplinary Digital Publishing Institute}
}