OntoEMMA ontology matcher

Note: This library is no longer being maintained.

This ontology matcher can be used to generate alignments between knowledge bases.

Installation

Go to the base git directory and run:

./setup.sh

This will create an ontoemma conda environment and install all required libraries.
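Once setup completes, activate the environment before running any of the scripts below:

conda activate ontoemma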

Train OntoEmma

To train an alignment model, use train_ontoemma.py. The wrapper takes the following arguments:

- model type (-p): nn for the neural network model, lr for logistic regression
- model output path (-m): where the trained model is saved
- configuration file path (-c): the training configuration file
- GPU device number (-g): optional

Example usage:

python train_ontoemma.py -p nn -m model_path -c configuration_file.json

This script will then use the train function in OntoEmma.py to train the model. If no GPU is specified, the program defaults to CPU.

OntoEmma module

The OntoEmma module provides access to the training and alignment capabilities of OntoEmma.

Train mode

In training mode, the OntoEmma module can train the model either through the OntoEmmaLRModel logistic regression module or through AllenNLP (for the neural network model):

NN with AllenNLP:

Configuration file:
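As a rough illustration, an AllenNLP experiment configuration is a JSON file along these lines; the type names and data paths below are placeholders, not OntoEmma's actual registered names:

{
    "dataset_reader": {"type": "alignment_reader"},
    "train_data_path": "data/train_alignment.tsv",
    "validation_data_path": "data/dev_alignment.tsv",
    "model": {"type": "alignment_model"},
    "iterator": {"type": "basic", "batch_size": 32},
    "trainer": {"num_epochs": 10, "optimizer": "adam", "cuda_device": 0}
}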

When training other models with OntoEmmaModel, the module performs the following:

Align mode

In alignment mode, the OntoEmma module performs the following:

If using NN model with AllenNLP:

If using logistic regression model:

For all models, following predictions:
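The alignment step can also be driven programmatically. A minimal sketch, assuming the OntoEmma class lives in emma/OntoEmma.py and that its align method mirrors the wrapper's command-line flags (an assumption; check OntoEmma.py for the actual signature):

# Hypothetical programmatic equivalent of run_ontoemma.py; the import path and
# method signature are assumptions based on the wrapper's flags.
from emma.OntoEmma import OntoEmma

matcher = OntoEmma()
matcher.align(
    "nn",                    # model type (-p)
    "model_path",            # trained model path (-m)
    "source_ont.owl",        # source KB (-s)
    "target_ont.owl",        # target KB (-t)
    "input_alignment.tsv",   # input alignment (-i)
    "output_alignment.tsv",  # output alignment (-o)
    0,                       # GPU device (-g)
)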

Candidate selection module

The module CandidateSelection is used to select candidate matched pairs from the source and target KBs.

CandidateSelection is initialized with the following inputs:

The module builds the following token maps:

Candidates are accessed through the select_candidates method, which takes an input research_entity_id from the source KB and returns an ordered list of candidates from the target KB. The candidates are ordered by the sum of their token IDF scores.
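The underlying idea can be shown in a small self-contained sketch (illustrative only, not the module's actual code): build an inverted index from name tokens to target entities, weight each token by its IDF, and rank candidates by the summed IDF of shared tokens.

import math
from collections import defaultdict

def build_index(kb):
    # Map each name token to the set of entity ids whose names contain it.
    index = defaultdict(set)
    for entity_id, name in kb.items():
        for token in name.lower().split():
            index[token].add(entity_id)
    return index

def idf(index, num_entities):
    # Rarer tokens are more informative.
    return {t: math.log(num_entities / len(ids)) for t, ids in index.items()}

def select_candidates(source_name, index, token_idf):
    # Rank target entities by the sum of IDF scores of shared tokens.
    scores = defaultdict(float)
    for token in source_name.lower().split():
        for entity_id in index.get(token, ()):
            scores[entity_id] += token_idf[token]
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage with a two-entity target KB
target_kb = {"T1": "diabetes mellitus", "T2": "mellitus unrelated"}
index = build_index(target_kb)
token_idf = idf(index, len(target_kb))
print(select_candidates("diabetes mellitus type 2", index, token_idf))  # ['T1', 'T2']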

The output of the CandidateSelection module is evaluated using the eval method, which takes as input:

The eval method compares the candidates generated against the gold standard mappings, returning the following:
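For illustration, the core of such a comparison might look like the following sketch, assuming candidates are kept per source entity id and gold mappings as (source, target) pairs (hypothetical structures, not the module's API):

def candidate_recall(candidates_by_source, gold_mappings):
    # Fraction of gold (source, target) pairs recovered by candidate selection.
    hits = sum(
        1 for source_id, target_id in gold_mappings
        if target_id in candidates_by_source.get(source_id, [])
    )
    return hits / len(gold_mappings) if gold_mappings else 0.0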

Feature generation module

The module FeatureGenerator is used to generate features from a candidate pair.

FeatureGenerator is initialized with the following inputs:

The module generates word and character-based n-gram tokens of entity aliases and the canonical names of entity parents and children. The module also uses an nltk stemmer and lemmatizer to produce stemmed and lemmatized versions of canonical name tokens.
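This preprocessing can be sketched with nltk as follows (illustrative only; not the module's actual code):

from nltk.stem import PorterStemmer, WordNetLemmatizer

def char_ngrams(text, n=3):
    # Character n-grams, e.g. 'heart' -> ['hea', 'ear', 'art'].
    return [text[i:i + n] for i in range(len(text) - n + 1)]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # requires: nltk.download('wordnet')

tokens = "myocardial infarctions".split()
print([stemmer.stem(t) for t in tokens])          # ['myocardi', 'infarct']
print([lemmatizer.lemmatize(t) for t in tokens])  # ['myocardial', 'infarction']
print(char_ngrams("heart"))                       # ['hea', 'ear', 'art']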

The calculate_features method is the core of this module. It generates a set of pairwise features between two input entities given by their respective entity ids from the source and target KBs. This is returned as the feature vector used in the OntoEmmaModel.
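A few hand-built pairwise features of the kind such a vector might contain, as a sketch (the actual feature set is defined in FeatureGenerator):

def jaccard(a, b):
    # Token-set Jaccard similarity, a common string-overlap feature.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def example_features(s_name_tokens, t_name_tokens, s_aliases, t_aliases):
    # Illustrative pairwise features between a source and a target entity.
    return [
        jaccard(s_name_tokens, t_name_tokens),         # canonical name overlap
        float(bool(set(s_aliases) & set(t_aliases))),  # any exact alias match
        abs(len(s_name_tokens) - len(t_name_tokens)),  # name length difference
    ]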

UMLS training data

The module extract_training_data_from_umls is used to extract KB and concept mapping data from UMLS for use in ontology matching training and evaluation.

extract_training_data_from_umls takes as inputs:

- the directory containing the UMLS data subsets
- an output directory (OUTPUT_DIR)

UMLS data subsets are currently located at /net/nfs.corp/s2-research/scigraph/data/ontoemma/2017AA_OntoEmma/. OUTPUT_DIR defaults to /net/nfs.corp/s2-research/scigraph/data/ontoemma/umls_output/.

extract_training_data_from_umls produces as output:

- a KB file for each extracted UMLS source vocabulary
- concept mapping files between pairs of extracted KBs (see Mapping file format below)

Data downloads

You will need to download a copy of the UMLS Metathesaurus by following these instructions: https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html

Contexts extracted from the Semantic Scholar API for some UMLS KBs are available in the s3://ai2-s2-ontoemma S3 bucket:

aws s3 cp s3://ai2-s2-ontoemma/contexts/ data/kb_contexts/ --recursive

Once you have downloaded both datasets, update the corresponding path variables in emma/paths.py to point to the appropriate directories.

Sampling negative data

Hard negatives are sampled using the CandidateSelection module, selecting from candidate pairs that are negative matches. Easy negatives are sampled randomly from the rest of the KB. Currently, 5 hard negatives and 5 easy negatives are sampled for each positive match.
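A sketch of this sampling scheme, assuming ranked candidate lists and entity ids as plain Python structures (not the project's actual code):

import random

def sample_negatives(source_id, positive_target_id, ranked_candidates,
                     all_target_ids, n_hard=5, n_easy=5):
    # Hard negatives: top-ranked candidates that are not the true match.
    hard = [t for t in ranked_candidates if t != positive_target_id][:n_hard]
    # Easy negatives: random entities from the rest of the target KB.
    easy_pool = [t for t in all_target_ids
                 if t != positive_target_id and t not in hard]
    easy = random.sample(easy_pool, min(n_easy, len(easy_pool)))
    # Return labeled negative pairs (label 0).
    return [(source_id, t, 0) for t in hard + easy]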

Mapping file format

Mapping files are of the format described in Data format: KB alignment file. For UMLS positive mappings, the provenance is given as <UMLS_header>:<CUI>; for UMLS negative mappings, as <UMLS_header>. Example data consisting of two positive and two negative mappings:

CPT:90281 DRUGBANK:DB00028 1 UMLS2017AA:C0358321
CPT:90283 DRUGBANK:DB00028 1 UMLS2017AA:C0358321
CPT:83937 DRUGBANK:DB00426 0 UMLS2017AA
CPT:1014233 DRUGBANK:DB05907 0 UMLS2017AA
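A mapping line can be parsed with a simple split; a sketch assuming the four whitespace-separated fields shown above (source id, target id, label, provenance):

def parse_mapping_line(line):
    # Split one mapping line into its four fields.
    source_id, target_id, label, provenance = line.split()
    return source_id, target_id, int(label), provenance

print(parse_mapping_line("CPT:90281 DRUGBANK:DB00028 1 UMLS2017AA:C0358321"))
# ('CPT:90281', 'DRUGBANK:DB00028', 1, 'UMLS2017AA:C0358321')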

Run OntoEmma (align two KBs using trained model)

To run OntoEmma, use run_ontoemma.py. The wrapper takes the following arguments:

- model type (-p): nn or lr
- trained model path (-m)
- source ontology file (-s)
- target ontology file (-t)
- input alignment file (-i)
- output alignment file (-o)
- GPU device number (-g): optional

Example usage:

python run_ontoemma.py -p nn -m model_path -s source_ont.owl -t target_ont.owl -i input_alignment.tsv -o output_alignment.tsv -g 0

This script assumes that the model has been pre-trained, and uses align functions in OntoEmma.py accordingly.

Human annotations for evaluation

Available here

Other

String processing utilities are found in string_utils.py.

Constants used by OntoEmma are found in constants.py. Input training data and training model parameters are specified in the training configuration files in config/.
