Awesome
Code for running the entity linking model. This is part of the code for the xelms project.
Requirements
- pytorch (0.2.0+21f8ad4): installed from source, and patched for sparse tensor operations (instructions below).
- python3.
- cogcomp-nlpy.
- Download the resources and trained models here and place them in the folder
xling-el/data
. Right now, pre-trained models are available for German, Spanish, French, Italian, and Chinese.
Resources for Candidate Generation
- First set up candidate generation and other resources as described in projects wikidump_preprocessing and wiki_candgen.
- A mongo daemon needs to be running. This is where the resources generated in wiki_candgen will be kept for fast (and parallel) access.
Note: These resources are provided in the resources directory downloaded in step 4. above, so ideally you do not need to regenerate them, unless you plan to use a newer Wikipedia dump or a larger knowledge base.
Patching Pytorch for Sparse Tensor Operations
This is best done in a new conda environment.
- First checkout the
sparse_patch
branch from this repository.
git clone https://github.com/shyamupa/pytorch
cd pytorch
git checkout sparse_patch
- Install the patched code from source using the following commands,
export CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" # [anaconda root directory]
# Install basic dependencies
conda install numpy pyyaml mkl mkl-include setuptools cmake cffi typing
conda install -c mingfeima mkldnn
cd pytorch_patched
python setup.py install
Ensure that the patched pytorch was successfully installed,
>>> import torch
>>> torch.__version__
'0.2.0+43662e7'
Mention Detection using NER
- For German, Spanish, French and Italian, download relevant Spacy NER Models
pip install spacy
python -m spacy download de_core_news_sm
python -m spacy download es_core_news_md
python -m spacy download fr_core_news_md
python -m spacy download it_core_news_sm
- For Chinese, download stanford corenlp jar and the chinese model jar and place them in a
stanford_jars
directory.
$ ls stanford_jars/
stanford-corenlp-full-2018-10-05
$ ls stanford_jars/stanford-corenlp-full-2018-10-05
...
...
stanford-chinese-corenlp-2018-10-05-models.jar
...
And set the bash environment variable CORENLP_HOME
to path/to/stanford_jars/stanford-corenlp-full-2018-10-05
.
export CORENLP_HOME=path/to/stanford_jars/stanford-corenlp-full-2018-10-05
Running the Model
To run the model, use the command,
./run_inference_on_doc.sh <lang> <infile> <outfile>
For instance, for running on a German document test_docs/de_doc.txt
, one would run
./run_inference_on_doc.sh de test_docs/de_doc.txt test_docs/de_doc_output.txt
The json output will be produced in test_docs/de_doc_output.txt
.
Output
The output file is a json serialized text annotation, with a view named NEURAL_XEL_<lang>
. The view consists of a list of the
constituents that have been linked to a Wikipedia title. Below is the output for the German test document provided in the repo,
...
"viewName": "NEURAL_XEL_de",
...
...
"constituents": [
{
"end": 2,
"label": "en.wikipedia.org/wiki/Angela_Merkel",
"score": 0.5128146075318596,
"start": 0,
"tokens": "Angela Merkel"
},
{
"end": 5,
"label": "NULLTITLE",
"score": 0.05000000074505806,
"start": 4,
"tokens": "Elim-Krankenhaus"
},
...
The label field for each constituent is the predicted Wikipedia entity for the span identified by the start
and end
token index.
Here a label of NULLTITLE
means that the named entity detected by the mention detection system could not be linked to any entity.