


Named Entity Disambiguation for Noisy Text

This repository contains code for reproducing the experiments in our "Named Entity Disambiguation for Noisy Text" submission to CoNLL 2017. Some manual steps are required to set up the data for the experiments.

Python libraries required for the project: keras, tensorflow, numpy, nltk, json, unicodedata, unidecode, mysql.connector, urllib, matplotlib, zipfile, ujson, pandas, urlparse, sklearn.
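Before starting the data setup, it can help to check which of the libraries listed above are importable in the current environment. The snippet below is a minimal sketch and not part of the repository: the helper name find_missing is hypothetical, and it only checks that each module can be located, not that its version is compatible. Note that urlparse is a Python 2 module name (on Python 3 the same functionality lives in urllib.parse), so it will likely be reported as missing on Python 3.

```python
import importlib.util

# Module names exactly as listed in this README.
REQUIRED_MODULES = [
    "keras", "tensorflow", "numpy", "nltk", "json", "unicodedata",
    "unidecode", "mysql.connector", "urllib", "matplotlib", "zipfile",
    "ujson", "pandas", "urlparse", "sklearn",
]


def find_missing(modules):
    """Return the names in `modules` that cannot be located for import.

    Hypothetical helper for illustration; not part of the repository.
    """
    missing = []
    for name in modules:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # ImportError: the parent package of a dotted name (e.g. "mysql")
            # is absent. ValueError: the module is loaded but has no spec.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Any name it reports can then be installed with the package manager of your choice; the package name on PyPI may differ from the module name (for example, sklearn is distributed as scikit-learn).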

For running the CoNLL test

Setting up the data

Run ./setup_data.sh to set up the data.

For the WikilinksNED evaluation, run ./setup_wikilinks.sh <db_user> <db_pass> <db_schema> <db_ip> to set up its data.

For the CoNLL evaluation, run ./setup_conll.sh <db_user> <db_pass> <db_schema> <db_ip> to set up its data.

REMARK: for now, you must use the script at https://github.com/NoamGit/Wiki2JSON/ to convert Wikilinks from its original format (thrift) to JSON, which is easier to work with. The resulting files should be stored at data/wikilinks/unprocessed. We are working to automate this process as soon as possible.

Running evaluations

Run ./evaluateWikilinksNED.sh <db_user> <db_pass> <db_schema> <db_ip> for the WikilinksNED evaluation.

Run ./pretrainOnWikipedia.sh <db_user> <db_pass> <db_schema> <db_ip> and then ./evaluateCoNLL.sh <db_user> <db_pass> <db_schema> <db_ip> for the CoNLL evaluation.

Results are written to the evaluation.txt file in the main directory.
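The CoNLL workflow involves several scripts that all take the same four database arguments, so the order is easy to get wrong. The sketch below assembles the commands in the order this README lists them. It is an illustration only, under these assumptions: the helper names conll_commands and run_all are hypothetical, the scripts are run from the main directory, and ./setup_data.sh is assumed to be needed before ./setup_conll.sh.

```python
import subprocess


def conll_commands(db_user, db_pass, db_schema, db_ip):
    """Build the CoNLL setup and evaluation commands, in README order.

    Hypothetical helper; the script names come from this README, the
    ordering of the two setup scripts is an assumption.
    """
    db_args = [db_user, db_pass, db_schema, db_ip]
    return [
        ["./setup_data.sh"],                     # general data setup
        ["./setup_conll.sh"] + db_args,          # CoNLL-specific data setup
        ["./pretrainOnWikipedia.sh"] + db_args,  # pretraining step
        ["./evaluateCoNLL.sh"] + db_args,        # evaluation
    ]


def run_all(commands):
    """Run each command in turn, stopping at the first failure."""
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)


# Example (placeholder credentials, not real values):
# run_all(conll_commands("user", "password", "schema", "127.0.0.1"))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems if the database password contains spaces or shell metacharacters.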


This code is provided as-is. Running this code or any part of it is at your own risk. We do not take any responsibility for running any of the code or usage of any of the data. Much of this code was written at the same time as I was learning Keras, TF and Theano. It has undergone many changes and was used for extensive experimentation. It is therefore probably full of design flaws and redundancies.