Transfer Learning for Entity Recognition of Novel Classes

This repository contains code, datasets, and results for the paper:

Rodriguez, Caldwell and Liu, "Transfer Learning for Entity Recognition of Novel Classes". COLING, 2018.

This paper can be downloaded from http://aclweb.org/anthology/C18-1168

In addition to this, it includes:

NOTE: Additional datasets (beyond those used in this project) and associated utility functions are available at: https://github.com/juand-r/entity-recognition-datasets

Set up

This code was tested on Linux with Python 2. The required Python packages are listed in requirements.txt. In addition, keras-contrib (2.0.8) must be installed; keras-contrib2.0.8.tar.gz is included in this repository. Extract the archive and run the following from the extracted directory to install keras-contrib:

python setup.py install

To run the experiments, first make sure all the datasets are in the appropriate directories. In particular, the files for the CONLL 2003 dataset (eng.train, eng.testa, eng.testb) must be placed in the directory data/conll2003. See the README files in the data directory for more information.

The BiLSTM-CRF experiments require pre-trained word embeddings; we used GloVe. Download the pretrained Stanford GloVe embeddings from http://nlp.stanford.edu/data/glove.6B.zip and place the file glove.6B.100d.txt.gz in src/word_embeddings.
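
The placement step can be scripted. The sketch below assumes that glove.6B.zip stores the 100-dimensional vectors as an uncompressed member named glove.6B.100d.txt, so that the file has to be gzip-compressed to obtain glove.6B.100d.txt.gz. The function name gzip_glove_member is hypothetical and is not part of this repository.

```python
import gzip
import os
import shutil
import zipfile


def gzip_glove_member(zip_path, out_dir, member="glove.6B.100d.txt"):
    """Copy one member of the GloVe zip archive into out_dir as a .gz file."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    out_path = os.path.join(out_dir, member + ".gz")
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive holds the plain-text member named above.
        # Stream it so the large text file is never fully held in memory.
        with archive.open(member) as src, gzip.open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return out_path
```

With the archive downloaded to the repository root, calling gzip_glove_member("glove.6B.zip", "src/word_embeddings") should leave the compressed file where the setup instructions expect it.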

Datasets

Due to licensing restrictions, GUM and re3d were the only datasets from our paper that could be included in this repository. Since each of the datasets comprising re3d has a different license, the train/test split of re3d used in our paper is not included. However, it can be easily generated following the directions in data/re3d/CONLL-format/data/README.md.

Instructions for obtaining the other datasets used in the paper are found in each of the corresponding dataset directories, together with directions for where to place them. The file locations will correspond to those listed in the file src/file_locations.cfg. Ritter's Twitter dataset, the MIT Movie Corpus and the MIT Restaurant Corpus can be downloaded and are already in the CONLL 2003 format. The remaining datasets are in different formats; tools are included to convert them to the CONLL 2003 format.
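
For readers unfamiliar with the CONLL 2003 format, here is a minimal reader sketch. It assumes one token per line with whitespace-separated columns, the token in the first column and the entity tag in the last, blank lines between sentences, and optional -DOCSTART- marker lines. The function read_conll is illustrative only and is not one of the conversion tools shipped in this repository.

```python
import io


def read_conll(path, encoding="utf-8"):
    """Return a list of sentences, each a list of (token, tag) pairs."""
    sentences, current = [], []
    with io.open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            # Blank lines and -DOCSTART- markers end the current sentence.
            if not line or line.startswith("-DOCSTART-"):
                if current:
                    sentences.append(current)
                    current = []
                continue
            columns = line.split()
            # Assumption: token in the first column, entity tag in the last.
            current.append((columns[0], columns[-1]))
    # Keep the final sentence if the file does not end with a blank line.
    if current:
        sentences.append(current)
    return sentences
```

Reading a file this way is a quick check that a converted dataset has the expected sentence and tag structure before it is used in an experiment.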

In addition, we include instructions for obtaining several other NER datasets not used in the COLING paper, but which may be of interest. These are:

Because of licensing restrictions, only Wikigold, AnEM, WNUT 2017 and SEC-filings could be included in this repository. New NER datasets will be added to https://github.com/juand-r/entity-recognition-datasets

Dataset licenses

For a summary of the dataset licenses, see data/LICENSES_SUMMARY.rst. Each data directory also includes the license for that dataset.

Reproducing the experiments

The directory src/experiments contains several subdirectories of the form CONLL03_to_X, where X is the name of the target corpus. Each of these subdirectories contains a .cfg configuration file specifying which parameters to use. These include the source corpus, the target corpus, the random seeds, the number of training sentences, the algorithms to evaluate, and the transfer learning method. The transfer learning methods listed there do not include the neural methods, which are run separately because they take more time.
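
To see which experiments are available, the directory layout described above can be scanned with a few lines of Python. This is only a sketch: list_experiments is a hypothetical helper, not part of the repository, and it relies solely on the CONLL03_to_X naming convention and the .cfg extension.

```python
import glob
import os


def list_experiments(experiments_dir):
    """Map each target corpus name X to the .cfg files in CONLL03_to_X."""
    prefix = "CONLL03_to_"
    found = {}
    for path in sorted(glob.glob(os.path.join(experiments_dir, prefix + "*"))):
        if os.path.isdir(path):
            # Strip the prefix to recover the target corpus name.
            target = os.path.basename(path)[len(prefix):]
            found[target] = sorted(glob.glob(os.path.join(path, "*.cfg")))
    return found
```

Calling list_experiments("src/experiments") from the repository root returns a dictionary keyed by target corpus, which makes it easy to spot a subdirectory whose configuration file is missing.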

At the moment, the experiments must be run in the following order: the CRF experiments first, then the BiLSTM-CRF experiments.

CRF experiments

To run the CRF experiments (CRF-TGT, PRED, and PRED-CCA), run, for example:

import experiment
experiment.run_experiment('CONLL03_to_GUM')

This runs all the experiments specified in the configuration file and saves the results in the directory CONLL03_to_GUM. The results include the scores for each run (in a results.txt file), the raw predictions (in predicted.conll), and the macro- and micro-averaged results in a pkl file, which can be loaded with pandas.
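
As a reminder of how the two averages differ, the sketch below computes micro- and macro-averaged F1 from per-class counts using the standard definitions. It is an illustration only: the function names and the input layout are assumptions, and the repository's own scoring code may organize the computation differently.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom else 0.0


def micro_macro_f1(counts):
    """counts maps entity class -> (tp, fp, fn); returns (micro, macro)."""
    # Micro: pool the counts over all classes, then compute a single F1.
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1(*totals)
    # Macro: compute F1 per class, then take the unweighted mean.
    per_class = [f1(*c) for c in counts.values()]
    macro = sum(per_class) / len(per_class) if per_class else 0.0
    return micro, macro
```

Micro-averaging is dominated by frequent classes, while macro-averaging weights every class equally, so the two can diverge noticeably on corpora with rare entity types.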

BiLSTM-CRF experiments

Once the CRF experiments have been run, the BiLSTM-CRF experiments can be run in two steps. First, train the network on the source (CONLL 2003) corpus. This only needs to be done once, since the vocabulary for the word embeddings is the union of the source vocabulary and the vocabularies of all possible target datasets. To do so, run:

import train_bilstm_model as tbm
max_len, we, w2i, words = tbm.get_embeddings()
history, score = tbm.fit_and_test_model(max_len, we, w2i, words)

Then, to fine-tune the neural network, run:

import load_pretrained
load_pretrained.make_reports(tgt_corpus, 'rmsprop_v3sgd', tlayers)

where tgt_corpus is 'GUM', 'TwitterRitter', 're3d', 'NIST99', 'MUC6', 'MITRestaurant', 'MITMovie', 'i2b2-14', 'i2b2-06' or 'CADEC', and tlayers can be 'N' (no transfer, train from scratch), 'E' (transfer the embedding layer only) or 'EL' (transfer both the embedding layer and the BiLSTM layer).

To replicate the results of the paper, run the code using both 'N' and 'EL'. This will create files results_pretrainEL_rmsprop_v3sgd.pkl and results_pretrainN_rmsprop_v3sgd.pkl in the appropriate directory. These can be opened with pandas, and contain the averaged scores for these runs.

In addition, the scores for each run will be saved in a results.txt file, and the raw predictions will also be saved in predicted.conll files.
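
The full set of fine-tuning runs needed for replication can be enumerated as follows. This is a sketch under stated assumptions: replication_plan and the parameter name run_name are hypothetical, and the output file name pattern is inferred from the two file names given above rather than taken from the code.

```python
TARGET_CORPORA = ["GUM", "TwitterRitter", "re3d", "NIST99", "MUC6",
                  "MITRestaurant", "MITMovie", "i2b2-14", "i2b2-06", "CADEC"]


def replication_plan(corpora=TARGET_CORPORA, run_name="rmsprop_v3sgd",
                     transfer_settings=("N", "EL")):
    """Return (tgt_corpus, run_name, tlayers, expected_pkl) for each run."""
    plan = []
    for corpus in corpora:
        for tlayers in transfer_settings:
            # Assumed pattern, inferred from the two file names listed above.
            pkl = "results_pretrain{0}_{1}.pkl".format(tlayers, run_name)
            plan.append((corpus, run_name, tlayers, pkl))
    return plan
```

The first three items of each tuple correspond to the arguments of load_pretrained.make_reports(tgt_corpus, 'rmsprop_v3sgd', tlayers), and the fourth is the pkl file one would expect to find afterwards.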