# Neural Cross-Lingual Named Entity Recognition with Minimal Resources

This is the code we used in our paper:

> **Neural Cross-Lingual Named Entity Recognition with Minimal Resources**
> Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A. Smith, Jaime Carbonell
> EMNLP 2018
## Requirements

- Python 2.7 or 3.6
- PyTorch >= 0.3.0
- Theano 1.0
- Lasagne 0.2
The results reported in the paper were tuned and obtained with the NER model written in Theano/Lasagne; everything else is in PyTorch. We also provide a PyTorch implementation of the NER model, which may produce slightly worse results due to implementation differences between the two libraries, such as different weight initialization schemes.
## Train Bilingual Word Embeddings

To train bilingual word embeddings, we use MUSE.

After installing MUSE, to obtain a mapping (e.g., en-es, seeded with identical character strings), first set `VALIDATION_METRIC = 'mean_cosine-csls_knn_10-S2T-10000'` in `supervised.py`, and then run, for instance:

```bash
python supervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 3 --dico_train identical_char --max_vocab 100000
```

This will produce a mapping at a location such as `/your_path/MUSE/dumped/debug/qbun3algl8/best_mapping.pth`.
To create a word-to-word translation file, run:

```bash
./run_load_muse.sh
```

Note: if the first line of your embedding file specifies the vocabulary size and embedding dimension (e.g., `2519370 300`), remove it before running this script (but include it when running MUSE).
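Conceptually, a word-to-word translation file like this is built by retrieving, for each source word, its closest target word, and MUSE ranks candidates with the CSLS similarity rather than plain cosine. A toy sketch of CSLS retrieval follows; the function name and the small neighborhood size `k` are chosen for illustration and are not the script's actual implementation.

```python
import numpy as np

def normalize(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def csls_translate(src, tgt, k=2):
    """For each source vector, return the index of the best target
    vector under CSLS: 2*cos(x, y) penalized by each word's mean
    similarity to its k nearest cross-lingual neighbors, which
    discourages 'hub' words that are close to everything."""
    cos = normalize(src) @ normalize(tgt).T
    # Mean cosine of each source word to its k nearest target words.
    knn_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    # Mean cosine of each target word to its k nearest source words.
    knn_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return np.argmax(2 * cos - knn_src - knn_tgt, axis=1)

rng = np.random.default_rng(0)
src = rng.standard_normal((4, 50))
tgt = src + 0.01 * rng.standard_normal((4, 50))  # near-copies of src
print(csls_translate(src, tgt))  # each source word retrieves its near-copy
```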
## Data Format

We use the IOB2 tagging scheme, with NER data in the following format:

```
Peter B-PER
Blackburn I-PER
```
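For reference, a minimal sketch of reading files in this format, where blank lines separate sentences (the function name is our own, not part of this repo):

```python
def read_conll(lines):
    """Parse IOB2-tagged lines into sentences of (token, tag) pairs.
    Sentences are separated by blank lines."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:                   # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
        else:
            token, tag = line.split()  # e.g. "Peter B-PER"
            current.append((token, tag))
    if current:                        # flush a trailing sentence
        sentences.append(current)
    return sentences

data = ["Peter B-PER", "Blackburn I-PER", "", "rejects O"]
print(read_conll(data))
# [[('Peter', 'B-PER'), ('Blackburn', 'I-PER')], [('rejects', 'O')]]
```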
## Transfer Training Data

Simply run:

```bash
./run_transfer_training_data.sh
```
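Conceptually, this step translates the source-language training data word by word using the bilingual dictionary, while copying each word's NER tag over unchanged. A toy illustration of that idea follows; the dictionary entries and the copy-through fallback for out-of-vocabulary words are illustrative assumptions, not the script's exact behavior.

```python
def translate_training_data(sentence, dictionary):
    """Translate a tagged sentence word by word, keeping the tags.
    Words missing from the dictionary are copied through unchanged
    (an illustrative fallback; the actual script may differ)."""
    return [(dictionary.get(token, token), tag) for token, tag in sentence]

# Hypothetical en-es entries from the word-to-word translation file.
en_es = {"Peter": "Pedro", "rejects": "rechaza"}
sent = [("Peter", "B-PER"), ("Blackburn", "I-PER"), ("rejects", "O")]
print(translate_training_data(sent, en_es))
# [('Pedro', 'B-PER'), ('Blackburn', 'I-PER'), ('rechaza', 'O')]
```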
## Train Cross-Lingual NER Model

To reproduce our results with the Lasagne/Theano implementation, run:

```bash
./run_lasagne_ncrf.sh
```

For the PyTorch implementation, run:

```bash
./run_pytorch_ncrf.sh
```