CrossLingualELMo

Cross-Lingual Alignment of Contextual Word Embeddings

This repo will contain the code and models for the NAACL 2019 paper Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing.

More pieces of the code will be released soon.

Updates:

Aligned Multilingual Deep Contextual Word Embeddings

Embeddings

The following models were trained on Wikipedia. We provide the alignment of the first LSTM output of ELMo to English. The English file contains the identity matrix divided by the average norm for that layer.

Language     Model weights    Alignment matrix (First LSTM layer) *
English      weights.hdf5     en_best_mapping.pth
Spanish      weights.hdf5     es_best_mapping.pth
French       weights.hdf5     fr_best_mapping.pth
Italian      weights.hdf5     it_best_mapping.pth
Portuguese   weights.hdf5     pt_best_mapping.pth
Swedish      weights.hdf5     sv_best_mapping.pth
German       weights.hdf5     de_best_mapping.pth

* Alignments for layer 0 (pre-LSTM) and layer 2 (post-LSTM) for all of the above languages: alignments_0_2.zip
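
As noted above, the English "alignment" is simply a scaled identity matrix. Conceptually it looks like the following sketch (the 1024 dimensionality matches ELMo's layer outputs; the average norm value is a placeholder):

import numpy as np

dim = 1024        # dimensionality of an ELMo layer output
avg_norm = 1.0    # placeholder: the precomputed average embedding norm for that layer
en_alignment = np.eye(dim) / avg_norm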

Download helpers:
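
For example, the files can be fetched with a small helper like the sketch below (BASE_URL is a placeholder; the actual URLs are the ones linked in the table above):

import os
import urllib.request

BASE_URL = "https://example.com/elmo_alignments"  # placeholder: use the real links from the table

def download(filename, out_dir="models"):
    """Fetch one of the model / alignment files listed above."""
    os.makedirs(out_dir, exist_ok=True)
    out_path = os.path.join(out_dir, filename)
    urllib.request.urlretrieve(BASE_URL + "/" + filename, out_path)
    return out_path

# download("es_best_mapping.pth")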

Generating anchors

To generate your own anchors, use the gen_anchors.py script. You will need a trained ELMo model, text files with one sentence per line, and a vocab file with one token per line listing the tokens you wish to compute anchors for. Run gen_anchors.py -h for more details.

Usage

Generating aligned contextual embeddings

Given the output of a specific layer from ELMo (the contextual embeddings), run:

import numpy as np
import torch

aligning = torch.load(aligning_matrix_path)  # per-language .pth file from the table above
aligned_embeddings = np.matmul(embeddings, aligning.transpose())

An example can be seen in demo.py.
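
For context, here is a minimal end-to-end sketch of the same idea using allennlp's ElmoEmbedder (the options, weight, and alignment file names are placeholders for the files linked in the table above):

import numpy as np
import torch
from allennlp.commands.elmo import ElmoEmbedder

# Placeholders: use the per-language files from the table plus the matching options file.
elmo = ElmoEmbedder(options_file="options.json", weight_file="weights.hdf5")

# embed_sentence returns an array of shape (3, num_tokens, 1024): layers 0, 1 and 2.
layers = elmo.embed_sentence(["Esta", "es", "una", "prueba", "."])
first_lstm_output = layers[1]

# Align the first LSTM layer to the shared (English) space.
aligning = torch.load("es_best_mapping.pth")
aligned_embeddings = np.matmul(first_lstm_output, aligning.transpose())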

Replicating the zero-shot cross-lingual dependency parsing results

  1. Create an environment to install our fork of allennlp:
virtualenv -p /usr/bin/python3.6 allennlp_env

or, if you are using conda:

conda create -n allennlp_env python=3.6
  2. Activate the environment and install allennlp:
source allennlp_env/bin/activate
pip install -r requirements.txt
  3. Download the uni-dep-tb dataset (version 2) and follow the instructions to generate the English PTB data
  4. Train the model (the provided configuration is for 'es' as a target language):
TRAIN_PATHNAME='universal_treebanks_v2.0/std/**/*train.conll' \
DEV_PATHNAME='universal_treebanks_v2.0/std/**/*dev.conll' \
TEST_PATHNAME='universal_treebanks_v2.0/std/**/*test.conll' \
allennlp train training_config/multilang_dependency_parser.jsonnet -s path_to_output_dir;
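
After training, the archived model can be scored on a held-out target-language file; a sketch, assuming the standard allennlp evaluate command works with this fork and that the test file follows the uni-dep-tb naming used in the globs above:

allennlp evaluate path_to_output_dir/model.tar.gz universal_treebanks_v2.0/std/es/es-universal-test.conll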

Using in any model

The alignments can be used within the AllenNLP framework by using any model with ELMo embeddings and replacing the paths in the configuration with our provided models.

Each ELMo model was trained on the Wikipedia of the relevant language. To align the models, you will need to add the following code to your model:

Load the alignment matrix in the __init__() function:

aligning_matrix_path = ...  # path to the .pth file for the target language
self.aligning_matrix = torch.FloatTensor(torch.load(aligning_matrix_path))
self.aligning = torch.nn.Linear(self.aligning_matrix.size(0), self.aligning_matrix.size(1), bias=False)
self.aligning.weight = torch.nn.Parameter(self.aligning_matrix, requires_grad=False)

Then, simply apply the alignment on the embedded tokens in the forward() pass:

embedded_text = self.aligning(embedded_text)
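
Putting the two pieces together, a minimal sketch of a wrapper module (the class name and constructor argument are illustrative, not part of the repo):

import torch

class AlignedEmbeddings(torch.nn.Module):
    def __init__(self, aligning_matrix_path):
        super().__init__()
        # Load the per-language alignment matrix and freeze it inside a bias-free linear layer.
        aligning_matrix = torch.FloatTensor(torch.load(aligning_matrix_path))
        self.aligning = torch.nn.Linear(aligning_matrix.size(0), aligning_matrix.size(1), bias=False)
        self.aligning.weight = torch.nn.Parameter(aligning_matrix, requires_grad=False)

    def forward(self, embedded_text):
        # embedded_text: (batch, num_tokens, dim) ELMo output for the aligned layer
        return self.aligning(embedded_text)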

Citation

If you find this repo useful, please cite our paper.

@InProceedings{Schuster2019,
    title = "Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing",
    author = "Schuster, Tal  and
      Ram, Ori  and
      Barzilay, Regina  and
      Globerson, Amir",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1162",
    pages = "1599--1613"
}