MUSE: Multilingual Unsupervised and Supervised Embeddings

MUSE is a Python library for multilingual word embeddings, whose goal is to provide the community with:

- state-of-the-art multilingual word embeddings (fastText embeddings aligned in a common space)
- large-scale high-quality bilingual dictionaries for training and evaluation

We include two methods, one supervised that uses a bilingual dictionary or identical character strings, and one unsupervised that does not use any parallel data (see Word Translation without Parallel Data for more details).

Dependencies

MUSE is available on CPU or GPU, in Python 2 or 3, and requires:

- Python 2/3 with NumPy/SciPy
- PyTorch
- Faiss (recommended) for fast nearest-neighbor search (CPU or GPU)

Faiss is optional for GPU users - though Faiss-GPU will greatly speed up nearest-neighbor search - and highly recommended for CPU users. It can be installed with "conda install faiss-cpu -c pytorch" or "conda install faiss-gpu -c pytorch".
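To give a concrete sense of what Faiss accelerates, here is a minimal sketch of exact nearest-neighbor search over an embedding matrix. The matrix below is random placeholder data; this is an illustration, not MUSE's internal retrieval code:

import numpy as np
import faiss

dim = 300
emb = np.random.rand(10000, dim).astype('float32')   # placeholder embeddings
emb /= np.linalg.norm(emb, axis=1, keepdims=True)    # unit norm: inner product = cosine

index = faiss.IndexFlatIP(dim)             # exact inner-product (cosine) search
index.add(emb)                             # index the target-side embeddings
scores, ids = index.search(emb[:5], 10)    # 10 nearest neighbors of the first 5 words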

Get evaluation datasets

To download monolingual and cross-lingual word embeddings evaluation datasets, you can simply run:

cd data/
wget https://dl.fbaipublicfiles.com/arrival/vectors.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/wordsim.tar.gz
wget https://dl.fbaipublicfiles.com/arrival/dictionaries.tar.gz

Alternatively, you can also download the data with:

cd data/
./get_evaluation.sh

Note: Requires bash 4. The download of Europarl is disabled by default (it is slow); you can enable it in get_evaluation.sh.

Get monolingual word embeddings

For pre-trained monolingual word embeddings, we highly recommend fastText Wikipedia embeddings, or training your own word embeddings on your corpus with fastText.

You can download the English (en) and Spanish (es) embeddings this way:

# English fastText Wikipedia embeddings
curl -Lo data/wiki.en.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.en.vec
# Spanish fastText Wikipedia embeddings
curl -Lo data/wiki.es.vec https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.es.vec
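If you prefer to train your own monolingual embeddings, the sketch below uses the fasttext Python package on a tokenized plain-text corpus and writes the .vec text format that MUSE reads. The paths corpus.txt and data/wiki.xx.vec are hypothetical:

import fasttext

# train skipgram embeddings (dim must match across the languages you align)
model = fasttext.train_unsupervised('corpus.txt', model='skipgram', dim=300)

# write the .vec text format: a "n_words dim" header, then one word + vector per line
words = model.get_words()
with open('data/wiki.xx.vec', 'w') as f:
    f.write('%i %i\n' % (len(words), model.get_dimension()))
    for w in words:
        vec = ' '.join('%.4f' % x for x in model.get_word_vector(w))
        f.write('%s %s\n' % (w, vec))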

Align monolingual word embeddings

This project includes two ways to obtain cross-lingual word embeddings:

- Supervised: using a train bilingual dictionary (or identical character strings as anchor points), learn a mapping from the source to the target space using (iterative) Procrustes alignment.
- Unsupervised: without any parallel data or anchor point, learn a mapping from the source to the target space using adversarial training and (iterative) Procrustes refinement.

For more details on these approaches, please check Word Translation without Parallel Data.

The supervised way: iterative Procrustes (CPU|GPU)

To learn a mapping between the source and the target space, simply run:

python supervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5 --dico_train default

By default, dico_train will point to our ground-truth dictionaries (downloaded above); when set to "identical_char", it will use identical character strings shared by the source and target languages to form the training dictionary. Logs and embeddings will be saved in the dumped/ directory.
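Each refinement step solves an orthogonal Procrustes problem in closed form. The sketch below shows that single step in numpy, assuming X and Y are (n, d) matrices holding the source and target vectors of the n dictionary pairs (random placeholders here):

import numpy as np

n, d = 5000, 300
X = np.random.randn(n, d)    # source vectors of the dictionary pairs (placeholder)
Y = np.random.randn(n, d)    # target vectors of the dictionary pairs (placeholder)

# W* = argmin_W ||X W^T - Y||_F over orthogonal W has the closed-form
# solution W = U V^T, where U S V^T is the SVD of Y^T X.
U, _, Vt = np.linalg.svd(Y.T @ X)
W = U @ Vt
mapped = X @ W.T             # source embeddings mapped into the target space

Iterative refinement alternates this step with building a new, higher-quality dictionary from the current alignment.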

The unsupervised way: adversarial training and refinement (CPU|GPU)

To learn a mapping using adversarial training and iterative Procrustes refinement, run:

python unsupervised.py --src_lang en --tgt_lang es --src_emb data/wiki.en.vec --tgt_emb data/wiki.es.vec --n_refinement 5
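The core adversarial idea fits in a few lines of PyTorch: a discriminator learns to tell mapped source embeddings from target embeddings, while the mapping W is trained to fool it. This is a simplified illustration on random placeholder embeddings, not the actual unsupervised.py (which adds orthogonalization of W, label smoothing, and model selection):

import torch
import torch.nn as nn

d, bs = 300, 32
src_emb = torch.randn(200000, d)   # placeholder source embeddings
tgt_emb = torch.randn(200000, d)   # placeholder target embeddings

mapping = nn.Linear(d, d, bias=False)                      # the matrix W
disc = nn.Sequential(nn.Linear(d, 2048), nn.LeakyReLU(0.2),
                     nn.Linear(2048, 1))                   # origin classifier
bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.SGD(disc.parameters(), lr=0.1)
opt_m = torch.optim.SGD(mapping.parameters(), lr=0.1)

for step in range(1000):
    x = src_emb[torch.randint(0, len(src_emb), (bs,))]
    y = tgt_emb[torch.randint(0, len(tgt_emb), (bs,))]
    # 1) discriminator step: mapped source = 1, real target = 0
    d_loss = bce(disc(mapping(x).detach()), torch.ones(bs, 1)) + \
             bce(disc(y), torch.zeros(bs, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # 2) mapping step: train W so mapped source looks like target
    m_loss = bce(disc(mapping(x)), torch.zeros(bs, 1))
    opt_m.zero_grad(); m_loss.backward(); opt_m.step()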

By default, the validation metric is the mean cosine similarity of word pairs from a synthetic dictionary built with CSLS (Cross-domain Similarity Local Scaling). For some language pairs (e.g. En-Zh), we recommend centering the embeddings with --normalize_embeddings center.
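CSLS penalizes similarities in dense "hub" regions of the space: CSLS(x, y) = 2 cos(x, y) - r(x) - r(y), where r(.) is the mean cosine of a word with its K nearest neighbors on the other side (K = 10 in the paper). A numpy sketch, assuming row-normalized matrices src (already mapped into the target space) and tgt:

import numpy as np

def csls(src, tgt, k=10):
    # src: (n_src, d), tgt: (n_tgt, d), both row-normalized
    sim = src @ tgt.T                                   # pairwise cosines
    r_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # mean cosine of each source
                                                        # word with its k NNs in tgt
    r_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # and symmetrically for tgt
    return 2 * sim - r_src[:, None] - r_tgt[None, :]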

Evaluate monolingual or cross-lingual embeddings (CPU|GPU)

We also include a simple script to evaluate the quality of monolingual or cross-lingual word embeddings on several tasks:

Monolingual

python evaluate.py --src_lang en --src_emb data/wiki.en.vec --max_vocab 200000
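Monolingual evaluation includes word-similarity benchmarks, which report the Spearman correlation between human judgments and cosine similarities. A hedged sketch of that computation, assuming a word-to-vector dict emb and a tab-separated benchmark file pairs.tsv (both hypothetical here):

import numpy as np
from scipy.stats import spearmanr

emb = {}  # word -> np.ndarray; fill with a loader like the one shown below

gold, pred = [], []
with open('pairs.tsv') as f:                 # hypothetical "w1<TAB>w2<TAB>score" file
    for line in f:
        w1, w2, score = line.rstrip('\n').split('\t')
        if w1 in emb and w2 in emb:
            gold.append(float(score))
            pred.append(emb[w1] @ emb[w2] /
                        (np.linalg.norm(emb[w1]) * np.linalg.norm(emb[w2])))
print('Spearman rho:', spearmanr(gold, pred).correlation)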

Cross-lingual

python evaluate.py --src_lang en --tgt_lang es --src_emb data/wiki.en-es.en.vec --tgt_emb data/wiki.en-es.es.vec --max_vocab 200000

Word embedding format

By default, the aligned embeddings are exported to a text format at the end of experiments: --export txt. Exporting embeddings to a text file can take a while if you have a lot of embeddings. For a very fast export, you can set --export pth to export the embeddings in a PyTorch binary file, or simply disable the export (--export "").

When loading embeddings, the model can load:

- PyTorch binary files previously generated by MUSE (.pth files)
- fastText binary files previously generated by fastText (.bin files)
- text files (one word embedding per line)

The first two options are very fast and can load 1 million embeddings in a few seconds, while loading text files can take a while.
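For reference, the text format is simple enough to read by hand. A minimal loader, assuming the usual layout of a "n_words dim" header line followed by one word and its values per line:

import numpy as np

def load_vec(path, max_vocab=200000):
    words, vecs = [], []
    with open(path, 'r', encoding='utf-8') as f:
        n_words, dim = map(int, f.readline().split())   # header line
        for i, line in enumerate(f):
            if i >= max_vocab:
                break
            tokens = line.rstrip().split(' ')
            words.append(tokens[0])
            vecs.append(np.array(tokens[1:], dtype=np.float32))
    return words, np.vstack(vecs)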

Download

We provide multilingual embeddings and ground-truth bilingual dictionaries. These embeddings are fastText embeddings that have been aligned in a common space.

Multilingual word Embeddings

We release fastText Wikipedia supervised word embeddings for 30 languages, aligned in a single vector space.

Arabic, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hebrew, Hungarian, Indonesian, Italian, Macedonian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian, Vietnamese - each available as a text-format (.vec) download.

You can visualize cross-lingual nearest neighbors using demo.ipynb.
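Outside the notebook, retrieval in the aligned space is a plain nearest-neighbor query. A small sketch, assuming word lists and row-normalized matrices read with a .vec loader like the one above (en_words/en_vecs and es_words/es_vecs are hypothetical names):

import numpy as np

def translate(word, src_words, src_vecs, tgt_words, tgt_vecs, k=5):
    x = src_vecs[src_words.index(word)]
    scores = tgt_vecs @ x                    # cosines, since rows are unit-norm
    best = np.argsort(-scores)[:k]
    return [(tgt_words[i], float(scores[i])) for i in best]

# e.g. translate('cat', en_words, en_vecs, es_words, es_vecs)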

Ground-truth bilingual dictionaries

We created 110 large-scale ground-truth bilingual dictionaries using an internal translation tool. The dictionaries handle the polysemy of words well. We provide a train and test split of 5000 and 1500 unique source words, as well as a larger set of up to 100k pairs. Our goal is to ease the development and evaluation of cross-lingual word embeddings and multilingual NLP.

European languages in every direction

Each cell gives the full dictionary and its train and test splits for that src-tgt pair:

src \ tgt    German            English           Spanish           French            Italian           Portuguese
German       -                 full train test   full train test   full train test   full train test   full train test
English      full train test   -                 full train test   full train test   full train test   full train test
Spanish      full train test   full train test   -                 full train test   full train test   full train test
French       full train test   full train test   full train test   -                 full train test   full train test
Italian      full train test   full train test   full train test   full train test   -                 full train test
Portuguese   full train test   full train test   full train test   full train test   full train test   -

Other languages to English (e.g. {fr,es}-en)

Afrikaans: full train test | Albanian: full train test | Arabic: full train test | Bengali: full train test
Bosnian: full train test | Bulgarian: full train test | Catalan: full train test | Chinese: full train test
Croatian: full train test | Czech: full train test | Danish: full train test | Dutch: full train test
Esperanto: full train test | Estonian: full train test | Filipino: full train test | Finnish: full train test
French: full train test | German: full train test | Greek: full train test | Hebrew: full train test
Hindi: full train test | Hungarian: full train test | Indonesian: full train test | Italian: full train test
Japanese: full train test | Korean: full train test | Latvian: full train test | Lithuanian: full train test
Macedonian: full train test | Malay: full train test | Norwegian: full train test | Persian: full train test
Polish: full train test | Portuguese: full train test | Romanian: full train test | Russian: full train test
Slovak: full train test | Slovenian: full train test | Spanish: full train test | Swedish: full train test
Tamil: full train test | Thai: full train test | Turkish: full train test | Ukrainian: full train test
Vietnamese: full train test

English to other languages (e.g. en-{fr,es})

Afrikaans: full train test | Albanian: full train test | Arabic: full train test | Bengali: full train test
Bosnian: full train test | Bulgarian: full train test | Catalan: full train test | Chinese: full train test
Croatian: full train test | Czech: full train test | Danish: full train test | Dutch: full train test
Esperanto: full train test | Estonian: full train test | Filipino: full train test | Finnish: full train test
French: full train test | German: full train test | Greek: full train test | Hebrew: full train test
Hindi: full train test | Hungarian: full train test | Indonesian: full train test | Italian: full train test
Japanese: full train test | Korean: full train test | Latvian: full train test | Lithuanian: full train test
Macedonian: full train test | Malay: full train test | Norwegian: full train test | Persian: full train test
Polish: full train test | Portuguese: full train test | Romanian: full train test | Russian: full train test
Slovak: full train test | Slovenian: full train test | Spanish: full train test | Swedish: full train test
Tamil: full train test | Thai: full train test | Turkish: full train test | Ukrainian: full train test
Vietnamese: full train test

References

Please cite [1] if you find the resources in this repository useful.

Word Translation Without Parallel Data

[1] A. Conneau*, G. Lample*, L. Denoyer, MA. Ranzato, H. Jégou, Word Translation Without Parallel Data

* Equal contribution. Order has been determined with a coin flip.

@article{conneau2017word,
  title={Word Translation Without Parallel Data},
  author={Conneau, Alexis and Lample, Guillaume and Ranzato, Marc'Aurelio and Denoyer, Ludovic and J{\'e}gou, Herv{\'e}},
  journal={arXiv preprint arXiv:1710.04087},
  year={2017}
}

MUSE is the project that originated the work on unsupervised machine translation with monolingual data only [2].

Unsupervised Machine Translation Using Monolingual Corpora Only

[2] G. Lample, A. Conneau, L. Denoyer, MA. Ranzato, Unsupervised Machine Translation Using Monolingual Corpora Only

@article{lample2017unsupervised,
  title={Unsupervised Machine Translation Using Monolingual Corpora Only},
  author={Lample, Guillaume and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
  journal={arXiv preprint arXiv:1711.00043},
  year={2017}
}

Contact: gl@fb.com aconneau@fb.com