laserembeddings – test data

This repository contains the script used to generate the data required for testing laserembeddings, a pip-packaged, production-ready port of Facebook Research's LASER (Language-Agnostic SEntence Representations) to compute multilingual sentence embeddings.

The data contains a corpus of multilingual sentences with their embeddings computed with Facebook's original LASER implementation. It is used in laserembeddings' test suite to make sure the embeddings computed with laserembeddings match the ones obtained with Facebook's original implementation.
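As a rough illustration (not the actual test code of laserembeddings), "matching" two embeddings usually means checking element-wise closeness within a tolerance, in the spirit of numpy.allclose. The function name and vectors below are purely illustrative:

```python
# Sketch: compare two embedding vectors element-wise within tolerances,
# similar in spirit to numpy.allclose. All names here are hypothetical,
# not taken from laserembeddings' real test suite.

def embeddings_match(a, b, rtol=1e-5, atol=1e-8):
    """Return True if vectors a and b agree within the given tolerances."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

reference = [0.12, -0.57, 0.33]  # illustrative values only
computed = [0.12, -0.57, 0.33]
print(embeddings_match(reference, computed))  # expected: True
```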

Usage

First, install LASER. Make sure MeCab is also installed and configured to support UTF-8.

Then export the path to LASER's installation directory (i.e., the directory where you cloned LASER's repository):

export LASER=/path/to/laser
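As an optional sanity check (a sketch, not part of the official instructions), you can verify that the variable points at an existing directory before running the generation script. The helper name below is hypothetical:

```shell
# Hypothetical helper: returns 0 if the given path is a non-empty string
# and an existing directory.
check_laser_dir() {
  [ -n "$1" ] && [ -d "$1" ]
}

if check_laser_dir "${LASER:-}"; then
  echo "LASER checkout found at $LASER"
else
  echo "LASER is unset or does not point to a directory" >&2
fi
```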

Install additional dependencies (iso639 and yapf):

pip install iso639 yapf

Then generate the data:

python generate-laserembeddings-test-data.py

The test data (laserembeddings-test-data.npz) is placed in the test-data directory. For ease of distribution, a .tar.gz archive is also created, containing the .npz file along with the test data's README and LICENSE files.

The test data is generated from a version of the Tatoeba corpus refined by Facebook Research, located in LASER's installation directory ($LASER/data/tatoeba/v1). See below for more information.

Test data description & license

Check out the README of the test-data directory to know more about the contents of the generated files, their license and the data they are based on.