Cross-lingual word embeddings from Twitter

This repository contains the pre-trained monolingual and cross-lingual word embeddings from the paper Learning Cross-lingual Embeddings from Twitter via Distant Supervision.

Twitter pre-trained word embeddings

We release the 100-dimensional monolingual and cross-lingual word embeddings, trained on Twitter, that were used in our experiments (English, Spanish, Italian, German and Farsi):

Monolingual FastText embeddings: Available here
Cross-lingual embeddings post-processed with plain averaging: Available here
Cross-lingual embeddings post-processed with weighted averaging: Available here

Update: Embeddings for Finnish and Japanese are now also available!
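
As a rough illustration of how the released vectors might be used, the sketch below parses an embedding file and compares two words by cosine similarity. It assumes the files are in the word2vec-style text format commonly written by FastText and VecMap (an optional "count dimension" header, then one word and its values per line); this README does not state the format, so check the downloaded files first. The function names and the toy data are hypothetical.

```python
import io
import math


def load_text_embeddings(handle, expected_dim=100):
    """Read word vectors from an open text handle.

    Assumption: word2vec-style text format. Any line that does not carry
    exactly `expected_dim` values (such as the header) is skipped.
    """
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != expected_dim:
            continue  # header or malformed line
        vectors[word] = [float(v) for v in values]
    return vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 3-dimensional data standing in for a real 100-dimensional file.
toy_file = io.StringIO("2 3\nhello 1.0 0.0 0.0\nhola 1.0 0.0 0.0\n")
toy_vectors = load_text_embeddings(toy_file, expected_dim=3)
toy_similarity = cosine(toy_vectors["hello"], toy_vectors["hola"])
```

With a real file, the handle would come from something like `open(path, encoding="utf-8")`. For the cross-lingual embeddings, words from both languages live in the same space, so their vectors can be compared directly in this way.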

Note 1: All words are lowercased.

Note 2: All emoji have been unified into a single neutral encoding across languages (no skin tone modifiers). All Twitter user mentions have been anonymized by replacing them with @user.
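
To look up your own text in these embeddings, it likely helps to normalize it in the same spirit as the two notes above. The following is a minimal sketch, not the preprocessing pipeline used in the paper: it lowercases text, replaces user mentions with @user, and drops the Unicode skin tone modifiers (U+1F3FB to U+1F3FF). The function name and the mention pattern are assumptions, and any further emoji unification the authors applied is not covered.

```python
import re

# Assumption: a mention is "@" followed by word characters, not preceded by
# a word character (so e-mail addresses are left alone).
MENTION = re.compile(r"(?<!\w)@\w+")

# Unicode emoji skin tone modifiers (Fitzpatrick types 1-2 through 6).
SKIN_TONE = re.compile("[\U0001F3FB-\U0001F3FF]")


def normalize_tweet(text):
    """Approximate the normalization described in Notes 1 and 2."""
    text = SKIN_TONE.sub("", text)      # neutral emoji, no skin tone
    text = MENTION.sub("@user", text)   # anonymize user mentions
    return text.lower()                 # all words are lowercased


example = normalize_tweet("Thanks @Some_User \U0001F44D\U0001F3FD")
```

Tokens produced this way should match the released vocabulary more often than raw tweet text, though tokenization itself is left out of this sketch.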

Reference paper

If you use any of these resources, please cite the following paper:

@inproceedings{xlingtwitter2020icwsm,
  author = 	"Camacho-Collados, Jose and Doval, Yerai and Mart\'{i}nez-C\'{a}mara, Eugenio and Espinosa-Anke, Luis and Barbieri, Francesco and Schockaert, Steven",
  title = 	"Learning Cross-lingual Embeddings from Twitter via Distant Supervision",
  booktitle = 	"Proceedings of ICWSM",
  location = 	"Atlanta, United States",
  year = 	"2020"
}

If you use FastText or VecMap, please also cite their corresponding papers.