Cross-lingual word embeddings from Twitter
This repository contains the pre-trained monolingual and cross-lingual word embeddings from the paper "Learning Cross-lingual Embeddings from Twitter via Distant Supervision".
Twitter pre-trained word embeddings
We release the 100-dimensional monolingual and cross-lingual word embeddings, trained on Twitter, that we used in our experiments (English, Spanish, Italian, German and Farsi):
- Monolingual FastText embeddings: Available here
- Cross-lingual embeddings post-processed with plain averaging: Available here
- Cross-lingual embeddings post-processed with weighted averaging: Available here
Update: Embeddings for Finnish and Japanese are now also available!
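As a rough illustration of how the released vectors might be used, the sketch below reads embeddings and compares two words with cosine similarity. It assumes the files use the plain-text word2vec/FastText `.vec` layout (an optional `count dimension` header, then one word and its space-separated values per line); this README does not state the file format, so check the downloaded files first. The names `load_vectors` and `cosine` are hypothetical helpers, not part of this repository.

```python
import math
from typing import Dict, Iterable, List


def load_vectors(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse embeddings from text lines (ASSUMED .vec layout, see above)."""
    vectors: Dict[str, List[float]] = {}
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # Skip the optional "<vocab_size> <dimension>" header line.
        if i == 0 and len(parts) == 2:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

With a real file, one could pass an open file handle, for example `load_vectors(open(path, encoding="utf-8"))`, where `path` points to one of the downloaded embedding files. For the cross-lingual embeddings, words from two languages that share a space can then be compared directly with `cosine`.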
Note 1: All words are lowercased.
Note 2: All emoji have been unified into a single neutral encoding across languages (no skin tone modifiers). All Twitter user mentions have been replaced with @user.
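To look up words in these embeddings, input text probably needs the same normalization as described in the notes above. The following is a minimal sketch of one way to approximate it: lowercasing, replacing user mentions with @user, and stripping the Unicode skin tone modifiers (U+1F3FB to U+1F3FF). The exact preprocessing used by the authors is not documented here, so the function name `normalize_tweet` and the regular expressions are assumptions, and the original pipeline may differ in its details.

```python
import re

# Emoji skin tone modifiers (Fitzpatrick types 1-2 to 6).
SKIN_TONE_MODIFIERS = re.compile("[\U0001F3FB-\U0001F3FF]")
# A mention: "@" not preceded by a word character, followed by word characters.
# ASSUMPTION: this approximates Twitter handles; it is not the authors' rule.
MENTION = re.compile(r"(?<!\w)@\w+")


def normalize_tweet(text: str) -> str:
    """Lowercase, anonymize mentions, and drop skin tone modifiers."""
    text = text.lower()
    text = MENTION.sub("@user", text)
    return SKIN_TONE_MODIFIERS.sub("", text)


print(normalize_tweet("Hola @Ana_99!"))  # hola @user!
```

The lookbehind in `MENTION` keeps e-mail-like strings such as `name@example.com` from being treated as mentions.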
Reference paper
If you use any of these resources, please cite the following paper:
@inproceedings{xlingtwitter2020icwsm,
author = "Camacho-Collados, Jose and Doval, Yerai and Mart\'{i}nez-C\'{a}mara, Eugenio and Espinosa-Anke, Luis and Barbieri, Francesco and Schockaert, Steven",
title = "Learning Cross-lingual Embeddings from Twitter via Distant Supervision",
booktitle = "Proceedings of ICWSM",
location = "Atlanta, United States",
year = "2020"
}
If you use FastText or VecMap, please also cite their corresponding papers.