Spanish Word Embeddings
Spanish word embeddings computed with fastText on the Spanish Unannotated Corpora.
Pre-Processing
The data had already been preprocessed in the Spanish Unannotated Corpora: text was lowercased, multiple spaces were collapsed, and URLs were removed, among other steps. We also applied the punctuation-splitting script included in that repository.
With that tokenization, the 2.6B-word corpus became 3.4B tokens.
For new L, we used the updated version of the Spanish Unannotated Corpora, which has 3B words, and applied the same preprocessing as for the other models.
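The preprocessing steps described above can be sketched roughly as follows. This is not the script from the Spanish Unannotated Corpora repository: the function name `preprocess` and the regular expressions are assumptions made for illustration, and the real script may treat URLs and punctuation differently.

```python
import re

# Hypothetical patterns; the original preprocessing script may use different rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_RE = re.compile(r"([^\w\s])")   # any single character that is neither word nor space
SPACES_RE = re.compile(r"\s+")


def preprocess(line: str) -> str:
    """Lowercase, drop URLs, split on punctuation, and collapse repeated spaces."""
    line = line.lower()
    line = URL_RE.sub(" ", line)
    # Surround each punctuation mark with spaces so it becomes its own token.
    line = PUNCT_RE.sub(r" \1 ", line)
    line = SPACES_RE.sub(" ", line)
    return line.strip()


example = preprocess("Hola,  Mundo! Visita https://example.com ahora.")
tokens = example.split()
```

Splitting on punctuation is why the token count is higher than the word count: in the example, four words produce seven tokens once the comma, exclamation mark, and period are separated.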
fastText Parameters
We used fastText's default parameters for the skipgram model, except for the number of epochs, which we set to 20 instead of 5.
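A minimal sketch of an equivalent training call with the `fasttext` Python bindings is shown below. The original models may have been trained with the command-line tool instead; the helper name `train_embeddings` and the file paths are hypothetical.

```python
# Only the settings named in this README are passed explicitly;
# everything else is left at fastText's defaults.
FASTTEXT_PARAMS = {
    "model": "skipgram",
    "epoch": 20,  # fastText's default is 5
}


def train_embeddings(corpus_path: str, output_path: str) -> None:
    """Train skipgram embeddings on a preprocessed corpus and save the model.

    `corpus_path` and `output_path` are placeholder names for illustration.
    """
    import fasttext  # imported lazily so this module loads without the package

    model = fasttext.train_unsupervised(input=corpus_path, **FASTTEXT_PARAMS)
    model.save_model(output_path)


# Example call (paths are hypothetical):
# train_embeddings("spanish_corpus.txt", "embeddings.bin")
```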
Evaluation
We evaluated our word embeddings on SemEval-2017 Task 2 (Subtask 1) using the script provided by the MUSE library, with these results:
| | XS | S | M | L | new L |
|---|---|---|---|---|---|
| Score | 0.59150 | 0.67589 | 0.72345 | 0.74676 | 0.72940 |
To our knowledge, the L embedding model was the best one available for Spanish at the date of publication.
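To give an idea of what a word similarity score of this kind measures, here is a small sketch. It assumes the score is a Spearman rank correlation between the cosine similarity of each word pair and its human rating; the actual MUSE script may differ in details such as out-of-vocabulary handling. The vectors and ratings below are toy values, not SemEval data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(vectors, pairs):
    """Score embeddings on (word1, word2, human_rating) triples.

    Pairs with a word missing from `vectors` are skipped (an assumption).
    """
    predicted, gold = [], []
    for w1, w2, rating in pairs:
        if w1 in vectors and w2 in vectors:
            predicted.append(cosine(vectors[w1], vectors[w2]))
            gold.append(rating)
    return spearman(predicted, gold)


# Toy 2-dimensional vectors and ratings, for illustration only.
toy_vectors = {
    "perro": [1.0, 0.0],
    "gato": [0.9, 0.1],
    "coche": [0.0, 1.0],
    "auto": [0.05, 0.95],
}
toy_pairs = [("perro", "gato", 3.5), ("coche", "auto", 3.9), ("perro", "coche", 0.5)]
toy_score = evaluate(toy_vectors, toy_pairs)
```

Because only the ranking of the pairs matters, a model scores well when it orders word pairs by similarity the same way human annotators do, regardless of the absolute cosine values.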
Download
Reference
Enriching Word Vectors with Subword Information
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information