Spanish Sentence Embeddings
Spanish Sentence Embeddings trained using sent2vec on the Spanish Unannotated Corpora.
Pre-Processing
The training data had already been preprocessed in Spanish Unannotated Corpora: it was lowercased, runs of multiple spaces were collapsed, and URLs and other noise were removed. We also applied the punctuation-splitting script included in that repository.
Under that tokenization, the 2.6B-word corpus expands to 3.4B tokens (about 1.3 tokens per word).
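As a rough illustration of the steps described above, the following Python sketch lowercases text, drops URLs, splits punctuation into separate tokens, and collapses whitespace. It is not the script from Spanish Unannotated Corpora: the function names and regular expressions are assumptions chosen for this example, and the original script may treat edge cases differently.

```python
import re

# Hypothetical patterns for illustration; the original script may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_RE = re.compile(r"([^\w\s])")
SPACE_RE = re.compile(r"\s+")


def preprocess(text):
    """Lowercase, remove URLs, split on punctuation, collapse spaces."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # drop URLs
    text = PUNCT_RE.sub(r" \1 ", text)  # surround each punctuation mark with spaces
    text = SPACE_RE.sub(" ", text).strip()
    return text


def count_tokens(text):
    """Count whitespace-separated tokens after preprocessing."""
    return len(preprocess(text).split())
```

Because each punctuation mark becomes its own token, the token count ends up higher than the word count, which is consistent with the corpus growing from 2.6B words to 3.4B tokens.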
sent2vec Parameters
We used the default sent2vec parameters to train a unigram + bigram model.
Download
Spanish sent2vec (700 dim sentence embeddings, unigram+bigram model, 14.4 GB)
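Once downloaded, the model can presumably be loaded through the Python bindings that ship with sent2vec. The sketch below assumes the `sent2vec` package is installed; the helper names and any file path passed to them are hypothetical, and the cosine similarity helper is only one common way to compare two embeddings.

```python
import math


def load_sent2vec_model(model_path):
    """Load a sent2vec binary model from model_path (path is supplied by the caller)."""
    # Imported lazily so this module can be imported without sent2vec installed.
    import sent2vec

    model = sent2vec.Sent2vecModel()
    model.load_model(model_path)
    return model


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)
```

For example, after `model = load_sent2vec_model(path)`, calling `model.embed_sentences([...])` on a list of sentences returns one embedding per sentence, and any two of them can be passed to `cosine_similarity`. Input sentences should be lowercased and punctuation-split in the same way as the training data. Note that the 14.4 GB model needs a comparable amount of memory to load.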
References
Matteo Pagliardini, Prakhar Gupta, Martin Jaggi. Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. NAACL 2018.