Pre-trained word vectors of 30+ languages

This project has two purposes. First, I'd like to share some of my experience with NLP tasks such as segmentation and word vectors. Second, and more importantly, some people are probably searching for pre-trained word vector models for non-English languages. Alas! English has received far more attention than any other language. Check this to see how easily you can get a variety of pre-trained English word vectors without effort. I think it's time to turn our eyes to a multilingual version of this.

<b>Near the end of the work, I learned that there is already a similar project named polyglot. I strongly encourage you to check out this great project. How embarrassing! Nevertheless, I decided to release this project. You will see that my work has its own flavor, after all.</b>

Requirements

Background / References

Work Flow

Pre-trained models

Two types of pre-trained models are provided. w and f represent word2vec and fastText respectively.

| Language | ISO 639-1 | Vector Size | Corpus Size | Vocabulary Size |
|---|---|---|---|---|
| Bengali (w) \| Bengali (f) | bn | 300 | 147M | 10059 |
| Catalan (w) \| Catalan (f) | ca | 300 | 967M | 50013 |
| Chinese (w) \| Chinese (f) | zh | 300 | 1G | 50101 |
| Danish (w) \| Danish (f) | da | 300 | 295M | 30134 |
| Dutch (w) \| Dutch (f) | nl | 300 | 1G | 50160 |
| Esperanto (w) \| Esperanto (f) | eo | 300 | 1G | 50597 |
| Finnish (w) \| Finnish (f) | fi | 300 | 467M | 30029 |
| French (w) \| French (f) | fr | 300 | 1G | 50130 |
| German (w) \| German (f) | de | 300 | 1G | 50006 |
| Hindi (w) \| Hindi (f) | hi | 300 | 323M | 30393 |
| Hungarian (w) \| Hungarian (f) | hu | 300 | 692M | 40122 |
| Indonesian (w) \| Indonesian (f) | id | 300 | 402M | 30048 |
| Italian (w) \| Italian (f) | it | 300 | 1G | 50031 |
| Japanese (w) \| Japanese (f) | ja | 300 | 1G | 50108 |
| Javanese (w) \| Javanese (f) | jv | 100 | 31M | 10019 |
| Korean (w) \| Korean (f) | ko | 200 | 339M | 30185 |
| Malay (w) \| Malay (f) | ms | 100 | 173M | 10010 |
| Norwegian (w) \| Norwegian (f) | no | 300 | 1G | 50209 |
| Norwegian Nynorsk (w) \| Norwegian Nynorsk (f) | nn | 100 | 114M | 10036 |
| Polish (w) \| Polish (f) | pl | 300 | 1G | 50035 |
| Portuguese (w) \| Portuguese (f) | pt | 300 | 1G | 50246 |
| Russian (w) \| Russian (f) | ru | 300 | 1G | 50102 |
| Spanish (w) \| Spanish (f) | es | 300 | 1G | 50003 |
| Swahili (w) \| Swahili (f) | sw | 100 | 24M | 10222 |
| Swedish (w) \| Swedish (f) | sv | 300 | 1G | 50052 |
| Tagalog (w) \| Tagalog (f) | tl | 100 | 38M | 10068 |
| Thai (w) \| Thai (f) | th | 300 | 696M | 30225 |
| Turkish (w) \| Turkish (f) | tr | 200 | 370M | 30036 |
| Vietnamese (w) \| Vietnamese (f) | vi | 100 | 74M | 10087 |
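
As a rough illustration of how a downloaded model might be queried, here is a minimal sketch in plain Python. It assumes the vectors are stored in the common plain-text `.vec` layout (a header line with vocabulary size and dimension, then one word and its floats per line); this page does not state the file format of the models, so treat that as an assumption. The helper names `load_vec`, `cosine`, and `most_similar` are hypothetical, and the three-word toy data stands in for a real file.

```python
import io
import math


def load_vec(stream):
    """Read word vectors from a text stream in the assumed .vec layout.

    Assumed layout: a header line "<vocab_size> <dim>", then one line per
    word consisting of the token followed by <dim> space-separated floats.
    """
    header = stream.readline().split()
    dim = int(header[1])  # header[0] is the vocabulary size (not needed here)
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, dim


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(vectors, word, topn=3):
    """Return the topn (word, similarity) pairs closest to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy stand-in for a real model file: 3 words, 2 dimensions.
toy = io.StringIO("3 2\ncat 1.0 0.0\ndog 0.9 0.1\ncar 0.0 1.0\n")
vectors, dim = load_vec(toy)
print(most_similar(vectors, "cat", topn=1))  # "dog" ranks first
```

With a real download, the `io.StringIO` stand-in would be replaced by something like `open(path, encoding="utf-8")`, where `path` points at the extracted file. For the full-size vocabularies in the table, a dedicated library would be a more practical choice than this pure-Python loop.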