# word2vec-api

Simple web service providing a word embedding API. The methods are based on the Gensim Word2Vec implementation. Models are passed as parameters and must be in the Word2Vec text or binary format. Updated to run on Python 3.
- Install Dependencies

```bash
pip install -r requirements.txt
```
- Launching the service

```bash
python word2vec-api.py --model path/to/the/model [--host host --port 1234]
```

or, for example:

```bash
python word2vec-api.py --model /path/to/GoogleNews-vectors-negative300.bin --binary BINARY --path /word2vec --host 0.0.0.0 --port 5000
```

where `BINARY` indicates whether the model file is in binary (rather than text) format.
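Internally the service is a thin wrapper around Gensim. A minimal sketch of the loading step and the calls the endpoints map onto, assuming Gensim's standard `KeyedVectors` loader:

```python
from gensim.models import KeyedVectors

# Load a model in Word2Vec text or binary format; binary=True corresponds
# to the --binary option above.
model = KeyedVectors.load_word2vec_format(
    "/path/to/GoogleNews-vectors-negative300.bin", binary=True
)

# The endpoints map onto Gensim calls such as these:
print(model.similarity("Sushi", "Japanese"))                     # /similarity
print(model.n_similarity(["Sushi", "Shop"],
                         ["Japanese", "Restaurant"]))            # /n_similarity
print(model.most_similar(positive=["indian", "food"], topn=10))  # /most_similar
```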
- Example calls

```bash
curl "http://127.0.0.1:5000/word2vec/n_similarity?ws1=Sushi&ws1=Shop&ws2=Japanese&ws2=Restaurant"
curl "http://127.0.0.1:5000/word2vec/similarity?w1=Sushi&w2=Japanese"
curl "http://127.0.0.1:5000/word2vec/most_similar?positive=indian&positive=food[&negative=][&topn=]"
curl "http://127.0.0.1:5000/word2vec/model?word=restaurant"
curl "http://127.0.0.1:5000/word2vec/model_word_set"
```

(The URLs are quoted so the shell does not interpret the `&` separators.)
Note: The "model" method returns a base64 encoding of the vector. "model_word_set" returns a base64 encoded pickle of the model's vocabulary.
## Where to get a pretrained model

If you do not have domain-specific data to train on, it can be convenient to use a pretrained model. Please feel free to submit additions to this list through a pull request.
| Model file | Number of dimensions | Corpus (size) | Vocabulary size | Author | Architecture | Training algorithm | Context window (size) | Web page |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Google News | 300 | Google News (100B) | 3M | Google | word2vec | negative sampling | BoW, ~5 | link |
| Freebase IDs | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW, ~10 | link |
| Freebase names | 1000 | Google News (100B) | 1.4M | Google | word2vec, skip-gram | ? | BoW, ~10 | link |
| Wikipedia+Gigaword 5 | 50 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Wikipedia+Gigaword 5 | 100 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Wikipedia+Gigaword 5 | 200 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Wikipedia+Gigaword 5 | 300 | Wikipedia+Gigaword 5 (6B) | 400,000 | GloVe | GloVe | AdaGrad | 10+10 | link |
| Common Crawl 42B | 300 | Common Crawl (42B) | 1.9M | GloVe | GloVe | AdaGrad | ? | link |
| Common Crawl 840B | 300 | Common Crawl (840B) | 2.2M | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 25 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 50 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 100 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Twitter (2B Tweets) | 200 | Twitter (27B) | ? | GloVe | GloVe | AdaGrad | ? | link |
| Wikipedia dependency | 300 | Wikipedia (?) | 174,015 | Levy & Goldberg | word2vec modified | word2vec | syntactic dependencies | link |
| DBPedia vectors (wiki2vec) | 1000 | Wikipedia (?) | ? | Idio | word2vec | word2vec, skip-gram | BoW, 10 | link |
| 60 Wikipedia embeddings with 4 kinds of context | 25, 50, 100, 250, 500 | Wikipedia | varies | Li, Liu et al. | Skip-Gram, CBOW, GloVe | original and modified | 2 | link |
| German Wikipedia+News | 300 | Wikipedia + Statmt News 2013 (1.1B) | 608,130 | Andreas Müller | word2vec | Skip-Gram | 5 | link |
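Note that the GloVe downloads above are plain GloVe text files, not Word2Vec format, so they need a one-time conversion before the service can load them. A minimal sketch using Gensim's converter, with `glove.6B.300d.txt` as a hypothetical local filename (on Gensim 4+ you can instead load GloVe files directly with `KeyedVectors.load_word2vec_format(..., no_header=True)`):

```python
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# GloVe text files lack the "<vocab_size> <dimensions>" header line that
# the Word2Vec text format requires; this prepends it.
glove2word2vec("glove.6B.300d.txt", "glove.6B.300d.w2v.txt")

# Sanity check: the converted file now loads with the Word2Vec loader,
# so it can be passed to the service via --model.
model = KeyedVectors.load_word2vec_format("glove.6B.300d.w2v.txt", binary=False)
print(model.most_similar("restaurant", topn=3))
```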