LexVec

This is an implementation of the LexVec word embedding model (similar to word2vec and GloVe) that achieves state-of-the-art results in multiple NLP tasks, as described in the papers listed under References.

Pre-trained Vectors

Subword LexVec (paper)

LexVec (paper 1, paper 2)

Evaluation: Subword LexVec

External memory, huge corpus

| Model | GSem | GSyn | MSR | RW | SimLex | SCWS | WS-Sim | WS-Rel | MEN | MTurk |
|---|---|---|---|---|---|---|---|---|---|---|
| LexVec | 72.6% | 73.8% | 73.2% | .539 | .477 | .687 | .809 | .696 | .814 | .717 |
| fastText | 75.0% | 72.1% | 71.8% | .522 | .424 | .673 | .810 | .724 | .805 | .717 |

Evaluation: LexVec

In-memory, large corpus

| Model | GSem | GSyn | MSR | RW | SimLex | SCWS | WS-Sim | WS-Rel | MEN | MTurk |
|---|---|---|---|---|---|---|---|---|---|---|
| LexVec, Word | 81.1% | 68.7% | 63.7% | .489 | .384 | .652 | .727 | .619 | .759 | .655 |
| LexVec, Word + Context | 79.3% | 62.6% | 56.4% | .476 | .362 | .629 | .734 | .663 | .772 | .649 |
| word2vec Skip-gram | 78.5% | 66.1% | 56.0% | .471 | .347 | .649 | .774 | .647 | .759 | .687 |

External memory, huge corpus

| Model | GSem | GSyn | MSR | RW | SimLex | SCWS | WS-Sim | WS-Rel | MEN | MTurk |
|---|---|---|---|---|---|---|---|---|---|---|
| LexVec, Word | 76.4% | 71.3% | 70.6% | .508 | .444 | .667 | .762 | .668 | .802 | .716 |
| LexVec, Word + Context | 80.4% | 66.6% | 65.1% | .496 | .419 | .644 | .775 | .702 | .813 | .712 |
| word2vec | 73.3% | 75.1% | 75.1% | .515 | .436 | .655 | .741 | .610 | .699 | .680 |
| GloVe | 81.8% | 72.4% | 74.3% | .384 | .374 | .540 | .698 | .571 | .743 | .645 |

Installation

  1. Install the Go compiler and clang.

  2. Make sure your $GOPATH is set.

  3. Execute the following commands in your terminal:

    $ go get github.com/alexandres/lexvec
    $ cd $GOPATH/src/github.com/alexandres/lexvec
    $ make
    

Usage

Training

In-memory

To get started, run $ scripts/demo.sh, which trains a model on the small text8 corpus (100MB of Wikipedia text).

Basic usage of LexVec is:

$ OUTPUT=dirwheretostorevectors scripts/im_lexvec.sh -corpus somecorpus

Run $ ./lexvec -h for a full list of options.

External Memory

By default, LexVec stores the sparse matrix being factorized in memory. This can be a problem if your training corpus is large and your system memory is limited. We suggest you first try the in-memory implementation. If you run into out-of-memory issues, use the external-memory variant with the -memory option, which specifies how many GBs of memory to use for the sort buffer.

$ OUTPUT=dirwheretostorevectors scripts/em_lexvec.sh -corpus somecorpus -memory 4. # ... exact same options as in-memory

Subword LexVec

Training

Subword information is controlled by the options -minn, -maxn, and -subword.

By default, the binary model used for computing OOV word vectors is saved to $OUTPUT/model.bin. Set -outputsub "" to disable saving this model.

<a name="oov"></a> Computing vectors for OOV words

Use the binary model to compute vectors for OOV words:

Note: You can also use these commands to get vectors for in-vocabulary words as the binary model stores the vocabulary used for training.

<a name="refs"></a> References

Alexandre Salle, Marco Idiart, and Aline Villavicencio. "Matrix Factorization using Window Sampling and Negative Sampling for Improved Word Representations." ACL (2016). (pdf)

Alexandre Salle, Marco Idiart, and Aline Villavicencio. "Enhancing the LexVec Distributed Word Representation Model Using Positional Contexts and External Memory." arXiv preprint arXiv:1606.01283 (2016). (pdf)

Alexandre Salle and Aline Villavicencio. "Incorporating Subword Information into Matrix Factorization Word Embeddings." Second Workshop on Subword and Character Level Models in NLP (2018). (pdf)

Alexandre Salle and Aline Villavicencio. "Why So Down? The Role of Negative (and Positive) Pointwise Mutual Information in Distributional Semantics." arXiv preprint arXiv:1908.06941 (2019). (pdf)

License

Copyright (c) 2016-2018 Alexandre Salle (alex@alexsalle.com). All work in this package is distributed under the MIT License.