ALaCarte
This repository contains code and transforms to induce your own rare-word/n-gram vectors, as well as evaluation code for the A La Carte Embedding paper. An overview of the method is given in a blog post on OffConvex.
If you find any of this code useful, please cite the following:
@inproceedings{khodak2018alacarte,
  title={A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors},
  author={Khodak, Mikhail and Saunshi, Nikunj and Liang, Yingyu and Ma, Tengyu and Stewart, Brandon and Arora, Sanjeev},
  booktitle={Proceedings of the ACL},
  year={2018}
}
Inducing your own à la carte vectors
The following steps induce vectors for rare words or n-grams in the same semantic space as existing GloVe embeddings. For rare words from the IMDB, PTB-WSJ, SST, and STS tasks, vectors induced using Common Crawl and Gigaword+Wikipedia are available at http://nlp.cs.princeton.edu/ALaCarte/vectors/induced/.
- Make a text file containing one word or space-delimited n-gram per line. These are the targets for which vectors are to be induced.
- Download source embedding files, which should have the format "word float ... float" on each line. GloVe embeddings can be downloaded from https://nlp.stanford.edu/projects/glove/. Then choose the appropriate transform in the <tt>transform</tt> directory.
- If using Common Crawl, download a file of WET paths (e.g., the paths file for the 2014 crawl) and pass it to the <tt>--paths</tt> argument of <tt>alacarte.py</tt>. Otherwise, pass one or more text files to the <tt>--corpus</tt> argument. A sketch of the induction computation itself appears after this list.
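Concretely, the induction step is a linear map applied to averaged context embeddings: the vector for a rare word w is A times the mean of the embeddings of w's context words. Below is a minimal numpy sketch of that computation; the file names and the <tt>.npy</tt> format of the transform are illustrative assumptions, not the repository's actual layout.

```python
import numpy as np

# Hypothetical file names: substitute the actual transform matrix
# (from the transform/ directory) and your source embedding file.
A = np.load("transform/example_transform.npy")  # d x d induction matrix (assumed .npy)

embeddings = {}
with open("glove.840B.300d.txt") as f:  # "word float ... float" per line
    for line in f:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

def induce(context_words, A, embeddings):
    """A la carte induction: average the embeddings of the context
    words, then map the average through the linear transform A."""
    ctx = [embeddings[w] for w in context_words if w in embeddings]
    if not ctx:
        raise ValueError("no context words with known embeddings")
    return A.dot(np.mean(ctx, axis=0))
```

For real corpora, <tt>alacarte.py</tt> handles the context collection and transform application end to end; the sketch above only shows what is computed for a single target.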
Dependencies:
Required: numpy
Optional: h5py (checkpointing), nltk (n-gram handling), cld2-cffi (checking that text is English), mpi4py (parallelizing with MPI), boto (downloading from Common Crawl)
For inducing vectors from Common Crawl on an AWS EC2 instance:
- Start an instance. A memory-optimized (<tt>r4.*</tt>) Linux instance works best.
- Download and execute <tt>install.sh</tt>.
- Upload your list of target words to the instance and run <tt>alacarte.py</tt>.
Evaluation code
- http://nlp.cs.princeton.edu/ALaCarte (GloVe Vectors)
- http://nlp.cs.princeton.edu/CRW (CRW Dataset)
Note that the code in this directory treats summing the embeddings of all context words in a corpus as a matrix operation. This is memory-intensive; more practical implementations should compute context vectors by simple vector addition, as sketched below.
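A more memory-friendly alternative is to stream over the corpus and accumulate each target's context sum with in-place vector addition, never materializing a target-by-vocabulary cooccurrence matrix. The sketch below (function and parameter names are hypothetical) illustrates the idea.

```python
import numpy as np

def context_vectors(corpus_lines, targets, embeddings, dim, window=5):
    """Accumulate averaged context embeddings for each target by simple
    vector addition, one token at a time, instead of building a full
    cooccurrence matrix."""
    sums = {t: np.zeros(dim, dtype=np.float32) for t in targets}
    counts = {t: 0 for t in targets}
    for line in corpus_lines:
        tokens = line.split()
        for i, tok in enumerate(tokens):
            if tok not in sums:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                vec = embeddings.get(tokens[j])
                if vec is not None:
                    sums[tok] += vec
                    counts[tok] += 1
    # Average each target's accumulated context sum over its context tokens.
    return {t: sums[t] / counts[t] for t in targets if counts[t] > 0}
```

The resulting averaged context vectors would then be mapped through the à la carte transform, as in the induction sketch earlier in this README.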
Dependencies: nltk, numpy, scipy, text_embedding
Optional: mpi4py (to parallelize cooccurrence matrix construction)