SemanticVector

This is the code for the paper "A Latent Variable Model Approach to PMI-based Word Embeddings" and the paper "Linear Algebraic Structure of Word Senses, with Applications to Polysemy".

Get Started

After cloning the code, run

./setup.sh
cd examples
./demo_vector.sh
./demo_dictionary.sh
./demo_window.sh

The script setup.sh downloads the data and external tools that the code needs. Each of the three demo scripts exercises one part of the code; details follow.

Directories:

  1. data: contains some example test sets
  2. examples: contains the demo scripts (demo_vector.sh, demo_dictionary.sh, demo_window.sh) that show how to use the code
  3. src: contains the code. It has sub-directories: vector, dictionary, topic. These are explained below.

src/vector

Usage

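# Install GloVe, whose tools compute the vocabulary and the co-occurrence
mkdir external_tools
cd external_tools
git clone https://github.com/stanfordnlp/GloVe
make -C GloVe
# Preprocess the raw Wikipedia corpus into plain text
perl wikifil.pl enwiki_raw_corpus > enwiki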

An example preprocessed small corpus text8 is downloaded for the demo in setup.sh.

./demo_vector.sh

The script builds the programs, constructs the vocabulary, computes and shuffles the co-occurrence, and finally constructs the word vectors using the algorithm in our paper. The code for computing the vocabulary and the co-occurrence is borrowed from GloVe. The constructed vocabulary is saved in vector_result/text8_vocab.txt, and the constructed vectors are saved in vector_result/text8_rw_vectors.bin.
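The vocabulary and co-occurrence steps can be sketched with the GloVe tools alone. This is a minimal sketch, not the contents of demo_vector.sh: the function name build_cooccurrence, the GLOVE_BUILD variable, the co-occurrence file names, and the parameter values (minimum count, window size, memory) are assumptions chosen for illustration. The final randwalk step is left out because its options are not listed here; run ./randwalk to see them.

```shell
#!/bin/sh
# Sketch: vocabulary and shuffled co-occurrence for a preprocessed corpus,
# using the GloVe tools. Names and parameter values are illustrative only.

# Directory holding the GloVe binaries (assumed: GloVe's default build dir).
GLOVE_BUILD=${GLOVE_BUILD:-external_tools/GloVe/build}

build_cooccurrence() {
    corpus=$1      # preprocessed corpus, e.g. text8
    outdir=$2      # output directory, e.g. vector_result
    name=$(basename "$corpus")

    mkdir -p "$outdir" || return 1

    # 1. Count words; keep those seen at least 5 times (assumed threshold).
    "$GLOVE_BUILD/vocab_count" -min-count 5 -verbose 0 \
        < "$corpus" > "$outdir/${name}_vocab.txt" || return 1

    # 2. Accumulate co-occurrence counts in a 10-word window (assumed size).
    "$GLOVE_BUILD/cooccur" -vocab-file "$outdir/${name}_vocab.txt" \
        -window-size 10 -memory 4.0 -verbose 0 \
        < "$corpus" > "$outdir/${name}_cooccurrence.bin" || return 1

    # 3. Shuffle the co-occurrence records.
    "$GLOVE_BUILD/shuffle" -memory 4.0 -verbose 0 \
        < "$outdir/${name}_cooccurrence.bin" \
        > "$outdir/${name}_cooccurrence.shuf.bin" || return 1
}
```

For example, `build_cooccurrence text8 vector_result` would write vector_result/text8_vocab.txt and the two co-occurrence files next to it.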

To load the vocabulary and the vectors in Matlab:

[vocab, vectors] = read_vocab_vectors(vocab_file, vector_file, vector_size);

More info

Run ./randwalk in the directory vector to get help information about its options. Similarly, for the GloVe package, run ./vocab_count (or ./cooccur, ./shuffle) to get help about its options. Frequently used options:

src/dictionary

Usage

./demo_dictionary.sh

The script runs learn_rw_dictionary.m in Matlab.

Note: the word vectors must be constructed first; see src/vector above.
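A small guard can catch a missing prerequisite before Matlab starts. The following is a sketch only: the function name check_files is hypothetical and not part of the repository, and which vocabulary and vector files the dictionary script actually reads depends on the parameters set in that script.

```shell
#!/bin/sh
# Sketch: succeed only if every file given as an argument exists and is non-empty.
check_files() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f (construct the word vectors first)" >&2
            return 1
        fi
    done
}
```

With the output of the vector demo, this could be used as `check_files vector_result/text8_vocab.txt vector_result/text8_rw_vectors.bin && ./demo_dictionary.sh`.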

Read and change the parameters in lines 11 to 19 of the script to suit your needs. (The current parameters are for the Wikipedia corpus, which consists of about 3G tokens.)

The constructed dictionary will be saved in mat format in dictionary_result. The mat file contains the following variables:

src/topic

Usage

./demo_window.sh

This script downloads the needed data (about 500MB) and runs learn_window_dictionary.m. It computes window vectors (each window vector is the weighted average of the word vectors in a paragraph), and computes a dictionary on these window vectors. The atoms in this dictionary can be viewed as topic vectors.
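Written out, a weighted average of word vectors has the following general form. This is a sketch for orientation: the symbols are introduced here for illustration, and the actual choice of weights (and whether they are normalized as shown) is determined by the code in src/topic, not by this formula.

```latex
v_{\mathrm{window}} \;=\; \frac{\sum_{i=1}^{n} \alpha_i \, v_{w_i}}{\sum_{i=1}^{n} \alpha_i}
```

Here w_1, ..., w_n are the words in the window, v_{w_i} is the word vector of w_i, and alpha_i >= 0 is the weight assigned to that word.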

See the README file in src/topic/ for more information.