Awesome

word2mat

Word2Mat is a framework that learns sentence embeddings in a CBOW-word2vec style, but where the words and sentences are represented as matrices. Details of this method and results can be found in our ICLR paper.

Dependencies

Python3
PyTorch >= 0.4 with CUDA support
NLTK >= 3

Setup python3 environment

Please install the python3 dependencies in your environment:

virtualenv -p python3 venv && source venv/bin/activate
pip install -r requirements.txt
python3 -c "import nltk; nltk.download('punkt')"

Download training data

In order to reproduce the results from our paper, which were trained on the UMBC corpus, download the UMBC corpus, extract the tar.gz file, and run the extract_umbc.py script in the following way:

python extract_umbc.py umbc_corpus/webbase_all <path_to_store_sentences>

This stores the sentences from the UMBC corpus in a format that is usable by our code: Each line in the resulting file contains a single sentence, whose (already pre-processed) tokens are separated by a whitespace character.

Running the experiments

Note: After further experiments, we observed that terminating training based on the validation loss produces unreliable results because of relatively high variance in the validation loss. Hence, we recommend using training loss as stopping criterion, which is more stable.

The results below are trained with this stopping criterion, and therefore slightly differ from the results reported in the ICLR paper. However, the conclusions remain the same: CMOW is much better than CBOW at capturing linguistic properties except WordContent. Therefore, CBOW is superior in almost all downstream tasks except TREC. The Hybrid model retains the capabilities of both models and therefore is extremely close to the better model among CBOW and CMOW, or better on all tasks.

Probing tasks: All scores denote accuracy.

Model	Depth	BigramShift	SubjNumber	Tense	CoordinationInversion	Length	ObjNumber	TopConstituents	OddManOut	WordContent
CBOW	32.73	49.65	79.65	79.46	53.78	75.69	79.00	72.26	49.64	89.11
CMOW	34.40	72.44	82.08	80.32	62.05	82.93	79.70	74.25	51.33	65.15
Hybrid	35.38	71.22	81.45	80.83	59.17	87.00	79.37	72.88	50.53	86.97

Supervised downstream tasks: For STS-Benchmark and Sick-Relatedness, the results denote Spearman correlation coefficient. For all others the score denotes accuracy.

Model	SNLI	SUBJ	CR	MR	MPQA	TREC	SICKEntailment	SST2	SST5	MRPC	STSBenchmark	SICKRelatedness
CBOW	67.76	90.45	79.76	74.32	87.23	84.4	79.58	78.14	41.72	72.17	0.619	0.721
CMOW	64.77	87.11	74.60	71.42	87.55	88.0	76.90	76.77	40.18	70.61	0.576	0.705
Hybrid	67.59	90.26	79.60	74.10	87.38	89.2	78.69	77.87	41.58	71.94	0.613	0.718

Unsupervised downstream tasks: The score denotes Spearman correlation coefficient.

Model	STS12	STS13	STS14	STS15	STS16
CBOW	0.458	0.497	0.556	0.637	0.630
CMOW	0.432	0.334	0.403	0.471	0.529
Hybrid	0.472	0.476	0.530	0.621	0.613

Train CBOW, CMOW, and CBOW-CMOW hybrid model

To train a 784-dimensional CBOW model, run the following:

python train_cbow.py --w2m_type cbow --batch_size=1024 --outputdir=<path_to_save_model> --optimizer adam,lr=0.0003 --max_words=30000 --n_epochs=1000 --n_negs=20 --validation_frequency=1000 --mode=random --num_samples_per_item=30 --patience 10 --downstream_eval full --outputmodelname mode w2m_type word_emb_dim --validation_fraction=0.0001 --context_size=5 --word_emb_dim 784 --temp_path <some_directory_for_temp_files> --dataset_path=<path_to_parsed_UMBC_dataset> --num_workers 2 --output_file <path_to_output.csv> --num_docs 134442680 --stop_criterion train_loss

For CMOW:

python train_cbow.py --w2m_type cmow --batch_size=1024 --outputdir=<path_to_save_model> --optimizer adam,lr=0.0003 --max_words=30000 --n_epochs=1000 --n_negs=20 --validation_frequency=1000 --mode=random --num_samples_per_item=30 --patience 10 --downstream_eval full --outputmodelname mode w2m_type word_emb_dim --validation_fraction=0.0001 --context_size=5 --word_emb_dim 784 --temp_path <some_directory_for_temp_files> --dataset_path=<path_to_parsed_UMBC_dataset> --num_workers 2 --output_file <path_to_output.csv> --num_docs 134442680 --stop_criterion train_loss --initialization identity

And the CBOW-CMOW Hybrid:

python train_cbow.py --w2m_type hybrid --batch_size=1024 --outputdir=<path_to_save_model> --optimizer adam,lr=0.0003 --max_words=30000 --n_epochs=1000 --n_negs=20 --validation_frequency=1000 --mode=random --num_samples_per_item=30 --patience 10 --downstream_eval full --outputmodelname mode w2m_type word_emb_dim --validation_fraction=0.0001 --context_size=5 --word_emb_dim 400 --temp_path <some_directory_for_temp_files> --dataset_path=<path_to_parsed_UMBC_dataset> --num_workers 2 --output_file <path_to_output.csv> --num_docs 134442680 --stop_criterion train_loss --initialization identity

Evaluate components of hybrid model

In the paper, we have shown that the jointly training of the individual CBOW/CMOW components emphasizes their individual strengths. To assess the performance of the CBOW component, restrict the final embedding representation to include only the first half of the representations from the HybridEncoder (--included_features 0 400 in a 800-dimensional Hybrid encoder), or restrict it to the second half (--included features 400 800) to evaluate the CMOW component. E.g, for evaluating the CMOW component, run:

python evaluate_word2mat.py --encoders <path_to_hybrid.encoder_file> --word_vocab <path_to_.vocab_file> --included_features 400 800 --outputdir <temp_path_to_save_encoder> --outputmodelname hybrid_constituent --downstream_eval full

Here, 'encoder' and 'word_vocab' is saved in 'outputdir' after training the models. By

Files

train_cbow.py Main training executable. Type python train_cbow.py --help to get overview of training parameters.
cbow.py Contains the data preparation code as well as the neural architecture for CBOW except the encoder.
word2mat.py The code for word2mat encoder.
wrap_evaluation.py Wrapper script for SentEval to automatically evaluate encoder after training.
evaluate_word2mat.py Script for evaluating sub-components of hybrid encoder with SentEval.
mutils.py Helpers for saving the results, hyperparameter optimization and stuff.

Reference

Please cite our ICLR paper [1] to reference our work or code.

CBOW Is Not All You Need: Combining CBOW with the Compositional Matrix Space Model (ICLR 2019)

[1] Mai, F., Galke, L & Scherp, A., CBOW Is Not All You Need: Combining CBOW with the Compositional Matrix Space Model

@inproceedings{mai2018cbow,
title={{CBOW} Is Not All You Need: Combining {CBOW} with the Compositional Matrix Space Model},
author={Florian Mai and Lukas Galke and Ansgar Scherp},
booktitle={International Conference on Learning Representations},
year={2019},
url={https://openreview.net/forum?id=H1MgjoR9tQ},
}