Home

Awesome

SentEval: evaluation toolkit for sentence embeddings

SentEval is a library for evaluating the quality of sentence embeddings. We assess their generalization power by using them as features on a broad and diverse set of "transfer" tasks. SentEval currently includes 17 downstream tasks. We also include a suite of 10 probing tasks which evaluate what linguistic properties are encoded in sentence embeddings. Our goal is to ease the study and the development of general-purpose fixed-size sentence representations.

(04/22) SentEval new tasks: Added probing tasks for evaluating what linguistic properties are encoded in sentence embeddings

(10/04) SentEval example scripts for three sentence encoders: SkipThought-LN/GenSen/Google-USE

Dependencies

This code is written in python. The dependencies are:

Transfer tasks

Downstream tasks

SentEval allows you to evaluate your sentence embeddings as features for the following downstream tasks:

TaskType#train#testneeds_trainset_classifier
MRmovie review11k11k11
CRproduct review4k4k11
SUBJsubjectivity status10k10k11
MPQAopinion-polarity11k11k11
SSTbinary sentiment analysis67k1.8k11
SSTfine-grained sentiment analysis8.5k2.2k11
TRECquestion-type classification6k0.5k11
SICK-Enatural language inference4.5k4.9k11
SNLInatural language inference550k9.8k11
MRPCparaphrase detection4.1k1.7k11
STS 2012semantic textual similarityN/A3.1k00
STS 2013semantic textual similarityN/A1.5k00
STS 2014semantic textual similarityN/A3.7k00
STS 2015semantic textual similarityN/A8.5k00
STS 2016semantic textual similarityN/A9.2k00
STS Bsemantic textual similarity5.7k1.4k10
SICK-Rsemantic textual similarity4.5k4.9k10
COCOimage-caption retrieval567k5*1k10

where needs_train means a model with parameters is learned on top of the sentence embeddings, and set_classifier means you can define the parameters of the classifier in the case of a classification task (see below).

Note: COCO comes with ResNet-101 2048d image embeddings. More details on the tasks.

Probing tasks

SentEval also includes a series of probing tasks to evaluate what linguistic properties are encoded in your sentence embeddings:

TaskType#train#testneeds_trainset_classifier
SentLenLength prediction100k10k11
WCWord Content analysis100k10k11
TreeDepthTree depth prediction100k10k11
TopConstTop Constituents prediction100k10k11
BShiftWord order analysis100k10k11
TenseVerb tense prediction100k10k11
SubjNumSubject number prediction100k10k11
ObjNumObject number prediction100k10k11
SOMOSemantic odd man out100k10k11
CoordInvCoordination Inversion100k10k11

Download datasets

To get all the transfer tasks datasets, run (in data/downstream/):

./get_transfer_data.bash

This will automatically download and preprocess the downstream datasets, and store them in data/downstream (warning: for MacOS users, you may have to use p7zip instead of unzip). The probing tasks are already in data/probing by default.

How to use SentEval: examples

examples/bow.py

In examples/bow.py, we evaluate the quality of the average of word embeddings.

To download state-of-the-art fastText embeddings:

curl -Lo glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
curl -Lo crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip

To reproduce the results for bag-of-vectors, run (in examples/):

python bow.py

As required by SentEval, this script implements two functions: prepare (optional) and batcher (required) that turn text sentences into sentence embeddings. Then SentEval takes care of the evaluation on the transfer tasks using the embeddings as features.

examples/infersent.py

To get the InferSent model and reproduce our results, download our best models and run infersent.py (in examples/):

curl -Lo examples/infersent1.pkl https://dl.fbaipublicfiles.com/senteval/infersent/infersent1.pkl
curl -Lo examples/infersent2.pkl https://dl.fbaipublicfiles.com/senteval/infersent/infersent2.pkl

examples/skipthought.py - examples/gensen.py - examples/googleuse.py

We also provide example scripts for three other encoders:

Note that for SkipThought and GenSen, following the steps of the associated githubs is necessary. The Google encoder script should work as-is.

How to use SentEval

To evaluate your sentence embeddings, SentEval requires that you implement two functions:

  1. prepare (sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors etc)
  2. batcher (transforms a batch of text sentences into sentence embeddings)

1.) prepare(params, samples) (optional)

batcher only sees one batch at a time while the samples argument of prepare contains all the sentences of a task.

prepare(params, samples)

Example: in bow.py, prepare is is used to build the vocabulary of words and construct the "params.word_vect* dictionary of word vectors.

2.) batcher(params, batch)

batcher(params, batch)

Example: in bow.py, batcher is used to compute the mean of the word vectors for each sentence in the batch using params.word_vec. Use your own encoder in that function to encode sentences.

3.) evaluation on transfer tasks

After having implemented the batch and prepare function for your own sentence encoder,

  1. to perform the actual evaluation, first import senteval and set its parameters:
import senteval
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
  1. (optional) set the parameters of the classifier (when applicable):
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                                 'tenacity': 5, 'epoch_size': 4}

You can choose nhid=0 (Logistic Regression) or nhid>0 (MLP) and define the parameters for training.

  1. Create an instance of the class SE:
se = senteval.engine.SE(params, batcher, prepare)
  1. define the set of transfer tasks and run the evaluation:
transfer_tasks = ['MR', 'SICKEntailment', 'STS14', 'STSBenchmark']
results = se.eval(transfer_tasks)

The current list of available tasks is:

['CR', 'MR', 'MPQA', 'SUBJ', 'SST2', 'SST5', 'TREC', 'MRPC', 'SNLI',
'SICKEntailment', 'SICKRelatedness', 'STSBenchmark', 'ImageCaptionRetrieval',
'STS12', 'STS13', 'STS14', 'STS15', 'STS16',
'Length', 'WordContent', 'Depth', 'TopConstituents','BigramShift', 'Tense',
'SubjNumber', 'ObjNumber', 'OddManOut', 'CoordinationInversion']

SentEval parameters

Global parameters of SentEval:

# senteval parameters
task_path                   # path to SentEval datasets (required)
seed                        # seed
usepytorch                  # use cuda-pytorch (else scikit-learn) where possible
kfold                       # k-fold validation for MR/CR/SUB/MPQA.

Parameters of the classifier:

nhid:                       # number of hidden units (0: Logistic Regression, >0: MLP); Default nonlinearity: Tanh
optim:                      # optimizer ("sgd,lr=0.1", "adam", "rmsprop" ..)
tenacity:                   # how many times dev acc does not increase before training stops
epoch_size:                 # each epoch corresponds to epoch_size pass on the train set
max_epoch:                  # max number of epoches
dropout:                    # dropout for MLP

Note that to get a proxy of the results while dramatically reducing computation time, we suggest the prototyping config:

params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 5}
params['classifier'] = {'nhid': 0, 'optim': 'rmsprop', 'batch_size': 128,
                                 'tenacity': 3, 'epoch_size': 2}

which will results in a 5 times speedup for classification tasks.

To produce results that are comparable to the literature, use the default config:

params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                                 'tenacity': 5, 'epoch_size': 4}

which takes longer but will produce better and comparable results.

For probing tasks, we used an MLP with a Sigmoid nonlinearity and and tuned the nhid (in [50, 100, 200]) and dropout (in [0.0, 0.1, 0.2]) on the dev set.

References

Please considering citing [1] if using this code for evaluating sentence embedding methods.

SentEval: An Evaluation Toolkit for Universal Sentence Representations

[1] A. Conneau, D. Kiela, SentEval: An Evaluation Toolkit for Universal Sentence Representations

@article{conneau2018senteval,
  title={SentEval: An Evaluation Toolkit for Universal Sentence Representations},
  author={Conneau, Alexis and Kiela, Douwe},
  journal={arXiv preprint arXiv:1803.05449},
  year={2018}
}

Contact: aconneau@fb.com, dkiela@fb.com

Related work