<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>

sense2vec: Contextually-keyed word vectors

sense2vec (Trask et al., 2015) is a nice twist on word2vec that lets you learn more interesting and detailed word vectors. This library is a simple Python implementation for loading, querying and training sense2vec models. For more details, check out our blog post. To explore the semantic similarities across all Reddit comments of 2015 and 2019, see the interactive demo.

🦆 Version 2.0 (for spaCy v3) out now! Read the release notes here.

✨ Features

🚀 Quickstart

Standalone usage

from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
query = "natural_language_processing|NOUN"
assert query in s2v
vector = s2v[query]
freq = s2v.get_freq(query)
most_similar = s2v.most_similar(query, n=3)
# [('machine_learning|NOUN', 0.8986967),
#  ('computer_vision|NOUN', 0.8636297),
#  ('deep_learning|NOUN', 0.8573361)]

Usage as a spaCy pipeline component

⚠️ Note that this example describes usage with spaCy v3. For usage with spaCy v2, download sense2vec==1.0.3 and check out the v1.x branch of this repo.

import spacy

nlp = spacy.load("en_core_web_sm")
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("/path/to/s2v_reddit_2015_md")

doc = nlp("A sentence about natural language processing.")
assert doc[3:6].text == "natural language processing"
freq = doc[3:6]._.s2v_freq
vector = doc[3:6]._.s2v_vec
most_similar = doc[3:6]._.s2v_most_similar(3)
# [(('machine learning', 'NOUN'), 0.8986967),
#  (('computer vision', 'NOUN'), 0.8636297),
#  (('deep learning', 'NOUN'), 0.8573361)]

Interactive demos

<img width="34%" src="https://user-images.githubusercontent.com/13643239/68093565-1bb6ea80-fe97-11e9-8192-e293acc290fe.png" align="right" />

To try out our pretrained vectors trained on Reddit comments, check out the interactive sense2vec demo.

This repo also includes a Streamlit demo script for exploring vectors and the most similar phrases. After installing streamlit, you can run the script with streamlit run and one or more paths to pretrained vectors as positional arguments on the command line. For example:

pip install streamlit
streamlit run https://raw.githubusercontent.com/explosion/sense2vec/master/scripts/streamlit_sense2vec.py /path/to/vectors

Pretrained vectors

To use the vectors, download the archive(s) and pass the extracted directory to Sense2Vec.from_disk or Sense2VecComponent.from_disk. The vector files are attached to the GitHub release. Large files have been split into multi-part downloads.

| Vectors | Size | Description | 📥 Download (zipped) |
| --- | --- | --- | --- |
| `s2v_reddit_2019_lg` | 4 GB | Reddit comments 2019 (01-07) | part 1, part 2, part 3 |
| `s2v_reddit_2015_md` | 573 MB | Reddit comments 2015 | part 1 |

To merge the multi-part archives, you can run the following:

cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz
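If you prefer doing both steps from Python, a small sketch (the `merge_and_extract` helper is hypothetical, not part of the library; concatenating the `.tar.gz.*` pieces in order reproduces the original archive byte-for-byte):

```python
import tarfile
from pathlib import Path

def merge_and_extract(parts, archive, dest):
    """Concatenate multi-part downloads in order, then unpack the merged archive."""
    with open(archive, "wb") as out:
        for part in sorted(parts):
            out.write(Path(part).read_bytes())
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
```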

⏳ Installation & Setup

sense2vec releases are available on pip:

pip install sense2vec

To use pretrained vectors, download one of the vector packages, unpack the .tar.gz archive and point from_disk to the extracted data directory:

from sense2vec import Sense2Vec
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")

👩‍💻 Usage

Usage with spaCy v3

The easiest way to use the library and vectors is to plug it into your spaCy pipeline. The sense2vec package exposes a Sense2VecComponent, which can be initialised with the shared vocab and added to your spaCy pipeline as a custom pipeline component. By default, components are added to the end of the pipeline, which is the recommended position for this component, since it needs access to the dependency parse and, if available, named entities.

import spacy
from sense2vec import Sense2VecComponent

nlp = spacy.load("en_core_web_sm")
s2v = nlp.add_pipe("sense2vec")
s2v.from_disk("/path/to/s2v_reddit_2015_md")

The component will add several extension attributes and methods to spaCy's Token and Span objects that let you retrieve vectors and frequencies, as well as most similar terms.

doc = nlp("A sentence about natural language processing.")
assert doc[3:6].text == "natural language processing"
freq = doc[3:6]._.s2v_freq
vector = doc[3:6]._.s2v_vec
most_similar = doc[3:6]._.s2v_most_similar(3)

For entities, the entity labels are used as the "sense" (instead of the token's part-of-speech tag):

doc = nlp("A sentence about Facebook and Google.")
for ent in doc.ents:
    assert ent._.in_s2v
    most_similar = ent._.s2v_most_similar(3)

Available attributes

The following extension attributes are exposed on the Doc object via the ._ property:

| Name | Attribute Type | Type | Description |
| --- | --- | --- | --- |
| `s2v_phrases` | property | list | All sense2vec-compatible phrases in the given `Doc` (noun phrases, named entities). |

The following attributes are available via the ._ property of Token and Span objects – for example token._.in_s2v:

| Name | Attribute Type | Return Type | Description |
| --- | --- | --- | --- |
| `in_s2v` | property | bool | Whether a key exists in the vector map. |
| `s2v_key` | property | unicode | The sense2vec key of the given object, e.g. `"duck|NOUN"`. |
| `s2v_vec` | property | `ndarray[float32]` | The vector of the given key. |
| `s2v_freq` | property | int | The frequency of the given key. |
| `s2v_other_senses` | property | list | Available other senses, e.g. `"duck|VERB"` for `"duck|NOUN"`. |
| `s2v_most_similar` | method | list | Get the `n` most similar terms. Returns a list of `((word, sense), score)` tuples. |
| `s2v_similarity` | method | float | Get the similarity to another `Token` or `Span`. |

⚠️ A note on span attributes: Under the hood, entities in doc.ents are Span objects. This is why the pipeline component also adds attributes and methods to spans and not just tokens. However, it's not recommended to use the sense2vec attributes on arbitrary slices of the document, since the model likely won't have a key for the respective text. Span objects also don't have a part-of-speech tag, so if no entity label is present, the "sense" defaults to the root's part-of-speech tag.

Adding sense2vec to a trained pipeline

If you're training and packaging a spaCy pipeline and want to include a sense2vec component in it, you can load in the data via the [initialize] block of the training config:

[initialize.components]

[initialize.components.sense2vec]
data_path = "/path/to/s2v_reddit_2015_md"

Standalone usage

You can also use the underlying Sense2Vec class directly and load in the vectors using the from_disk method. See below for the available API methods.

from sense2vec import Sense2Vec
s2v = Sense2Vec().from_disk("/path/to/reddit_vectors-1.1.0")
most_similar = s2v.most_similar("natural_language_processing|NOUN", n=10)

⚠️ Important note: To look up entries in the vectors table, the keys need to follow the scheme of phrase_text|SENSE (note the _ instead of spaces and the | before the tag or label) – for example, machine_learning|NOUN. Also note that the underlying vector table is case-sensitive.
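The key scheme can be sketched with two small helper functions (a simplified illustration of the default behavior, not the library's exact implementation):

```python
def make_key(word: str, sense: str) -> str:
    # Join the phrase and sense with "|", using "_" instead of spaces
    return word.replace(" ", "_") + "|" + sense

def split_key(key: str) -> tuple:
    # Invert make_key: split on the last "|" and restore spaces
    word, _, sense = key.rpartition("|")
    return word.replace("_", " "), sense

assert make_key("machine learning", "NOUN") == "machine_learning|NOUN"
assert split_key("machine_learning|NOUN") == ("machine learning", "NOUN")
```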

🎛 API

<kbd>class</kbd> Sense2Vec

The standalone Sense2Vec object that holds the vectors, strings and frequencies.

<kbd>method</kbd> Sense2Vec.__init__

Initialize the Sense2Vec object.

| Argument | Type | Description |
| --- | --- | --- |
| `shape` | tuple | The vector shape. Defaults to `(1000, 128)`. |
| `strings` | `spacy.strings.StringStore` | Optional string store. Will be created if it doesn't exist. |
| `senses` | list | Optional list of all available senses. Used in methods that generate the best sense or other senses. |
| `vectors_name` | unicode | Optional name to assign to the `Vectors` table, to prevent clashes. Defaults to `"sense2vec"`. |
| `overrides` | dict | Optional custom functions to use, mapped to names registered via the registry, e.g. `{"make_key": "custom_make_key"}`. |
| **RETURNS** | `Sense2Vec` | The newly constructed object. |

s2v = Sense2Vec(shape=(300, 128), senses=["VERB", "NOUN"])

<kbd>method</kbd> Sense2Vec.__len__

The number of rows in the vectors table.

| Argument | Type | Description |
| --- | --- | --- |
| **RETURNS** | int | The number of rows in the vectors table. |

s2v = Sense2Vec(shape=(300, 128))
assert len(s2v) == 300

<kbd>method</kbd> Sense2Vec.__contains__

Check if a key is in the vectors table.

| Argument | Type | Description |
| --- | --- | --- |
| `key` | unicode / int | The key to look up. |
| **RETURNS** | bool | Whether the key is in the table. |

import numpy

s2v = Sense2Vec(shape=(10, 4))
s2v.add("avocado|NOUN", numpy.asarray([4, 2, 2, 2], dtype=numpy.float32))
assert "avocado|NOUN" in s2v
assert "avocado|VERB" not in s2v

<kbd>method</kbd> Sense2Vec.__getitem__

Retrieve a vector for a given key. Returns None if the key is not in the table.

| Argument | Type | Description |
| --- | --- | --- |
| `key` | unicode / int | The key to look up. |
| **RETURNS** | `numpy.ndarray` | The vector or None. |

vec = s2v["avocado|NOUN"]

<kbd>method</kbd> Sense2Vec.__setitem__

Set a vector for a given key. Will raise an error if the key doesn't exist. To add a new entry, use Sense2Vec.add.

| Argument | Type | Description |
| --- | --- | --- |
| `key` | unicode / int | The key. |
| `vector` | `numpy.ndarray` | The vector to set. |

vec = s2v["avocado|NOUN"]
s2v["avocado|NOUN"] = vec

<kbd>method</kbd> Sense2Vec.add

Add a new vector to the table.

| Argument | Type | Description |
| --- | --- | --- |
| `key` | unicode / int | The key to add. |
| `vector` | `numpy.ndarray` | The vector to add. |
| `freq` | int | Optional frequency count. Used to find best matching senses. |

vec = s2v["avocado|NOUN"]
s2v.add("🥑|NOUN", vec, 1234)

<kbd>method</kbd> Sense2Vec.get_freq

Get the frequency count for a given key.

| Argument | Type | Description |
| --- | --- | --- |
| `key` | unicode / int | The key to look up. |
| `default` | - | Default value to return if no frequency is found. |
| **RETURNS** | int | The frequency count. |

vec = s2v["avocado|NOUN"]
s2v.add("🥑|NOUN", vec, 1234)
assert s2v.get_freq("🥑|NOUN") == 1234

<kbd>method</kbd> Sense2Vec.set_freq

Set a frequency count for a given key.

| Argument | Type | Description |
| --- | --- | --- |
| `key` | unicode / int | The key to set the count for. |
| `freq` | int | The frequency count. |

s2v.set_freq("avocado|NOUN", 104294)

<kbd>method</kbd> Sense2Vec.__iter__, Sense2Vec.items

Iterate over the entries in the vectors table.

| Argument | Type | Description |
| --- | --- | --- |
| **YIELDS** | tuple | String key and vector pairs in the table. |

for key, vec in s2v:
    print(key, vec)

for key, vec in s2v.items():
    print(key, vec)

<kbd>method</kbd> Sense2Vec.keys

Iterate over the keys in the table.

| Argument | Type | Description |
| --- | --- | --- |
| **YIELDS** | unicode | The string keys in the table. |

all_keys = list(s2v.keys())

<kbd>method</kbd> Sense2Vec.values

Iterate over the vectors in the table.

| Argument | Type | Description |
| --- | --- | --- |
| **YIELDS** | `numpy.ndarray` | The vectors in the table. |

all_vecs = list(s2v.values())

<kbd>property</kbd> Sense2Vec.senses

The available senses in the table, e.g. "NOUN" or "VERB" (added at initialization).

| Argument | Type | Description |
| --- | --- | --- |
| **RETURNS** | list | The available senses. |

s2v = Sense2Vec(senses=["VERB", "NOUN"])
assert "VERB" in s2v.senses

<kbd>property</kbd> Sense2Vec.frequencies

The frequencies of the keys in the table, in descending order.

| Argument | Type | Description |
| --- | --- | --- |
| **RETURNS** | list | The `(key, freq)` tuples by frequency, descending. |

most_frequent = s2v.frequencies[:10]
key, freq = s2v.frequencies[0]

<kbd>method</kbd> Sense2Vec.similarity

Make a semantic similarity estimate of two keys or two sets of keys. The default estimate is cosine similarity using an average of vectors.

| Argument | Type | Description |
| --- | --- | --- |
| `keys_a` | unicode / int / iterable | The string or integer key(s). |
| `keys_b` | unicode / int / iterable | The other string or integer key(s). |
| **RETURNS** | float | The similarity score. |

keys_a = ["machine_learning|NOUN", "natural_language_processing|NOUN"]
keys_b = ["computer_vision|NOUN", "object_detection|NOUN"]
print(s2v.similarity(keys_a, keys_b))
assert s2v.similarity("machine_learning|NOUN", "machine_learning|NOUN") == 1.0
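The default estimate can be sketched in a few lines of NumPy (a simplified illustration, not the library's exact implementation): average each set of vectors, then take the cosine of the two averages.

```python
import numpy

def avg_cosine(vecs_a, vecs_b):
    # Average each set of vectors, then compute cosine similarity
    a = numpy.mean(numpy.asarray(vecs_a), axis=0)
    b = numpy.mean(numpy.asarray(vecs_b), axis=0)
    return float(a.dot(b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b)))

a = [[1.0, 0.0], [0.0, 1.0]]
assert abs(avg_cosine(a, a) - 1.0) < 1e-6   # identical sets -> similarity 1.0
```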

<kbd>method</kbd> Sense2Vec.most_similar

Get the most similar entries in the table. If more than one key is provided, the average of the vectors is used. To make this method faster, see the 06_precompute_cache.py script for precomputing a cache of the nearest neighbors.

| Argument | Type | Description |
| --- | --- | --- |
| `keys` | unicode / int / iterable | The string or integer key(s) to compare to. |
| `n` | int | The number of similar keys to return. Defaults to 10. |
| `batch_size` | int | The batch size to use. Defaults to 16. |
| **RETURNS** | list | The `(key, score)` tuples of the most similar vectors. |

most_similar = s2v.most_similar("natural_language_processing|NOUN", n=3)
# [('machine_learning|NOUN', 0.8986967),
#  ('computer_vision|NOUN', 0.8636297),
#  ('deep_learning|NOUN', 0.8573361)]
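Conceptually, the lookup is a cosine ranking over the whole table. A brute-force sketch (using a plain dict as a stand-in for the vectors table; the real implementation batches the computation and can use the precomputed cache):

```python
import numpy

def most_similar(table, query_vecs, n=10):
    """table: dict mapping keys to 1D vectors; query_vecs: list of vectors."""
    query = numpy.mean(numpy.asarray(query_vecs), axis=0)
    query = query / numpy.linalg.norm(query)
    scores = []
    for key, vec in table.items():
        # Cosine similarity of each table row against the averaged query
        scores.append((key, float(vec.dot(query) / numpy.linalg.norm(vec))))
    return sorted(scores, key=lambda kv: -kv[1])[:n]
```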

<kbd>method</kbd> Sense2Vec.get_other_senses

Find other entries for the same word with a different sense, e.g. "duck|VERB" for "duck|NOUN".

| Argument | Type | Description |
| --- | --- | --- |
| `key` | unicode / int | The key to check. |
| `ignore_case` | bool | Check for uppercase, lowercase and titlecase. Defaults to True. |
| **RETURNS** | list | The string keys of other entries with different senses. |

other_senses = s2v.get_other_senses("duck|NOUN")
# ['duck|VERB', 'Duck|ORG', 'Duck|VERB', 'Duck|PERSON', 'Duck|ADJ']
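The case-insensitive matching can be sketched like this (a simplified stand-in for the library's logic, operating on a plain list of keys):

```python
def get_other_senses(keys, key, ignore_case=True):
    # Split the query key into word and sense
    word, _, sense = key.rpartition("|")
    variants = [word]
    if ignore_case:
        variants += [word.upper(), word.lower(), word.title()]
    results = []
    for candidate in keys:
        cand_word, _, cand_sense = candidate.rpartition("|")
        # Keep entries for the same word form with a different sense
        if cand_sense != sense and cand_word in variants:
            results.append(candidate)
    return results
```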

<kbd>method</kbd> Sense2Vec.get_best_sense

Find the best-matching sense for a given word based on the available senses and frequency counts. Returns None if no match is found.

| Argument | Type | Description |
| --- | --- | --- |
| `word` | unicode | The word to check. |
| `senses` | list | Optional list of senses to limit the search to. If not set / empty, all senses in the vectors are used. |
| `ignore_case` | bool | Check for uppercase, lowercase and titlecase. Defaults to True. |
| **RETURNS** | unicode | The best-matching key or None. |

assert s2v.get_best_sense("duck") == "duck|NOUN"
assert s2v.get_best_sense("duck", ["VERB", "ADJ"]) == "duck|VERB"
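Conceptually, this boils down to picking the candidate key with the highest frequency count. A rough sketch (using a plain dict of key frequencies as a stand-in for the vectors table):

```python
def get_best_sense(freqs, word, senses=None):
    """freqs: dict mapping "word|SENSE" keys to frequency counts."""
    candidates = []
    for key, freq in freqs.items():
        key_word, _, key_sense = key.rpartition("|")
        # Case-insensitive word match, optionally limited to given senses
        if key_word.lower() == word.lower():
            if not senses or key_sense in senses:
                candidates.append((freq, key))
    return max(candidates)[1] if candidates else None
```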

<kbd>method</kbd> Sense2Vec.to_bytes

Serialize a Sense2Vec object to a bytestring.

| Argument | Type | Description |
| --- | --- | --- |
| `exclude` | list | Names of serialization fields to exclude. |
| **RETURNS** | bytes | The serialized Sense2Vec object. |

s2v_bytes = s2v.to_bytes()

<kbd>method</kbd> Sense2Vec.from_bytes

Load a Sense2Vec object from a bytestring.

| Argument | Type | Description |
| --- | --- | --- |
| `bytes_data` | bytes | The data to load. |
| `exclude` | list | Names of serialization fields to exclude. |
| **RETURNS** | `Sense2Vec` | The loaded object. |

s2v_bytes = s2v.to_bytes()
new_s2v = Sense2Vec().from_bytes(s2v_bytes)

<kbd>method</kbd> Sense2Vec.to_disk

Serialize a Sense2Vec object to a directory.

| Argument | Type | Description |
| --- | --- | --- |
| `path` | unicode / `Path` | The path. |
| `exclude` | list | Names of serialization fields to exclude. |

s2v.to_disk("/path/to/sense2vec")

<kbd>method</kbd> Sense2Vec.from_disk

Load a Sense2Vec object from a directory.

| Argument | Type | Description |
| --- | --- | --- |
| `path` | unicode / `Path` | The path to load from. |
| `exclude` | list | Names of serialization fields to exclude. |
| **RETURNS** | `Sense2Vec` | The loaded object. |

RETURNSSense2VecThe loaded object.
s2v.to_disk("/path/to/sense2vec")
new_s2v = Sense2Vec().from_disk("/path/to/sense2vec")

<kbd>class</kbd> Sense2VecComponent

The pipeline component to add sense2vec to spaCy pipelines.

<kbd>method</kbd> Sense2VecComponent.__init__

Initialize the pipeline component.

| Argument | Type | Description |
| --- | --- | --- |
| `vocab` | `Vocab` | The shared `Vocab`. Mostly used for the shared `StringStore`. |
| `shape` | tuple | The vector shape. |
| `merge_phrases` | bool | Whether to merge sense2vec phrases into one token. Defaults to False. |
| `lemmatize` | bool | Always look up lemmas if available in the vectors, otherwise default to the original word. Defaults to False. |
| `overrides` | dict | Optional custom functions to use, mapped to names registered via the registry, e.g. `{"make_key": "custom_make_key"}`. |
| **RETURNS** | `Sense2VecComponent` | The newly constructed object. |

s2v = Sense2VecComponent(nlp.vocab)

<kbd>classmethod</kbd> Sense2VecComponent.from_nlp

Initialize the component from an nlp object. Mostly used as the component factory for the entry point (see setup.cfg) and to auto-register via the @spacy.component decorator.

| Argument | Type | Description |
| --- | --- | --- |
| `nlp` | `Language` | The `nlp` object. |
| `**cfg` | - | Optional config parameters. |
| **RETURNS** | `Sense2VecComponent` | The newly constructed object. |

s2v = Sense2VecComponent.from_nlp(nlp)

<kbd>method</kbd> Sense2VecComponent.__call__

Process a Doc object with the component. Typically only called as part of the spaCy pipeline and not directly.

| Argument | Type | Description |
| --- | --- | --- |
| `doc` | `Doc` | The document to process. |
| **RETURNS** | `Doc` | The processed document. |

<kbd>method</kbd> Sense2VecComponent.init_component

Register the component-specific extension attributes. Registration happens here, and only if the component is added to the pipeline and used – otherwise, tokens would receive the attributes even when the component is only created and never added.

<kbd>method</kbd> Sense2VecComponent.to_bytes

Serialize the component to a bytestring. Also called when the component is added to the pipeline and you run nlp.to_bytes.

| Argument | Type | Description |
| --- | --- | --- |
| **RETURNS** | bytes | The serialized component. |

<kbd>method</kbd> Sense2VecComponent.from_bytes

Load a component from a bytestring. Also called when you run nlp.from_bytes.

| Argument | Type | Description |
| --- | --- | --- |
| `bytes_data` | bytes | The data to load. |
| **RETURNS** | `Sense2VecComponent` | The loaded object. |

<kbd>method</kbd> Sense2VecComponent.to_disk

Serialize the component to a directory. Also called when the component is added to the pipeline and you run nlp.to_disk.

| Argument | Type | Description |
| --- | --- | --- |
| `path` | unicode / `Path` | The path. |

<kbd>method</kbd> Sense2VecComponent.from_disk

Load a Sense2Vec object from a directory. Also called when you run nlp.from_disk.

| Argument | Type | Description |
| --- | --- | --- |
| `path` | unicode / `Path` | The path to load from. |
| **RETURNS** | `Sense2VecComponent` | The loaded object. |

<kbd>class</kbd> registry

Function registry (powered by catalogue) to easily customize the functions used to generate keys and phrases. Allows you to decorate and name custom functions, swap them out and serialize the custom names when you save out the model. The following registry options are available:

| Name | Description |
| --- | --- |
| `registry.make_key` | Given a word and sense, return a string of the key, e.g. `"word|sense"`. |
| `registry.split_key` | Given a string key, return a `(word, sense)` tuple. |
| `registry.make_spacy_key` | Given a spaCy object (`Token` or `Span`) and a boolean `prefer_ents` keyword argument (whether to prefer the entity label for single tokens), return a `(word, sense)` tuple. Used in extension attributes to generate a key for tokens and spans. |
| `registry.get_phrases` | Given a spaCy `Doc`, return a list of `Span` objects used for sense2vec phrases (typically noun phrases and named entities). |
| `registry.merge_phrases` | Given a spaCy `Doc`, get all sense2vec phrases and merge them into single tokens. |

Each registry has a register method that can be used as a function decorator and takes one argument, the name of the custom function.

from sense2vec import registry

@registry.make_key.register("custom")
def custom_make_key(word, sense):
    return f"{word}###{sense}"

@registry.split_key.register("custom")
def custom_split_key(key):
    word, sense = key.split("###")
    return word, sense

When initializing the Sense2Vec object, you can now pass in a dictionary of overrides with the names of your custom registered functions.

overrides = {"make_key": "custom", "split_key": "custom"}
s2v = Sense2Vec(overrides=overrides)

This makes it easy to experiment with different strategies and serializing the strategies as plain strings (instead of having to pass around and/or pickle the functions themselves).

🚂 Training your own sense2vec vectors

The /scripts directory contains command line utilities for preprocessing text and training your own vectors.

Requirements

To train your own sense2vec vectors, you'll need a large raw text corpus to process and a compiled build of GloVe or fastText (see the step-by-step process below).

Step-by-step process

The training process is split up into several steps to allow you to resume at any given point. Processing scripts are designed to operate on single files, making it easy to parallelize the work. The scripts in this repo require either GloVe or fastText, which you need to clone and make.

For fastText, the scripts will require the path to the compiled binary. If you're working on Windows, you can build with cmake, or alternatively use the .exe file from this unofficial repo with fastText binary builds for Windows: https://github.com/xiamx/fastText/releases.

| Script | Description |
| --- | --- |
| 1. `01_parse.py` | Use spaCy to parse the raw text and output binary collections of `Doc` objects (see `DocBin`). |
| 2. `02_preprocess.py` | Load a collection of parsed `Doc` objects produced in the previous step and output text files in the sense2vec format (one sentence per line and merged phrases with senses). |
| 3. `03_glove_build_counts.py` | Use GloVe to build the vocabulary and counts. Skip this step if you're using Word2Vec via fastText. |
| 4. `04_glove_train_vectors.py`<br />`04_fasttext_train_vectors.py` | Use GloVe or fastText to train vectors. |
| 5. `05_export.py` | Load the vectors and frequencies and output a sense2vec component that can be loaded via `Sense2Vec.from_disk`. |
| 6. `06_precompute_cache.py` | Optional: Precompute nearest-neighbor queries for every entry in the vocab to make `Sense2Vec.most_similar` faster. |

For more detailed documentation of the scripts, check out the source or run them with --help. For example, python scripts/01_parse.py --help.

🍳 Prodigy recipes

This package also seamlessly integrates with the Prodigy annotation tool and exposes recipes for using sense2vec vectors to quickly generate lists of multi-word phrases and bootstrap NER annotations. To use a recipe, sense2vec needs to be installed in the same environment as Prodigy. For an example of a real-world use case, check out this NER project with downloadable datasets.

The following recipes are available – see below for more detailed docs.

| Recipe | Description |
| --- | --- |
| `sense2vec.teach` | Bootstrap a terminology list using sense2vec. |
| `sense2vec.to-patterns` | Convert phrases dataset to token-based match patterns. |
| `sense2vec.eval` | Evaluate a sense2vec model by asking about phrase triples. |
| `sense2vec.eval-most-similar` | Evaluate a sense2vec model by correcting the most similar entries. |
| `sense2vec.eval-ab` | Perform an A/B evaluation of two pretrained sense2vec vector models. |

<kbd>recipe</kbd> sense2vec.teach

Bootstrap a terminology list using sense2vec. Prodigy will suggest similar terms based on the most similar phrases from sense2vec, and the suggestions will be adjusted as you annotate and accept similar phrases. For each seed term, the best matching sense according to the sense2vec vectors will be used.

prodigy sense2vec.teach [dataset] [vectors_path] [--seeds] [--threshold]
[--n-similar] [--batch-size] [--resume]
| Argument | Type | Description |
| --- | --- | --- |
| `dataset` | positional | Dataset to save annotations to. |
| `vectors_path` | positional | Path to pretrained sense2vec vectors. |
| `--seeds`, `-s` | option | One or more comma-separated seed phrases. |
| `--threshold`, `-t` | option | Similarity threshold. Defaults to 0.85. |
| `--n-similar`, `-n` | option | Number of similar items to get at once. |
| `--batch-size`, `-b` | option | Batch size for submitting annotations. |
| `--resume`, `-R` | flag | Resume from an existing phrases dataset. |

Example

prodigy sense2vec.teach tech_phrases /path/to/s2v_reddit_2015_md
--seeds "natural language processing, machine learning, artificial intelligence"

<kbd>recipe</kbd> sense2vec.to-patterns

Convert a dataset of phrases collected with sense2vec.teach to token-based match patterns that can be used with spaCy's EntityRuler or recipes like ner.match. If no output file is specified, the patterns are written to stdout. The examples are tokenized so that multi-token terms are represented correctly, e.g.: {"label": "SHOE_BRAND", "pattern": [{ "LOWER": "new" }, { "LOWER": "balance" }]}.

prodigy sense2vec.to-patterns [dataset] [spacy_model] [label] [--output-file]
[--case-sensitive] [--dry]
| Argument | Type | Description |
| --- | --- | --- |
| `dataset` | positional | Phrase dataset to convert. |
| `spacy_model` | positional | spaCy model for tokenization. |
| `label` | positional | Label to apply to all patterns. |
| `--output-file`, `-o` | option | Optional output file. Defaults to stdout. |
| `--case-sensitive`, `-CS` | flag | Make patterns case-sensitive. |
| `--dry`, `-D` | flag | Perform a dry run and don't output anything. |

Example

prodigy sense2vec.to-patterns tech_phrases en_core_web_sm TECHNOLOGY
--output-file /path/to/patterns.jsonl
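The conversion described above amounts to emitting one token attribute dict per word. A minimal sketch (it splits on whitespace for brevity, where the real recipe uses the spaCy model's tokenizer):

```python
def phrase_to_pattern(phrase, label, case_sensitive=False):
    # One {"LOWER": ...} (or {"TEXT": ...} if case-sensitive) entry per token
    attr = "TEXT" if case_sensitive else "LOWER"
    tokens = phrase.split() if case_sensitive else phrase.lower().split()
    return {"label": label, "pattern": [{attr: t} for t in tokens]}

assert phrase_to_pattern("New Balance", "SHOE_BRAND") == {
    "label": "SHOE_BRAND",
    "pattern": [{"LOWER": "new"}, {"LOWER": "balance"}],
}
```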

<kbd>recipe</kbd> sense2vec.eval

Evaluate a sense2vec model by asking about phrase triples: is word A more similar to word B, or to word C? If the human mostly agrees with the model, the vectors model is good. The recipe will only ask about vectors with the same sense and supports different example selection strategies.

prodigy sense2vec.eval [dataset] [vectors_path] [--strategy] [--senses]
[--exclude-senses] [--n-freq] [--threshold] [--batch-size] [--eval-whole]
[--eval-only] [--show-scores]
| Argument | Type | Description |
| --- | --- | --- |
| `dataset` | positional | Dataset to save annotations to. |
| `vectors_path` | positional | Path to pretrained sense2vec vectors. |
| `--strategy`, `-st` | option | Example selection strategy. `most_similar` (default) or `random`. |
| `--senses`, `-s` | option | Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used. |
| `--exclude-senses`, `-es` | option | Comma-separated list of senses to exclude. See `prodigy_recipes.EVAL_EXCLUDE_SENSES` for the defaults. |
| `--n-freq`, `-f` | option | Number of most frequent entries to limit to. |
| `--threshold`, `-t` | option | Minimum similarity threshold to consider examples. |
| `--batch-size`, `-b` | option | Batch size to use. |
| `--eval-whole`, `-E` | flag | Evaluate the whole dataset instead of the current session. |
| `--eval-only`, `-O` | flag | Don't annotate, only evaluate the current dataset. |
| `--show-scores`, `-S` | flag | Show all scores for debugging. |

Strategies

| Name | Description |
| --- | --- |
| `most_similar` | Pick a random word from a random sense and get its most similar entries of the same sense. Ask about the similarity to the last and middle entry from that selection. |
| `most_least_similar` | Pick a random word from a random sense and get the least similar entry from its most similar entries, and then the last most similar entry of that. |
| `random` | Pick a random sample of 3 words from the same random sense. |

Example

prodigy sense2vec.eval vectors_eval /path/to/s2v_reddit_2015_md
--senses NOUN,ORG,PRODUCT --threshold 0.5

UI preview of sense2vec.eval

<kbd>recipe</kbd> sense2vec.eval-most-similar

Evaluate a vectors model by looking at the most similar entries it returns for a random phrase and unselecting the mistakes.

prodigy sense2vec.eval-most-similar [dataset] [vectors_path] [--senses] [--exclude-senses]
[--n-freq] [--n-similar] [--batch-size] [--eval-whole] [--eval-only]
[--show-scores]
| Argument | Type | Description |
| --- | --- | --- |
| `dataset` | positional | Dataset to save annotations to. |
| `vectors_path` | positional | Path to pretrained sense2vec vectors. |
| `--senses`, `-s` | option | Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used. |
| `--exclude-senses`, `-es` | option | Comma-separated list of senses to exclude. See `prodigy_recipes.EVAL_EXCLUDE_SENSES` for the defaults. |
| `--n-freq`, `-f` | option | Number of most frequent entries to limit to. |
| `--n-similar`, `-n` | option | Number of similar items to check. Defaults to 10. |
| `--batch-size`, `-b` | option | Batch size to use. |
| `--eval-whole`, `-E` | flag | Evaluate the whole dataset instead of the current session. |
| `--eval-only`, `-O` | flag | Don't annotate, only evaluate the current dataset. |
| `--show-scores`, `-S` | flag | Show all scores for debugging. |

prodigy sense2vec.eval-most-similar vectors_eval_sim /path/to/s2v_reddit_2015_md
--senses NOUN,ORG,PRODUCT

<kbd>recipe</kbd> sense2vec.eval-ab

Perform an A/B evaluation of two pretrained sense2vec vector models by comparing the most similar entries they return for a random phrase. The UI shows two randomized options with the most similar entries of each model and highlights the phrases that differ. At the end of the annotation session the overall stats and preferred model are shown.

prodigy sense2vec.eval-ab [dataset] [vectors_path_a] [vectors_path_b] [--senses]
[--exclude-senses] [--n-freq] [--n-similar] [--batch-size] [--eval-whole]
[--eval-only] [--show-mapping]
| Argument | Type | Description |
| --- | --- | --- |
| `dataset` | positional | Dataset to save annotations to. |
| `vectors_path_a` | positional | Path to pretrained sense2vec vectors. |
| `vectors_path_b` | positional | Path to pretrained sense2vec vectors. |
| `--senses`, `-s` | option | Comma-separated list of senses to limit the selection to. If not set, all senses in the vectors will be used. |
| `--exclude-senses`, `-es` | option | Comma-separated list of senses to exclude. See `prodigy_recipes.EVAL_EXCLUDE_SENSES` for the defaults. |
| `--n-freq`, `-f` | option | Number of most frequent entries to limit to. |
| `--n-similar`, `-n` | option | Number of similar items to check. Defaults to 10. |
| `--batch-size`, `-b` | option | Batch size to use. |
| `--eval-whole`, `-E` | flag | Evaluate the whole dataset instead of the current session. |
| `--eval-only`, `-O` | flag | Don't annotate, only evaluate the current dataset. |
| `--show-mapping`, `-S` | flag | Show which models are option 1 and option 2 in the UI (for debugging). |

prodigy sense2vec.eval-ab vectors_eval_sim /path/to/s2v_reddit_2015_md /path/to/s2v_reddit_2019_md --senses NOUN,ORG,PRODUCT

UI preview of sense2vec.eval-ab

Pretrained vectors

The pretrained Reddit vectors support the following "senses", either part-of-speech tags or entity labels. For more details, see spaCy's annotation scheme overview.

| Tag | Description | Examples |
| --- | --- | --- |
| ADJ | adjective | big, old, green |
| ADP | adposition | in, to, during |
| ADV | adverb | very, tomorrow, down, where |
| AUX | auxiliary | is, has (done), will (do) |
| CONJ | conjunction | and, or, but |
| DET | determiner | a, an, the |
| INTJ | interjection | psst, ouch, bravo, hello |
| NOUN | noun | girl, cat, tree, air, beauty |
| NUM | numeral | 1, 2017, one, seventy-seven, MMXIV |
| PART | particle | 's, not |
| PRON | pronoun | I, you, he, she, myself, somebody |
| PROPN | proper noun | Mary, John, London, NATO, HBO |
| PUNCT | punctuation | . , ? ( ) |
| SCONJ | subordinating conjunction | if, while, that |
| SYM | symbol | $, %, =, :), 😝 |
| VERB | verb | run, runs, running, eat, ate, eating |

| Entity Label | Description |
| --- | --- |
| PERSON | People, including fictional. |
| NORP | Nationalities or religious or political groups. |
| FACILITY | Buildings, airports, highways, bridges, etc. |
| ORG | Companies, agencies, institutions, etc. |
| GPE | Countries, cities, states. |
| LOC | Non-GPE locations, mountain ranges, bodies of water. |
| PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
| EVENT | Named hurricanes, battles, wars, sports events, etc. |
| WORK_OF_ART | Titles of books, songs, etc. |
| LANGUAGE | Any named language. |