IBM Code Model Asset Exchange: Word Embedding Generator

This repository contains code to generate word embeddings using the Swivel algorithm on IBM Watson Machine Learning. This model is part of the IBM Code Model Asset Exchange.

Machine learning algorithms usually expect numeric inputs. When a data scientist wants to use text to create a machine learning model, they must first find a way to represent their text as vectors of numbers. These vectors are called word embeddings. The Swivel algorithm [1] is a frequency-based word embedding method that works from a co-occurrence matrix. The idea is that words with similar meanings tend to occur together in a text corpus. As a result, words with similar meanings will have vector representations that are closer to each other than those of unrelated words.
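To make the co-occurrence idea concrete, here is a small illustrative sketch (not part of this repository) that counts sentence-level co-occurrences on a toy corpus and compares the resulting count vectors with cosine similarity. Swivel itself learns low-dimensional vectors from statistics of a much larger co-occurrence matrix rather than using raw counts directly.

```python
# Illustrative only: count sentence-level co-occurrences on a toy corpus,
# then compare the resulting count vectors with cosine similarity.
from itertools import combinations

import numpy as np

corpus = [
    "the dog chased the cat",
    "the cat chased the mouse",
    "the king ruled the kingdom",
]

# Vocabulary and a symmetric co-occurrence count matrix: two words
# "co-occur" here if they appear in the same sentence.
vocab = sorted({w for line in corpus for w in line.split()})
index = {w: i for i, w in enumerate(vocab)}
counts = np.zeros((len(vocab), len(vocab)))
for line in corpus:
    for w1, w2 in combinations(line.split(), 2):
        counts[index[w1], index[w2]] += 1
        counts[index[w2], index[w1]] += 1

def cosine(u, v):
    """Cosine similarity: close to 1 for similar directions, lower otherwise."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

# "dog" and "cat" share contexts, so their count vectors are closer than
# those of "dog" and "kingdom".
print(cosine(counts[index["dog"]], counts[index["cat"]]))      # higher
print(cosine(counts[index["dog"]], counts[index["kingdom"]]))  # lower
```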

This demo contains scripts to run the Swivel algorithm on a preprocessed Wikipedia text corpus. To generate word embeddings on your own text corpus, see the instructions in the original repository here.

Model Metadata

| Domain | Application | Industry | Framework | Training Data | Input Data Format |
| ------ | ----------- | -------- | --------- | ------------- | ----------------- |
| Text/NLP | Natural Language | General | TensorFlow | Any Text Corpus (e.g. Wiki Dump) | Text |

References

[1] N. Shazeer, R. Doherty, C. Evans, C. Waterson, "Swivel: Improving Embeddings by Noticing What's Missing", arXiv preprint arXiv:1602.02215 (2016)

Licenses

| Component | License | Link |
| --------- | ------- | ---- |
| This repository | Apache 2.0 | LICENSE |
| Model Code (3rd party) | Apache 2.0 | TensorFlow Models |
| Data | CC BY-SA 3.0 | Wikipedia Text Dump |

Quickstart

Prerequisites

- Set up an IBM Cloud Object Storage (COS) account
- Set up the IBM CLI & ML CLI

Training the model

The train.sh utility script deploys the experiment to Watson Machine Learning (WML) and starts the training as a training run:

train.sh

Once training has started, the script prints the training ID, which will be needed in the steps below:

Starting to train ...
OK
Model-ID is 'training-GCtN_YRig'

Monitor the training run

Exploring the embeddings

The demo.sh utility script downloads the results from the COS bucket, converts the embeddings into a binary vector format, and runs a Python application to explore the embeddings:

demo.sh

When you query a single word, the results list words that are similar in meaning:

query> dog
dog
dogs
cat
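Under the hood, a single-word query amounts to ranking the vocabulary by cosine similarity to the query word's vector. The sketch below illustrates this outside the demo script; the tab-separated "word followed by vector components" layout and the file name are assumptions for illustration, not a documented output format of this repository.

```python
# Illustrative only: load word vectors from a tab-separated file
# ("word<TAB>v1<TAB>v2 ...") and rank the vocabulary by cosine similarity.
# The file name used below is a placeholder, not a documented output of demo.sh.
import numpy as np

def load_embeddings(path):
    """Return a dict mapping each word to a unit-length numpy vector."""
    vectors = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            vec = np.array([float(x) for x in parts[1:]])
            vectors[parts[0]] = vec / (np.linalg.norm(vec) + 1e-8)
    return vectors

def most_similar(vectors, query, k=3):
    """Words whose vectors have the highest cosine similarity to the query."""
    q = vectors[query]
    ranked = sorted(vectors, key=lambda w: -float(np.dot(q, vectors[w])))
    return ranked[:k]

# vectors = load_embeddings("embeddings.tsv")   # placeholder path
# print(most_similar(vectors, "dog"))           # e.g. ['dog', 'dogs', 'cat']
```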

It is also possible to query for the completion of an analogy (e.g., a man is to a woman as a king is to ...):

query> man woman king
king
queen
princess
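An analogy query can be pictured as vector arithmetic: the vector woman - man + king tends to land near queen in a good embedding space. A minimal sketch, reusing the hypothetical load_embeddings helper above:

```python
# Illustrative only: answer "a is to b as c is to ?" with vector arithmetic,
# reusing the hypothetical load_embeddings helper sketched earlier.
import numpy as np

def analogy(vectors, a, b, c, k=3):
    """Words closest to vector(b) - vector(a) + vector(c)."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target) + 1e-8
    ranked = sorted(vectors, key=lambda w: -float(np.dot(target, vectors[w])))
    return ranked[:k]

# print(analogy(vectors, "man", "woman", "king"))  # e.g. ['king', 'queen', ...]
```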

Resources and Contributions

If you are interested in contributing to the Model Asset Exchange project or have any queries, please follow the instructions here.