💎 ruby-spacy

Overview

ruby-spacy is a wrapper module for using spaCy from the Ruby programming language via PyCall. It aims to make it easy and natural for Ruby programmers to use spaCy. The module covers the spaCy functionality needed to work with its many varieties of trained language models, not to build new models.

Functionality
Tokenization, lemmatization, sentence segmentation
Part-of-speech tagging and dependency parsing
Named entity recognition
Syntactic dependency visualization
Access to pre-trained word vectors
OpenAI Chat/Completion/Embeddings API integration

Current Version: 0.2.2

Installation of Prerequisites

IMPORTANT: Make sure that the enable-shared option is enabled in your Python installation. You can use pyenv to install any version of Python you like. Install Python 3.10.6, for instance, using pyenv with enable-shared as follows:

$ env CONFIGURE_OPTS="--enable-shared" pyenv install 3.10.6

Remember to make this Python accessible from your working directory. The simplest way is to set the pyenv global version to the Python you just installed:

$ pyenv global 3.10.6 
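To double-check that the Python you just installed really provides a shared libpython (which PyCall requires), you can query Python's build configuration from Ruby. This is only a diagnostic sketch; it assumes the pyenv-installed python is first on your PATH:

```ruby
# Ask the python on PATH whether it was configured with --enable-shared.
# Python's sysconfig reports Py_ENABLE_SHARED as 1 for shared builds.
flag = `python -c "import sysconfig; print(sysconfig.get_config_var('Py_ENABLE_SHARED'))"`.strip

puts(flag == "1" ? "shared libpython: OK" : "not a shared build (got #{flag.inspect})")
```

If this reports that the build is not shared, reinstall Python with CONFIGURE_OPTS="--enable-shared" as shown above.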

Then, install spaCy. If you use pip, the following command will do:

$ pip install spacy

Install trained language models. To get started, en_core_web_sm is the most useful for basic text processing in English. However, if you want to use advanced features of spaCy, such as named entity recognition or document similarity calculation, you should also install a larger model like en_core_web_lg.

$ python -m spacy download en_core_web_sm
$ python -m spacy download en_core_web_lg

See spaCy: Models & Languages for models in other languages. To install models for Japanese, for instance, run:

$ python -m spacy download ja_core_news_sm
$ python -m spacy download ja_core_news_lg

Installation of ruby-spacy

Add this line to your application's Gemfile:

gem 'ruby-spacy'

And then execute:

$ bundle install

Or install it yourself as:

$ gem install ruby-spacy

Usage

See Examples below.

Examples

Many of the following examples are Python-to-Ruby translations of code snippets in spaCy 101. For more examples, look inside the examples directory.

Tokenization

spaCy: Tokenization

Ruby code:

require "ruby-spacy"
require "terminal-table"

nlp = Spacy::Language.new("en_core_web_sm")

doc = nlp.read("Apple is looking at buying U.K. startup for $1 billion")

row = []

doc.each do |token|
  row << token.text
end

headings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
table = Terminal::Table.new rows: [row], headings: headings

puts table

Output:

+-------+----+---------+----+--------+------+---------+-----+---+----+---------+
| 1     | 2  | 3       | 4  | 5      | 6    | 7       | 8   | 9 | 10 | 11      |
+-------+----+---------+----+--------+------+---------+-----+---+----+---------+
| Apple | is | looking | at | buying | U.K. | startup | for | $ | 1  | billion |
+-------+----+---------+----+--------+------+---------+-----+---+----+---------+

Part-of-speech and Dependency

spaCy: Part-of-speech tags and dependencies

Ruby code:

require "ruby-spacy"
require "terminal-table"

nlp = Spacy::Language.new("en_core_web_sm")
doc = nlp.read("Apple is looking at buying U.K. startup for $1 billion")

headings = ["text", "lemma", "pos", "tag", "dep"]
rows = []

doc.each do |token|
  rows << [token.text, token.lemma, token.pos, token.tag, token.dep]
end

table = Terminal::Table.new rows: rows, headings: headings
puts table

Output:

+---------+---------+-------+-----+----------+
| text    | lemma   | pos   | tag | dep      |
+---------+---------+-------+-----+----------+
| Apple   | Apple   | PROPN | NNP | nsubj    |
| is      | be      | AUX   | VBZ | aux      |
| looking | look    | VERB  | VBG | ROOT     |
| at      | at      | ADP   | IN  | prep     |
| buying  | buy     | VERB  | VBG | pcomp    |
| U.K.    | U.K.    | PROPN | NNP | dobj     |
| startup | startup | NOUN  | NN  | advcl    |
| for     | for     | ADP   | IN  | prep     |
| $       | $       | SYM   | $   | quantmod |
| 1       | 1       | NUM   | CD  | compound |
| billion | billion | NUM   | CD  | pobj     |
+---------+---------+-------+-----+----------+

Part-of-speech and Dependency (Japanese)

Ruby code:

require "ruby-spacy"
require "terminal-table"

nlp = Spacy::Language.new("ja_core_news_lg")
doc = nlp.read("任天堂は1983年にファミコンを14,800円で発売した。")

headings = ["text", "lemma", "pos", "tag", "dep"]
rows = []

doc.each do |token|
  rows << [token.text, token.lemma, token.pos, token.tag, token.dep]
end

table = Terminal::Table.new rows: rows, headings: headings
puts table

Output:

+------------+------------+-------+--------------------------+--------+
| text       | lemma      | pos   | tag                      | dep    |
+------------+------------+-------+--------------------------+--------+
| 任天堂     | 任天堂     | PROPN | 名詞-固有名詞-一般       | nsubj  |
| は         | は         | ADP   | 助詞-係助詞              | case   |
| 1983       | 1983       | NUM   | 名詞-数詞                | nummod |
| 年         | 年         | NOUN  | 名詞-普通名詞-助数詞可能 | obl    |
| に         | に         | ADP   | 助詞-格助詞              | case   |
| ファミコン | ファミコン | NOUN  | 名詞-普通名詞-一般       | obj    |
| を         | を         | ADP   | 助詞-格助詞              | case   |
| 14,800     | 14,800     | NUM   | 名詞-数詞                | fixed  |
| 円         | 円         | NOUN  | 名詞-普通名詞-助数詞可能 | obl    |
| で         | で         | ADP   | 助詞-格助詞              | case   |
| 発売       | 発売       | VERB  | 名詞-普通名詞-サ変可能   | ROOT   |
| し         | する       | AUX   | 動詞-非自立可能          | aux    |
| た         | た         | AUX   | 助動詞                   | aux    |
| 。         | 。         | PUNCT | 補助記号-句点            | punct  |
+------------+------------+-------+--------------------------+--------+

Morphology

POS and morphology tags

Ruby code:

require "ruby-spacy"
require "terminal-table"

nlp = Spacy::Language.new("en_core_web_sm")
doc = nlp.read("Apple is looking at buying U.K. startup for $1 billion")

headings = ["text", "shape", "is_alpha", "is_stop", "morphology"]
rows = []

doc.each do |token|
  morph = token.morphology.map do |k, v|
    "#{k} = #{v}"
  end.join("\n")
  rows << [token.text, token.shape, token.is_alpha, token.is_stop, morph]
end

table = Terminal::Table.new rows: rows, headings: headings
puts table

Output:

+---------+-------+----------+---------+---------------------------------------------------------------------+
| text    | shape | is_alpha | is_stop | morphology                                                          |
+---------+-------+----------+---------+---------------------------------------------------------------------+
| Apple   | Xxxxx | true     | false   | NounType = Prop, Number = Sing                                      |
| is      | xx    | true     | true    | Mood = Ind, Number = Sing, Person = 3, Tense = Pres, VerbForm = Fin |
| looking | xxxx  | true     | false   | Aspect = Prog, Tense = Pres, VerbForm = Part                        |
| at      | xx    | true     | true    |                                                                     |
| buying  | xxxx  | true     | false   | Aspect = Prog, Tense = Pres, VerbForm = Part                        |
| U.K.    | X.X.  | false    | false   | NounType = Prop, Number = Sing                                      |
| startup | xxxx  | true     | false   | Number = Sing                                                       |
| for     | xxx   | true     | true    |                                                                     |
| $       | $     | false    | false   |                                                                     |
| 1       | d     | false    | false   | NumType = Card                                                      |
| billion | xxxx  | true     | false   | NumType = Card                                                      |
+---------+-------+----------+---------+---------------------------------------------------------------------+

Visualizing Dependency

spaCy: Visualizers

Ruby code:

require "ruby-spacy"

nlp = Spacy::Language.new("en_core_web_sm")

sentence = "Autonomous cars shift insurance liability toward manufacturers"
doc = nlp.read(sentence)

dep_svg = doc.displacy(style: "dep", compact: false)

File.open("test_dep.svg", "w") do |file|
  file.write(dep_svg)
end

Output: the dependency tree is written to test_dep.svg as an SVG image.

Visualizing Dependency (Compact)

Ruby code:

require "ruby-spacy"

nlp = Spacy::Language.new("en_core_web_sm")

sentence = "Autonomous cars shift insurance liability toward manufacturers"
doc = nlp.read(sentence)

dep_svg = doc.displacy(style: "dep", compact: true)

File.open("test_dep_compact.svg", "w") do |file|
  file.write(dep_svg)
end

Output: a more compact dependency tree is written to test_dep_compact.svg as an SVG image.

Named Entity Recognition

spaCy: Named entities

Ruby code:

require "ruby-spacy"
require "terminal-table"

nlp = Spacy::Language.new("en_core_web_sm")
doc = nlp.read("Apple is looking at buying U.K. startup for $1 billion")

rows = []

doc.ents.each do |ent|
  rows << [ent.text, ent.start_char, ent.end_char, ent.label]
end

headings = ["text", "start_char", "end_char", "label"]
table = Terminal::Table.new rows: rows, headings: headings
puts table

Output:

+------------+------------+----------+-------+
| text       | start_char | end_char | label |
+------------+------------+----------+-------+
| Apple      | 0          | 5        | ORG   |
| U.K.       | 27         | 31       | GPE   |
| $1 billion | 44         | 54       | MONEY |
+------------+------------+----------+-------+

Named Entity Recognition (Japanese)

Ruby code:

require "ruby-spacy"
require "terminal-table"

nlp = Spacy::Language.new("ja_core_news_lg")

sentence = "任天堂は1983年にファミコンを14,800円で発売した。"
doc = nlp.read(sentence)

rows = []

doc.ents.each do |ent|
  rows << [ent.text, ent.start_char, ent.end_char, ent.label]
end

headings = ["text", "start", "end", "label"]
table = Terminal::Table.new rows: rows, headings: headings
puts table

Output:

+------------+-------+-----+---------+
| text       | start | end | label   |
+------------+-------+-----+---------+
| 任天堂     | 0     | 3   | ORG     |
| 1983年     | 4     | 9   | DATE    |
| ファミコン | 10    | 15  | PRODUCT |
| 14,800円   | 16    | 23  | MONEY   |
+------------+-------+-----+---------+

Checking Availability of Word Vectors

spaCy: Word vectors and similarity

Ruby code:

require "ruby-spacy"
require "terminal-table"

nlp = Spacy::Language.new("en_core_web_lg")
doc = nlp.read("dog cat banana afskfsd")

rows = []

doc.each do |token|
  rows << [token.text, token.has_vector, token.vector_norm, token.is_oov]
end

headings = ["text", "has_vector", "vector_norm", "is_oov"]
table = Terminal::Table.new rows: rows, headings: headings
puts table

Output:

+---------+------------+-------------+--------+
| text    | has_vector | vector_norm | is_oov |
+---------+------------+-------------+--------+
| dog     | true       | 7.0336733   | false  |
| cat     | true       | 6.6808186   | false  |
| banana  | true       | 6.700014    | false  |
| afskfsd | false      | 0.0         | true   |
+---------+------------+-------------+--------+

Similarity Calculation

Ruby code:

require "ruby-spacy"

nlp = Spacy::Language.new("en_core_web_lg")
doc1 = nlp.read("I like salty fries and hamburgers.")
doc2 = nlp.read("Fast food tastes very good.")

puts "Doc 1: " + doc1.text
puts "Doc 2: " + doc2.text
puts "Similarity: #{doc1.similarity(doc2)}"

Output:

Doc 1: I like salty fries and hamburgers.
Doc 2: Fast food tastes very good.
Similarity: 0.7687607012190486
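Under the hood, the similarity score is the cosine similarity between the two documents' vectors (for models like en_core_web_lg, a document's vector is the average of its token vectors). As a minimal illustration in plain Ruby, with made-up 3-dimensional vectors rather than real spaCy output:

```ruby
# Cosine similarity: dot product divided by the product of the vector norms.
def cosine(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

# Parallel vectors score ~1.0; orthogonal vectors score 0.0.
puts cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
puts cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

The real computation is the same, just over 300-dimensional vectors learned from corpus data.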

Similarity Calculation (Japanese)

Ruby code:

require "ruby-spacy"

nlp = Spacy::Language.new("ja_core_news_lg")
ja_doc1 = nlp.read("今日は雨ばっかり降って、嫌な天気ですね。")
puts "doc1: #{ja_doc1.text}"
ja_doc2 = nlp.read("あいにくの悪天候で残念です。")
puts "doc2: #{ja_doc2.text}"
puts "Similarity: #{ja_doc1.similarity(ja_doc2)}"

Output:

doc1: 今日は雨ばっかり降って、嫌な天気ですね。
doc2: あいにくの悪天候で残念です。
Similarity: 0.8684192637149641

Word Vector Calculation

Tokyo - Japan + France = Paris ?

Ruby code:

require "ruby-spacy"
require "terminal-table"

nlp = Spacy::Language.new("en_core_web_lg")

tokyo = nlp.get_lexeme("Tokyo")
japan = nlp.get_lexeme("Japan")
france = nlp.get_lexeme("France")

query = tokyo.vector - japan.vector + france.vector

headings = ["rank", "text", "score"]
rows = []

results = nlp.most_similar(query, 10)
results.each_with_index do |lexeme, i|
  index = (i + 1).to_s
  rows << [index, lexeme.text, lexeme.score]
end

table = Terminal::Table.new rows: rows, headings: headings
puts table

Output:

+------+-----------+--------------------+
| rank | text      | score              |
+------+-----------+--------------------+
| 1    | FRANCE    | 0.8346999883651733 |
| 2    | France    | 0.8346999883651733 |
| 3    | france    | 0.8346999883651733 |
| 4    | PARIS     | 0.7703999876976013 |
| 5    | paris     | 0.7703999876976013 |
| 6    | Paris     | 0.7703999876976013 |
| 7    | TOULOUSE  | 0.6381999850273132 |
| 8    | Toulouse  | 0.6381999850273132 |
| 9    | toulouse  | 0.6381999850273132 |
| 10   | marseille | 0.6370999813079834 |
+------+-----------+--------------------+

Word Vector Calculation (Japanese)

東京 - 日本 + フランス = パリ ?

Ruby code:

require "ruby-spacy"
require "terminal-table"

nlp = Spacy::Language.new("ja_core_news_lg")

tokyo = nlp.get_lexeme("東京")
japan = nlp.get_lexeme("日本")
france = nlp.get_lexeme("フランス")

query = tokyo.vector - japan.vector + france.vector

headings = ["rank", "text", "score"]
rows = []

results = nlp.most_similar(query, 10)
results.each_with_index do |lexeme, i|
  index = (i + 1).to_s
  rows << [index, lexeme.text, lexeme.score]
end

table = Terminal::Table.new rows: rows, headings: headings
puts table

Output:

+------+----------------+--------------------+
| rank | text           | score              |
+------+----------------+--------------------+
| 1    | パリ           | 0.7376999855041504 |
| 2    | フランス       | 0.7221999764442444 |
| 3    | 東京           | 0.6697999835014343 |
| 4    | ストラスブール | 0.631600022315979  |
| 5    | リヨン         | 0.5939000248908997 |
| 6    | Paris          | 0.574400007724762  |
| 7    | ベルギー       | 0.5683000087738037 |
| 8    | ニース         | 0.5679000020027161 |
| 9    | アルザス       | 0.5644999742507935 |
| 10   | 南仏           | 0.5547999739646912 |
+------+----------------+--------------------+

OpenAI API Integration

⚠️ This feature is currently experimental. Details are subject to change. Please refer to OpenAI's API reference and Ruby OpenAI for available parameters (max_tokens, temperature, etc).

Easily leverage GPT models within ruby-spacy by using an OpenAI API key. When constructing prompts for the Doc::openai_query method, you can incorporate the token properties of the document. These properties are retrieved through function calls (made internally by GPT when necessary) and seamlessly integrated into your prompt. Note that function calling requires gpt-3.5-turbo-0613 or later. The available properties are: surface, lemma, tag, pos, dep, ent_type, and morphology.

GPT Prompting (Translation)

Ruby code:


require "ruby-spacy"

api_key = ENV["OPENAI_API_KEY"]
nlp = Spacy::Language.new("en_core_web_sm")
doc = nlp.read("The Beatles released 12 studio albums")

# default parameter values
# max_tokens: 1000
# temperature: 0.7
# model: "gpt-3.5-turbo-0613"
res1 = doc.openai_query(
  access_token: api_key,
  prompt: "Translate the text to Japanese."
)
puts res1

Output:

ビートルズは12枚のスタジオアルバムをリリースしました。

GPT Prompting (Elaboration)

Ruby code:

require "ruby-spacy"

api_key = ENV["OPENAI_API_KEY"]
nlp = Spacy::Language.new("en_core_web_sm")
doc = nlp.read("The Beatles were an English rock band formed in Liverpool in 1960.")

# default parameter values
# max_tokens: 1000
# temperature: 0.7
# model: "gpt-3.5-turbo-0613"
res = doc.openai_query(
  access_token: api_key,
  prompt: "Extract the topic of the document and list 10 entities (names, concepts, locations, etc.) that are relevant to the topic."
)
puts res

Output:

Topic: The Beatles

Entities:

  1. The Beatles (band)
  2. English (nationality)
  3. Rock band
  4. Liverpool (city)
  5. 1960 (year)
  6. John Lennon (member)
  7. Paul McCartney (member)
  8. George Harrison (member)
  9. Ringo Starr (member)
  10. Music

GPT Prompting (JSON Output Using RAG with Token Properties)

Ruby code:

require "ruby-spacy"

api_key = ENV["OPENAI_API_KEY"]
nlp = Spacy::Language.new("en_core_web_sm")
doc = nlp.read("The Beatles released 12 studio albums")

# default parameter values
# max_tokens: 1000
# temperature: 0.7
# model: "gpt-3.5-turbo-0613"
res = doc.openai_query(
  access_token: api_key,
  prompt: "List token data of each of the words used in the sentence. Add 'meaning' property and value (brief semantic definition) to each token data. Output as a JSON object."
)
puts res

Output:

{
  "tokens": [
    {
      "surface": "The",
      "lemma": "the",
      "pos": "DET",
      "tag": "DT",
      "dep": "det",
      "ent_type": "",
      "morphology": "{'Definite': 'Def', 'PronType': 'Art'}",
      "meaning": "Used to refer to one or more people or things already mentioned or assumed to be common knowledge"
    },
    {
      "surface": "Beatles",
      "lemma": "beatle",
      "pos": "NOUN",
      "tag": "NNS",
      "dep": "nsubj",
      "ent_type": "GPE",
      "morphology": "{'Number': 'Plur'}",
      "meaning": "A British rock band formed in Liverpool in 1960"
    },
    {
      "surface": "released",
      "lemma": "release",
      "pos": "VERB",
      "tag": "VBD",
      "dep": "ROOT",
      "ent_type": "",
      "morphology": "{'Tense': 'Past', 'VerbForm': 'Fin'}",
      "meaning": "To make something available or known to the public"
    },
    {
      "surface": "12",
      "lemma": "12",
      "pos": "NUM",
      "tag": "CD",
      "dep": "nummod",
      "ent_type": "CARDINAL",
      "morphology": "{'NumType': 'Card'}",
      "meaning": "A number representing a quantity"
    },
    {
      "surface": "studio",
      "lemma": "studio",
      "pos": "NOUN",
      "tag": "NN",
      "dep": "compound",
      "ent_type": "",
      "morphology": "{'Number': 'Sing'}",
      "meaning": "A place where creative work is done"
    },
    {
      "surface": "albums",
      "lemma": "album",
      "pos": "NOUN",
      "tag": "NNS",
      "dep": "dobj",
      "ent_type": "",
      "morphology": "{'Number': 'Plur'}",
      "meaning": "A collection of musical or spoken recordings"
    }
  ]
}

GPT Prompting (Generate a Syntax Tree Using Token Properties)

Ruby code:

require "ruby-spacy"

api_key = ENV["OPENAI_API_KEY"]
nlp = Spacy::Language.new("en_core_web_sm")
doc = nlp.read("The Beatles released 12 studio albums")

# default parameter values
# max_tokens: 1000
# temperature: 0.7
res = doc.openai_query(
  access_token: api_key,
  model: "gpt-4",
  prompt: "Generate a tree diagram from the text using given token data. Use the following bracketing style: [S [NP [Det the] [N cat]] [VP [V sat] [PP [P on] [NP the mat]]]]"
)
puts res

Output:

[S
  [NP
    [Det The]
    [N Beatles]
  ]
  [VP
    [V released]
    [NP
      [Num 12]
      [N
        [N studio]
        [N albums]
      ]
    ]
  ]
]

GPT Text Completion

Ruby code:

require "ruby-spacy"

api_key = ENV["OPENAI_API_KEY"]
nlp = Spacy::Language.new("en_core_web_sm")
doc = nlp.read("Vladimir Nabokov was a")

# default parameter values
# max_tokens: 1000
# temperature: 0.7
# model: "gpt-3.5-turbo-0613"
res = doc.openai_completion(access_token: api_key)
puts res

Output:

Russian-American novelist and lepidopterist. He was born in 1899 in St. Petersburg, Russia, and later emigrated to the United States in 1940. Nabokov is best known for his novel "Lolita," which was published in 1955 and caused much controversy due to its controversial subject matter. Throughout his career, Nabokov wrote many other notable works, including "Pale Fire" and "Ada or Ardor: A Family Chronicle." In addition to his writing, Nabokov was also a passionate butterfly collector and taxonomist, publishing several scientific papers on the subject. He passed away in 1977, leaving behind a rich literary legacy.

Text Embeddings

Ruby code:

require "ruby-spacy"

api_key = ENV["OPENAI_API_KEY"]
nlp = Spacy::Language.new("en_core_web_sm")
doc = nlp.read("Vladimir Nabokov was a Russian-American novelist, poet, translator and entomologist.")

# default model: text-embedding-ada-002
res = doc.openai_embeddings(access_token: api_key)

puts res

Output:

-0.00208362
-0.01645165
 0.0110955965
 0.012802119
 0.0012175755
 ...

Author

Yoichiro Hasebe [yohasebe@gmail.com]

Acknowledgments

I would like to thank the following open source projects and their creators for making this project possible:

License

This library is available as open source under the terms of the MIT License.