Awesome
#Tfidf An Elixir implementation of tf-idf
Based on the blog post by Steven Loria
##What is tf-idf?
tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining.
Installation
defp deps do
[{:tfidf, "~> 0.1.0"}]
end
Usage
Tfidf.calculate(word, text, corpus, tokenize_fn \\ &tokenize(&1))
Calculates the tf-idf for a given word within a text and a corpus (List) of texts.
iex> Tfidf.calculate("dog", "nice dog dog", ["dog hat", "dog", "cat mat", "duck"])
0.19178804830118723
An optional tokenizer function can be passed as the last argument to replace the default tokenizer:
iex> Tfidf.calculate("dog", "nice,dog,dog", ["dog,hat", "dog", "cat,mat", "duck"], &String.split(&1, ","))
0.19178804830118723
=====
Tfidf.calculate(word, tokenized_text, corpus)
Calculates the tf-idf for a given word within a pre-tokenized list and a corpus comprised of pre-tokenized lists.
iex> Tfidf.calculate("dog", ["nice", "dog", "dog"], [["dog", "hat"], ["dog"], ["cat", "mat"], ["duck"]])
0.19178804830118723
=====
Tfidf.calculate_all(text, corpus, tokenize_fn \\ &tokenize(&1))
Calculates the tf-idf for all words in a given text, returns a list of {word, score} tuples.
iex> Tfidf.calculate_all("nice dog", ["dog hat", "dog", "cat mat", "duck"])
[{"nice", 0.6931471805599453}, {"dog", 0.14384103622589042}]
As with Tfidf.calculate/4
an optional tokenizer function can be passed
as the last argument. This will be used in place of the default tokenizer.
iex> Tfidf.calculate_all("nice,dog", ["dog,hat", "dog", "cat,mat", "duck"], &String.split(&1, ","))
[{"nice", 0.6931471805599453}, {"dog", 0.14384103622589042}]