# Textoken

Textoken is a Ruby library for text tokenization. The gem extracts words from text and offers many customization options. It is useful in fields such as web crawling and natural language processing.

## Basic Usage

require 'textoken'

Textoken('Software is like sex: it\'s better when it\'s free. \'Linus Torvalds\'').tokens
# => ["Software", "is", "like", "sex", ":", "it", "'", "s", "better", "when", "it", "'", "s", "free", ".", "'", "Linus", "Torvalds", "'"]

Textoken('Oh, no! Alfa is at home.').tokens
# => ["Oh", ",", "no", "!", "Alfa", "is", "at", "home", "."]

Textoken('Oh, no! Alfa is at home.').words
# => ["Oh,", "no!", "Alfa", "is", "at", "home."]
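To make the difference between the two result shapes concrete, here is a minimal plain-Ruby sketch that reproduces the outputs above for this sentence. It is an approximation built on String#split and String#scan, not Textoken's implementation, and the helper names sketch_words and sketch_tokens are hypothetical.

```ruby
# Hypothetical helpers for illustration only; they are not part of Textoken.

# Split on whitespace, leaving punctuation attached to each word.
def sketch_words(text)
  text.split
end

# Emit each run of letters/digits, and each other non-space character,
# as a separate element.
def sketch_tokens(text)
  text.scan(/[[:alnum:]]+|[^[:alnum:][:space:]]/)
end

p sketch_words('Oh, no! Alfa is at home.')
# => ["Oh,", "no!", "Alfa", "is", "at", "home."]
p sketch_tokens('Oh, no! Alfa is at home.')
# => ["Oh", ",", "no", "!", "Alfa", "is", "at", "home", "."]
```

The sketch ignores every option; it only shows why words keeps "home." whole while tokens yields "home" and "." separately.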

## Customization

require 'textoken'

Textoken('Oh, no! Alfa is at home.', only: 'punctuations').tokens
# => ["Oh", ",", "no", "!", "home", "."]

Textoken('Oh, no! Alfa is at home.', exclude: 'punctuations', more_than: 3).tokens
# => ["Alfa"]

Textoken('Oh, no! Alfa is at 01/01/2000 with $1000.', only: 'dates, numerics').words
# => ["01/01/2000", "$1000."]

Textoken('Oh, no! Alfa 2000 is at home.', only_regexp: '^[0-9]*$').tokens
# => ["2000"]

All options can be combined. The only and exclude options accept several comma-separated values in a single string, for example only: 'punctuations, dates, numerics'.
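The example outputs above are consistent with one reading: filters are applied to whitespace-separated words first, and tokens then splits punctuation off the words that survive. The sketch below assumes that reading. It is inferred from the outputs only, not taken from Textoken's source, and the names sketch_filter, sketch_split, PUNCT and the keyword arguments are hypothetical.

```ruby
# Hypothetical sketch; not Textoken's implementation.
# Assumption: "punctuation" means any character that is neither
# alphanumeric nor whitespace.
PUNCT = /[^[:alnum:][:space:]]/

# Filter whitespace-separated words by the given criteria.
def sketch_filter(text, only_punct: false, exclude_punct: false,
                  more_than: nil, only_regexp: nil)
  words = text.split
  words = words.select { |w| w =~ PUNCT } if only_punct
  words = words.reject { |w| w =~ PUNCT } if exclude_punct
  # Assumption: more_than means strictly longer than the given length.
  words = words.select { |w| w.length > more_than } if more_than
  words = words.grep(Regexp.new(only_regexp)) if only_regexp
  words
end

# Split punctuation off the surviving words, as tokens appears to do.
def sketch_split(words)
  words.flat_map { |w| w.scan(/[[:alnum:]]+|[^[:alnum:][:space:]]/) }
end

text = 'Oh, no! Alfa is at home.'
p sketch_split(sketch_filter(text, only_punct: true))
# => ["Oh", ",", "no", "!", "home", "."]
p sketch_split(sketch_filter(text, exclude_punct: true, more_than: 3))
# => ["Alfa"]
p sketch_split(sketch_filter('Oh, no! Alfa 2000 is at home.', only_regexp: '^[0-9]*$'))
# => ["2000"]
```

Under this reading, only: 'punctuations' keeps "Oh" and "home" in the token list because the words "Oh," and "home." passed the filter before they were split.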

The public interface of Textoken consists of two methods, tokens and words:

Textoken('Alfa.').tokens
# => ["Alfa", "."]
# tokens splits punctuation into separate elements by default, whereas

Textoken('Alfa.').words
# => ["Alfa."]
# words leaves punctuation attached to the word.

## Current Options

## Option Meanings

## Installation

Add this line to your application's Gemfile:

gem 'textoken'

And then execute:

$ bundle

Or install it yourself as:

$ gem install textoken

## Supported Ruby Versions

This library aims to support and is tested against the following Ruby implementations:

If something doesn't work on one of these versions, it's a bug. This library may also work (or seem to work) on other Ruby versions or implementations. However, support will only be provided for the implementations listed above.

## Contributing

Feel free to add any regexp to lib/regexps/option_values.yml, but please also add a simple test to the 'single options' part of textoken_spec.rb.

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request
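Before adding a pattern to lib/regexps/option_values.yml, it can help to try it against a few sample words in plain Ruby. The sketch below does that for a hypothetical hashtag pattern. The pattern, the sample words, and the helper name matching_words are assumptions for illustration; the layout of the YAML file and of the spec is not shown here.

```ruby
# Hypothetical candidate pattern: a '#' followed by one or more word characters.
CANDIDATE = /\A#\w+\z/

# Return the words that the candidate pattern would select.
def matching_words(words, pattern)
  words.grep(pattern)
end

samples = ['#ruby', 'ruby', '#', '#nlp_2024', 'a#b']
p matching_words(samples, CANDIDATE)
# => ["#ruby", "#nlp_2024"]
```

Checking both matching and non-matching samples like this gives a ready list of cases for the test you add alongside the new regexp.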