Word Count Analyzer

See what word count gray areas might be affecting your word count.

Word Count Analyzer is a Ruby gem that analyzes a string for potential areas of the text that might cause word count discrepancies depending on the tool used. It also provides comprehensive configuration options so you can easily customize how different gray areas should be counted and find the right word count for your purposes.

If you prioritize speed over accuracy, then I recommend not using this gem. There are most definitely faster gems for getting a word count. However, if accuracy is important, and you want control over the gray areas that affect word count, then this gem is for you.

Install

Ruby
Supports Ruby 2.1.0 and above

gem install word_count_analyzer

Ruby on Rails
Add this line to your application’s Gemfile:

gem 'word_count_analyzer'

Live Demo

Try out a live demo of Word Count Analyzer in the browser.

Usage

Analyze the word count gray areas of a string

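Common word count gray areas include (more details below): ellipses, hyperlinks, contractions, hyphenated words, dates, numbers, numbered lists, XML and HTML tags, forward slashes and backslashes, and punctuation (dotted lines, dashed lines, underscores, and stray punctuation marks).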

Other gray areas not covered by this gem:

text = "This string has a date: Monday, November 3rd, 2011. I was thinking... it also shouldn't have too many contractions, maybe 4. <html> Some HTML and a hyphenated-word</html>. Don't count stray punctuation ? ? ? Please visit the ____________ ------------ ........ go-to site: https://www.example-site.com today. Let's add a list 1. item a 2. item b 3. item c. Now let's add he/she/it or a c:\\Users\\john. 2/15/2012 is the date! { HYPERLINK 'http://www.hello.com' }"
WordCountAnalyzer::Analyzer.new.analyze(text)

# =>   {
#        "ellipsis": 1,
#        "hyperlink": 2,
#        "contraction": 4,
#        "hyphenated_word": 2, 
#        "date": 2,
#        "number": 1,
#        "numbered_list": 3,
#        "xhtml": 1,
#        "forward_slash": 1,
#        "backslash": 1,
#        "dotted_line": 1,
#        "dashed_line": 1,
#        "underscore": 1,
#        "stray_punctuation": 5
#      }

Count the words in a string

text = "This string has a date: Monday, November 3rd, 2011. I was thinking... it also shouldn't have too many contractions, maybe 2. <html> Some HTML and a hyphenated-word</html>. Don't count punctuation ? ? ? Please visit the ____________ ------------ ........ go-to site: https://www.example-site.com today. Let's add a list \n\n1. item a \n\n2. item b \n\n3. item c. Now let's add he/she/it or a c:\\Users\\john. 2/15/2012 is the date! { HYPERLINK 'http://www.hello.com' }"

WordCountAnalyzer::Counter.new.count(text)
# => 64

# Overrides all settings to match the way Pages handles word count. 
# N.B. The developers of Pages may change the algorithm at any time, so this should be treated as an approximation only.
WordCountAnalyzer::Counter.new.pages_count(text)
# => 76 (or 79 if the list items are not formatted as a list)

# Overrides all settings to match the way Microsoft Word and wc (Unix) handle word count. 
# N.B. The developers of these tools may change the algorithm at any time, so this should be treated as an approximation only.
WordCountAnalyzer::Counter.new.mword_count(text)
# => 71

# Highly configurable (see all options below)
WordCountAnalyzer::Counter.new(
  ellipsis: 'no_special_treatment',
  hyperlink: 'no_special_treatment',
  contraction: 'count_as_multiple',
  hyphenated_word: 'count_as_multiple',
  date: 'count_as_one',
  number: 'ignore',
  numbered_list: 'ignore',
  xhtml: 'keep',
  forward_slash: 'count_as_multiple',
  backslash: 'count_as_multiple',
  dotted_line: 'count',
  dashed_line: 'count',
  underscore: 'count',
  stray_punctuation: 'count'
).count(text)

# => 77

Counter options

Option             Default
ellipsis           'ignore'
hyperlink          'count_as_one'
contraction        'count_as_one'
hyphenated_word    'count_as_one'
date               'no_special_treatment'
number             'count'
numbered_list      'count'
xhtml              'remove'
forward_slash      'count_as_multiple_except_dates'
backslash          'count_as_one'
dotted_line        'ignore'
dashed_line        'ignore'
underscore         'ignore'
stray_punctuation  'ignore'
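
When only a few options differ from the defaults, it can help to start from the full default set and override selectively. The sketch below is not part of the gem: `DEFAULT_COUNTER_OPTIONS` and `counter_options` are hypothetical names, and the default values are simply copied from the list above.

```ruby
# Defaults copied from the option list above.
DEFAULT_COUNTER_OPTIONS = {
  ellipsis: 'ignore',
  hyperlink: 'count_as_one',
  contraction: 'count_as_one',
  hyphenated_word: 'count_as_one',
  date: 'no_special_treatment',
  number: 'count',
  numbered_list: 'count',
  xhtml: 'remove',
  forward_slash: 'count_as_multiple_except_dates',
  backslash: 'count_as_one',
  dotted_line: 'ignore',
  dashed_line: 'ignore',
  underscore: 'ignore',
  stray_punctuation: 'ignore'
}.freeze

# Hypothetical helper (not part of the gem): merges overrides onto the
# defaults and rejects misspelled option names.
def counter_options(overrides = {})
  unknown = overrides.keys - DEFAULT_COUNTER_OPTIONS.keys
  raise ArgumentError, "unknown option(s): #{unknown.join(', ')}" unless unknown.empty?

  DEFAULT_COUNTER_OPTIONS.merge(overrides)
end

options = counter_options(hyphenated_word: 'count_as_multiple', xhtml: 'keep')
puts options[:hyphenated_word] # prints "count_as_multiple"
puts options[:ellipsis]        # prints "ignore"

# With the gem loaded, the hash could then be passed on, for example:
#   WordCountAnalyzer::Counter.new(**options).count(text)
```

Validating the keys up front means a typo such as `hyphenated_words:` fails loudly instead of silently leaving the default in place.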

Gray Area Details

Ellipsis

Checks for any occurrences of ellipses in your text. Writers use several different formats for an ellipsis, and although style guides exist, their rules are rarely followed.

Three Consecutive Periods
...
Tool            Word Count
Microsoft Word  1
Pages           0
wc (Unix)       1

Four Consecutive Periods
....
Tool            Word Count
Microsoft Word  1
Pages           0
wc (Unix)       1

Three Periods With Spaces
 . . .
Tool            Word Count
Microsoft Word  3
Pages           0
wc (Unix)       3

Four Periods With Spaces
 . . . .
Tool            Word Count
Microsoft Word  4
Pages           0
wc (Unix)       4

Horizontal Ellipsis
…
Tool            Word Count
Microsoft Word  1
Pages           0
wc (Unix)       1
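
The wc (Unix) figures above follow from counting whitespace-delimited tokens, so an ellipsis written with spaces becomes several "words". A minimal sketch in plain Ruby, where `whitespace_word_count` is a hypothetical helper and not part of the gem:

```ruby
# Hypothetical helper (not part of the gem): counts whitespace-delimited
# tokens, which is how wc (Unix) counts words.
def whitespace_word_count(text)
  text.split.size
end

samples = {
  'three consecutive periods' => '...',
  'four consecutive periods'  => '....',
  'three periods with spaces' => ' . . .',
  'four periods with spaces'  => ' . . . .',
  'horizontal ellipsis'       => "\u2026"
}

samples.each do |label, text|
  puts "#{label}: #{whitespace_word_count(text)}"
end
```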

Hyperlink

http://www.example.com
Tool            Word Count
Microsoft Word  1
Pages           4
wc (Unix)       1

Contraction

Most tools count contractions as one word. Some might argue a contraction is technically more than one word.

can't
Tool            Word Count
Microsoft Word  1
Pages           1
wc (Unix)       1

Hyphenated Word

devil-may-care
Tool            Word Count
Microsoft Word  1
Pages           3
wc (Unix)       1

Date

Most word processing tools do not recognize dates, but computer-assisted translation (CAT) tools tend to treat a date as one word or placeable. This gem checks for many date formats, including those that use day or month abbreviations. A few examples are listed below (not an exhaustive list).

Date (example A)
Monday, April 4th, 2011
Tool            Word Count
Microsoft Word  4
Pages           4
wc (Unix)       4

Date (example B)
04/04/2011
Tool            Word Count
Microsoft Word  1
Pages           3
wc (Unix)       1

Date (example C)
04.04.2011
Tool            Word Count
Microsoft Word  1
Pages           1
wc (Unix)       1
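
To illustrate what counting a date as one word involves, the sketch below collapses the three example formats into a single placeholder token before counting. It is only an approximation under stated assumptions: the patterns and the `collapse_dates` helper are hypothetical, cover far fewer formats than the gem, and are not the gem's actual regular expressions.

```ruby
# Illustrative patterns only; the gem recognizes many more formats.
MONTHS = 'January|February|March|April|May|June|July|August|' \
         'September|October|November|December'
DAYS   = 'Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday'

DATE_PATTERNS = [
  # e.g. "Monday, April 4th, 2011" (the day name is optional)
  /(?:(?:#{DAYS}),\s+)?(?:#{MONTHS})\s+\d{1,2}(?:st|nd|rd|th)?,\s+\d{4}/,
  # e.g. "04/04/2011" or "04.04.2011" (same separator in both positions)
  /\b\d{1,2}([\/.])\d{1,2}\1\d{4}\b/
].freeze

# Hypothetical helper: replaces each date with one placeholder token so that
# a whitespace-based count treats it as a single word.
def collapse_dates(text)
  DATE_PATTERNS.reduce(text) { |result, pattern| result.gsub(pattern, 'DATE') }
end

text = 'Signed on Monday, April 4th, 2011 and filed on 04/04/2011.'
puts text.split.size                 # prints 10
puts collapse_dates(text).split.size # prints 7
```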

Number

Simple number
200
Tool            Word Count
Microsoft Word  1
Pages           1
wc (Unix)       1

Number with preceding unit
$200
Tool            Word Count
Microsoft Word  1
Pages           1
wc (Unix)       1

Number with unit following
50%
Tool            Word Count
Microsoft Word  1
Pages           1
wc (Unix)       1

Numbered List

1. List item a 
2. List item b
3. List item c
Tool            Word Count
Microsoft Word  12
Pages           9
wc (Unix)       12
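
The gap between 12 and 9 in this example is exactly the three list markers. A rough sketch of ignoring them, assuming each item starts on its own line; `strip_list_markers` is a hypothetical helper, not the gem's implementation:

```ruby
# Hypothetical helper: drops a leading "1." style marker from each line.
def strip_list_markers(text)
  text.gsub(/^\s*\d+\.\s+/, '')
end

list = "1. List item a\n2. List item b\n3. List item c"
puts list.split.size                     # prints 12
puts strip_list_markers(list).split.size # prints 9
```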

XML and HTML Tags

<span class="large-text">Hello world</span> <new-tag>Hello</new-tag>
Tool            Word Count
Microsoft Word  4
Pages           12
wc (Unix)       4
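
The three counts below show why markup is a gray area. This is a naive sketch in plain Ruby: `strip_tags` is a hypothetical helper, a real HTML parser would be more robust, and none of this is the gem's own code. Splitting on every non-alphanumeric character happens to reproduce the Pages figure for this sample, which is an observation about the numbers rather than a claim about how Pages works.

```ruby
# Naive tag stripper for illustration; a real HTML parser is more robust.
def strip_tags(text)
  text.gsub(/<[^>]+>/, ' ')
end

markup = '<span class="large-text">Hello world</span> <new-tag>Hello</new-tag>'

puts markup.split.size                # prints 4  (whitespace-delimited tokens)
puts strip_tags(markup).split.size    # prints 3  (tags removed first)
puts markup.scan(/[[:alnum:]]+/).size # prints 12 (every alphanumeric run)
```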

Slashes

Forward slash
she/he/it
Tool            Word Count
Microsoft Word  1
Pages           3
wc (Unix)       1

Backslash
c:\Users\johndoe
Tool            Word Count
Microsoft Word  1
Pages           3
wc (Unix)       1
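
One way to picture a "count as multiple, except dates" rule for forward slashes is sketched below. `DATE_LIKE` and `count_with_slash_split` are hypothetical names, the date check only covers purely numeric dates, and the gem's actual logic may differ.

```ruby
# Matches purely numeric dates such as 2/15/2012 (illustrative only).
DATE_LIKE = /\A\d{1,2}\/\d{1,2}\/\d{2,4}\z/

# Hypothetical helper: counts whitespace-delimited tokens, splitting each one
# on "/" unless the token (minus trailing punctuation) looks like a date.
def count_with_slash_split(text)
  text.split.sum do |token|
    core = token.sub(/[.,;:!?]+\z/, '')
    core.match?(DATE_LIKE) ? 1 : core.split('/').reject(&:empty?).size
  end
end

puts count_with_slash_split('she/he/it')       # prints 3
puts count_with_slash_split('2/15/2012')       # prints 1
puts 'c:\Users\johndoe'.split('\\').size       # prints 3 (same idea for backslashes)
```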

Punctuation

Dotted line
.........
Tool            Word Count
Microsoft Word  1
Pages           0
wc (Unix)       1

………………………
Tool            Word Count
Microsoft Word  1
Pages           0
wc (Unix)       1

Dashed line
-----------
Tool            Word Count
Microsoft Word  1
Pages           0
wc (Unix)       1

Underscore
____________
Tool            Word Count
Microsoft Word  1
Pages           0
wc (Unix)       1

Punctuation mark surrounded by spaces
 : 
Tool            Word Count
Microsoft Word  1
Pages           0
wc (Unix)       1
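
Ignoring these cases amounts to skipping any token that contains no letter or digit. A minimal sketch, assuming whitespace tokenization; `count_ignoring_punctuation_runs` is a hypothetical helper and not how the gem is necessarily implemented:

```ruby
# Hypothetical helper: counts whitespace-delimited tokens, skipping tokens
# that contain no letter or digit (dotted lines, dashed lines, underscores,
# stray punctuation).
def count_ignoring_punctuation_runs(text)
  text.split.count { |token| token.match?(/[[:alnum:]]/) }
end

text = 'Name: ____________ Date: ----------- Notes ......... end ? ok'
puts text.split.size                       # prints 9
puts count_ignoring_punctuation_runs(text) # prints 5
```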

Research

TODO

Contributing

  1. Fork it ( https://github.com/diasks2/word_count_analyzer/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

License

The MIT License (MIT)

Copyright (c) 2015 Kevin S. Dias

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.