prose

prose is a natural language processing library (English only, at the moment) in pure Go. It supports tokenization, segmentation, part-of-speech tagging, and named-entity extraction.

You can find a more detailed summary of the library's performance here: Introducing prose v2.0.0: Bringing NLP to Go.

Installation

$ go get github.com/jdkato/prose/v2
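
If your project uses Go modules, the command above records the dependency in go.mod; a minimal module file might look like this (the module path and version are placeholders):

module example.com/myapp

go 1.13

require github.com/jdkato/prose/v2 v2.0.0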

Usage

Overview

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("Go is an open-source programming language created at Google.")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag, tok.Label)
        // Go NNP B-GPE
        // is VBZ O
        // an DT O
        // ...
    }

    // Iterate over the doc's named-entities:
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Go GPE
        // Google GPE
    }

    // Iterate over the doc's sentences:
    for _, sent := range doc.Sentences() {
        fmt.Println(sent.Text)
        // Go is an open-source programming language created at Google.
    }
}

The document-creation process adheres to the following sequence of steps:

tokenization -> POS tagging -> NE extraction
            \
             segmentation

Each step may be disabled (assuming later steps aren't required) by passing the appropriate functional option. To disable named-entity extraction, for example, you'd do the following:

doc, err := prose.NewDocument(
        "Go is an open-source programming language created at Google.",
        prose.WithExtraction(false))
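
Options can also be combined. For instance, a tokens-only document can be built by switching off the remaining steps (a sketch using the WithSegmentation, WithTagging, and WithExtraction options):

doc, err := prose.NewDocument(
        "Go is an open-source programming language created at Google.",
        prose.WithSegmentation(false),
        prose.WithTagging(false),
        prose.WithExtraction(false))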

Tokenizing

prose includes a tokenizer capable of processing modern text, including the non-word character spans shown below.

Type            | Example
--------------- | -------------------------------
Email addresses | Jane.Doe@example.com
Hashtags        | #trending
Mentions        | @jdkato
URLs            | https://github.com/jdkato/prose
Emoticons       | :-), >:(, o_0, etc.

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, err := prose.NewDocument("@jdkato, go to http://example.com thanks :).")
    if err != nil {
        log.Fatal(err)
    }

    // Iterate over the doc's tokens:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag)
        // @jdkato NN
        // , ,
        // go VB
        // to TO
        // http://example.com NN
        // thanks NNS
        // :) SYM
        // . .
    }
}

Segmenting

prose includes one of the most accurate sentence segmenters available, according to the Golden Rules (GRS) created by the developers of the pragmatic_segmenter.

Name                | Language | License   | GRS (English)  | GRS (Other) | Speed†
------------------- | -------- | --------- | -------------- | ----------- | -------
Pragmatic Segmenter | Ruby     | MIT       | 98.08% (51/52) | 100.00%     | 3.84 s
prose               | Go       | MIT       | 75.00% (39/52) | N/A         | 0.96 s
TactfulTokenizer    | Ruby     | GNU GPLv3 | 65.38% (34/52) | 48.57%      | 46.32 s
OpenNLP             | Java     | APLv2     | 59.62% (31/52) | 45.71%      | 1.27 s
Stanford CoreNLP    | Java     | GNU GPLv3 | 59.62% (31/52) | 31.43%      | 0.92 s
Splitta             | Python   | APLv2     | 55.77% (29/52) | 37.14%      | N/A
Punkt               | Python   | APLv2     | 46.15% (24/52) | 48.57%      | 1.79 s
SRX English         | Ruby     | GNU GPLv3 | 30.77% (16/52) | 28.57%      | 6.19 s
Scapel              | Ruby     | GNU GPLv3 | 28.85% (15/52) | 20.00%      | 0.13 s

† The original tests were performed using a MacBook Pro 3.7 GHz Quad-Core Intel Xeon E5 running macOS 10.9.5, while prose was timed using a MacBook Pro 2.9 GHz Intel Core i7 running macOS 10.13.3.

package main

import (
    "fmt"
    "strings"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Create a new document with the default configuration:
    doc, _ := prose.NewDocument(strings.Join([]string{
        "I can see Mt. Fuji from here.",
        "St. Michael's Church is on 5th st. near the light."}, " "))

    // Iterate over the doc's sentences:
    sents := doc.Sentences()
    fmt.Println(len(sents)) // 2
    for _, sent := range sents {
        fmt.Println(sent.Text)
        // I can see Mt. Fuji from here.
        // St. Michael's Church is on 5th st. near the light.
    }
}

Tagging

prose includes a tagger based on TextBlob's "fast and accurate" POS tagger. Below is a comparison of its performance against NLTK's implementation of the same tagger on the Treebank corpus:

Library | Accuracy | 5-Run Average (sec)
------- | -------- | -------------------
NLTK    | 0.893    | 7.224
prose   | 0.961    | 2.538

(See scripts/test_model.py for more information.)
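
As a quick illustration (separate from the benchmark above), each token's tag is available via its Tag field; named-entity extraction is disabled here since only tags are needed:

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // Only tagging is needed here, so skip named-entity extraction.
    doc, err := prose.NewDocument(
        "The quick brown fox jumps over the lazy dog.",
        prose.WithExtraction(false))
    if err != nil {
        log.Fatal(err)
    }

    // Print each token with its Penn Treebank tag:
    for _, tok := range doc.Tokens() {
        fmt.Println(tok.Text, tok.Tag)
        // The DT
        // quick JJ
        // ...
    }
}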

The full list of supported POS tags is given below.

TAG  | DESCRIPTION
---- | ------------------------------------------
(    | left round bracket
)    | right round bracket
,    | comma
:    | colon
.    | period
''   | closing quotation mark
``   | opening quotation mark
#    | number sign
$    | currency
CC   | conjunction, coordinating
CD   | cardinal number
DT   | determiner
EX   | existential there
FW   | foreign word
IN   | conjunction, subordinating or preposition
JJ   | adjective
JJR  | adjective, comparative
JJS  | adjective, superlative
LS   | list item marker
MD   | verb, modal auxiliary
NN   | noun, singular or mass
NNP  | noun, proper singular
NNPS | noun, proper plural
NNS  | noun, plural
PDT  | predeterminer
POS  | possessive ending
PRP  | pronoun, personal
PRP$ | pronoun, possessive
RB   | adverb
RBR  | adverb, comparative
RBS  | adverb, superlative
RP   | adverb, particle
SYM  | symbol
TO   | infinitival to
UH   | interjection
VB   | verb, base form
VBD  | verb, past tense
VBG  | verb, gerund or present participle
VBN  | verb, past participle
VBP  | verb, non-3rd person singular present
VBZ  | verb, 3rd person singular present
WDT  | wh-determiner
WP   | wh-pronoun, personal
WP$  | wh-pronoun, possessive
WRB  | wh-adverb

NER

prose v2.0.0 includes a much-improved version of v1.0.0's chunk package, which can identify people (PERSON) and geographical/political entities (GPE) by default.

package main

import (
    "fmt"

    "github.com/jdkato/prose/v2"
)

func main() {
    doc, _ := prose.NewDocument("Lebron James plays basketball in Los Angeles.")
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
        // Lebron James PERSON
        // Los Angeles GPE
    }
}

To make this feature more broadly useful, we've also made it straightforward to train your own models for specific use cases. See Prodigy + prose: Radically efficient machine teaching in Go for a full tutorial; a rough sketch follows.
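
The sketch below assumes the training API described in that tutorial (EntityContext, ModelFromData, UsingEntities, and UsingModel) and an illustrative PRODUCT label; consult the post for the authoritative walkthrough:

package main

import (
    "fmt"
    "log"

    "github.com/jdkato/prose/v2"
)

func main() {
    // A tiny, hand-labeled training set; real use cases need much more data.
    train := []prose.EntityContext{
        {
            Text:   "Windows 10 is an operating system developed by Microsoft.",
            Spans:  []prose.LabeledEntity{{Start: 0, End: 10, Label: "PRODUCT"}},
            Accept: true,
        },
        {
            Text:   "Windows 7 reached end of life in January 2020.",
            Spans:  []prose.LabeledEntity{{Start: 0, End: 9, Label: "PRODUCT"}},
            Accept: true,
        },
    }

    // Train a custom model and use it in place of the default one.
    model := prose.ModelFromData("PRODUCT", prose.UsingEntities(train))
    doc, err := prose.NewDocument(
        "Windows 10 is a personal computer operating system.",
        prose.UsingModel(model))
    if err != nil {
        log.Fatal(err)
    }

    // Entities found by the custom model:
    for _, ent := range doc.Entities() {
        fmt.Println(ent.Text, ent.Label)
    }
}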