Awesome
Aho-Corasick
Implementation of the Aho-Corasick string-search algorithm in Go.
Licensed under MIT License.
Details
This implementation does not use a Double-Array Trie as in my implementation from a couple of years back.
This reduces the build time drastically, but at the cost of higher memory consumption.
The search time is still fast, and comparable to other Go implementations I have found on github that claims to be fast (see performance).
Documentation
Can be found at godoc.org.
Example Usage
Use a TrieBuilder
to build a Trie
:
trie := NewTrieBuilder().
AddStrings([]string{"or", "amet"}).
Build()
Then go and match something interesting:
matches := trie.MatchString("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
fmt.Printf("Got %d matches.\n", len(matches))
// => Got 3 matches.
What did we match?
for _, match := range matches {
fmt.Printf("Matched pattern %d %q at position %d.\n", match.Match(),
match.Pattern(), match.Pos())
}
// => Matched pattern 0 "or" at position 1.
// => Matched pattern 0 "or" at position 15.
// => Matched patterh 1 "amet" at position 22.
Building
You can easily load patterns from file:
builder := NewTrieBuilder()
builder.LoadPatterns("patterns.txt")
builder.LoadStrings("strings.txt")
Both functions expects a text file with one pattern per line. LoadPatterns
expects the pattern to
be in hexadecimal form.
Storing
Use Encode
to store a Trie
in gzip compressed binary format:
f, err := os.Create("trie.gz")
err := Encode(f, trie)
And Decode
to load it from binary format:
f, err := os.Open("trie.gz")
trie, err := Decode(f)
Performance
Some simple benchmarking on my machine (Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz, 32 GiB RAM).
Build and search time grows quite linearly with regards to number of patterns and input text length.
Building
BenchmarkTrieBuild/100-4 7886 154786 ns/op
BenchmarkTrieBuild/1000-4 739 1647419 ns/op
BenchmarkTrieBuild/10000-4 91 13331713 ns/op
BenchmarkTrieBuild/100000-4 9 123886615 ns/op
Searching
BenchmarkMatchIbsen/100-4 1471089 819 ns/op
BenchmarkMatchIbsen/1000-4 202288 5667 ns/op
BenchmarkMatchIbsen/10000-4 19957 59680 ns/op
BenchmarkMatchIbsen/100000-4 2012 595086 ns/op
Compared to Other Implementation
Memory Usage
As mentioned, the memory consumption will be quite high compared to a double-array trie implementation. Especially during the build phase (which currently contains a lot of object allocations).