Home

Awesome

Ngram Indexing

GitHub license GoDoc Go

This package provide a simple way to "Nrigram Indexing" in input document. It is refer from an article - Google Code Search.

Here is the introduction what is "trigram indexing" and how Google Code Search use it for search but it is in Chinese :) .

Performance Optimization

This package base on my another project Trigram, but it got better performance(~3 times). (refer Benchmark)

It has done as follow:

How it works

This package using trigram indexing to get all trigram in input string (what we call document).

Here is some trigram rule as follow:

Install

go get github.com/kkdai/ngram

Usage


package main

import (
	"fmt"
	. "github.com/kkdai/ngram"
	)
func main() {	
	//Currently is support Twogram, Trigram and Fourgram
	ti := NewNgramIndex(Trigram)
	ti.Add("Code is my life")			//doc 1
	ti.Add("Search")						//doc 2
	ti.Add("I write a lot of Codes") //doc 3
	
	//Print all trigram map 
	fmt.Println("It has ", len(ti.TrigramMap))
	for k, v := range ti.TrigramMap {
		fmt.Println("trigram=", k, " obj=", v)
	}

	//Search which doc include this code
	ret := ti.Query("Code")
	fmt.Println("Query ret=", ret)
	// [1, 3]
}

Benchmark

Still working to improve the query time.

//Original benchmark in trigram
BenchmarkAddTwogram-4    	  200000	      7151 ns/op
BenchmarkAddTrigram-4    	  300000	      6713 ns/op
BenchmarkAddFourgran-4   	  300000	      5813 ns/op
BenchmarkDeleteTwogram-4 	  500000	      4591 ns/op
BenchmarkDeleteTrigram-4 	  500000	      3695 ns/op
BenchmarkDeleteFourgram-4	  500000	      3297 ns/op
BenchmarkQueryTwogran-4  	   10000	   8361813 ns/op
BenchmarkQueryTrigran-4  	   10000	   7650419 ns/op
BenchmarkQueryFourgram-4 	   10000	   6975925 ns/op


//Optimize result
BenchmarkAddTwogram-4    	  300000	      5737 ns/op
BenchmarkAddTrigram-4    	  500000	      4795 ns/op
BenchmarkAddFourgran-4   	  500000	      4158 ns/op
BenchmarkDeleteTwogram-4 	   20000	    167246 ns/op
BenchmarkDeleteTrigram-4 	   20000	    148756 ns/op
BenchmarkDeleteFourgram-4	   20000	    128022 ns/op
BenchmarkQueryTwogran-4  	   10000	   2461910 ns/op
BenchmarkQueryTrigran-4  	   10000	   2276625 ns/op
BenchmarkQueryFourgram-4 	   10000	   2172323 ns/op

BTW: Here is benchmark for https://github.com/dgryski/go-trigram for my improvement record:

BenchmarkAdd-4       1000000          1063 ns/op
BenchmarkDelete-4     100000        140392 ns/op
BenchmarkQuery-4       10000        474320 ns/op

Inspired

Project52

It is one of my project 52.

License

This package is licensed under MIT license. See LICENSE for details.