Home

Awesome

<!-- based on the a great readme template https://gist.github.com/PurpleBooth/109311bb0361f32d87a2 -->

Slim - surprisingly space efficient data types in Golang

Travis test

Report card Coverage Status

GoDoc PkgGoDev Sourcegraph

Slim is collection of surprisingly space efficient data types, with corresponding serialization APIs to persisting them on-disk or for transport.

<!-- START doctoc generated TOC please keep comment here to allow auto update --> <!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --> <!-- END doctoc generated TOC please keep comment here to allow auto update -->

Why slim

As data on internet keeps increasing exponentially, the capacity gap between memory and disk becomes greater.

Most of the time, a data itself does not need to be loaded into expensive main memory. Only the much more important information, WHERE-A-DATA-IS, deserve a seat in main memory.

This is what slim does, keeps as little information as possible in main memory, as a minimized index of huge amount external data.

Performance and memory overhead

The data struct in this benchmark is a slice of key-value pairs with a SlimTrie serving as the index. The slim itself is built in the filter mode, to maximize memory reduction and performance. The whole struct slimKV is a fully functional kv-store, just like a static btree.

type slimKV struct {
    slim *trie.SlimTrie
    Elts []*KVElt
}
type KVElt struct {
    Key string
    Val int32
}

You can find the benchmark code in benchmark;

Read more about Performance

Synopsis

1. Index on-disk key-values

One of the typical usages of slim is to index serialized data on disk(e.g., key value records in a SSTable). By keeping a slim in memory, one can quickly find the on-disk offset of the record by a key.

<details> <summary>Show me the code ......</summary>
package index_test

import (
	"fmt"
	"strings"

	"github.com/openacid/slim/index"
)

type Data string

func (d Data) Read(offset int64, key string) (string, bool) {
	kv := strings.Split(string(d)[offset:], ",")[0:2]
	if kv[0] == key {
		return kv[1], true
	}
	return "", false
}

func Example() {

	// Accelerate external data accessing (in memory or on disk) by indexing
	// them with a SlimTrie:

	// `data` is a sample of some unindexed data. In our example it is a comma
	// separated key value series.
	//
	// In order to let SlimTrie be able to read data, `data` should have
	// a `Read` method:
	//     Read(offset int64, key string) (string, bool)
	data := Data("Aaron,1,Agatha,1,Al,2,Albert,3,Alexander,5,Alison,8")

	// keyOffsets is a prebuilt index that stores key and its offset in data accordingly.
	keyOffsets := []index.OffsetIndexItem{
		{Key: "Aaron", Offset: 0},
		{Key: "Agatha", Offset: 8},
		{Key: "Al", Offset: 17},
		{Key: "Albert", Offset: 22},
		{Key: "Alexander", Offset: 31},
		{Key: "Alison", Offset: 43},
	}

	// `SlimIndex` is simply a container of SlimTrie and its data.
	st, err := index.NewSlimIndex(keyOffsets, data)
	if err != nil {
		fmt.Println(err)
	}

	// Lookup
	v, found := st.Get("Alison")
	fmt.Printf("key: %q\n  found: %t\n  value: %q\n", "Alison", found, v)

	v, found = st.Get("foo")
	fmt.Printf("key: %q\n  found: %t\n  value: %q\n", "foo", found, v)

	// Output:
	// key: "Alison"
	//   found: true
	//   value: "8"
	// key: "foo"
	//   found: false
	//   value: ""
}
</details>

2. Sparse index

Create an index item for every 4(or more as you wish) keys.

Let several adjacent keys share one index item reduces a lot memory cost if there are huge amount keys in external data. Such as to index billions of 4KB objects on a 4TB disk(because one disk IO costs 20ms for either reading 4KB or reading 1MB).

<details> <summary>Show me the code ......</summary>
package index_test

import (
	"fmt"
	"strings"

	"github.com/openacid/slim/index"
)

type RangeData string

func (d RangeData) Read(offset int64, key string) (string, bool) {
	for i := 0; i < 4; i++ {
		if int(offset) >= len(d) {
			break
		}

		kv := strings.Split(string(d)[offset:], ",")[0:2]
		if kv[0] == key {
			return kv[1], true
		}
		offset += int64(len(kv[0]) + len(kv[1]) + 2)

	}
	return "", false
}

func Example_indexRanges() {

	// Index ranges instead of keys:
	// In this example at most 4 keys shares one index item.

	data := RangeData("Aaron,1,Agatha,1,Al,2,Albert,3,Alexander,5,Alison,8")

	// keyOffsets is a prebuilt index that stores range start, range end and its offset.
	keyOffsets := []index.OffsetIndexItem{
		// Aaron  +--> 0
		// Agatha |
		// Al     |
		// Albert |

		// Alexander +--> 31
		// Alison    |

		{Key: "Aaron", Offset: 0},
		{Key: "Agatha", Offset: 0},
		{Key: "Al", Offset: 0},
		{Key: "Albert", Offset: 0},

		{Key: "Alexander", Offset: 31},
		{Key: "Alison", Offset: 31},
	}

	st, err := index.NewSlimIndex(keyOffsets, data)
	if err != nil {
		panic(err)
	}

	v, found := st.RangeGet("Aaron")
	fmt.Printf("key: %q\n  found: %t\n  value: %q\n", "Aaron", found, v)

	v, found = st.RangeGet("Al")
	fmt.Printf("key: %q\n  found: %t\n  value: %q\n", "Al", found, v)

	v, found = st.RangeGet("foo")
	fmt.Printf("key: %q\n  found: %t\n  value: %q\n", "foo", found, v)

	// Output:
	// key: "Aaron"
	//   found: true
	//   value: "1"
	// key: "Al"
	//   found: true
	//   value: "2"
	// key: "foo"
	//   found: false
	//   value: ""
}
</details>

3. Range scan

Slim can also be used as a traditional in-memory kv-store: Building a slim with Opt{ Complete: Bool(true) }, it won't strip out any information(e.g., it won't eliminate single-branch labels) and it will functions the same as a btree. This snippet shows how to iterate key values.

<details> <summary>Show me the code ......</summary>
package trie

import (
	"fmt"

	"github.com/openacid/slim/encode"
)

func ExampleSlimTrie_ScanFrom() {
	var keys = []string{
		"",
		"`",
		"a",
		"ab",
		"abc",
		"abca",
		"abcd",
		"abcd1",
		"abce",
		"be",
		"c",
		"cde0",
		"d",
	}
	values := makeI32s(len(keys))

	codec := encode.I32{}
	st, _ := NewSlimTrie(codec, keys, values, Opt{
		Complete: Bool(true),
	})

	// untilD stops when encountering "d".
	untilD := func(k, v []byte) bool {
		if string(k) == "d" {
			return false
		}

		_, i32 := codec.Decode(v)
		fmt.Println(string(k), i32)
		return true
	}

	fmt.Println("scan (ab, +∞):")
	st.ScanFrom("ab", false, true, untilD)

	fmt.Println()
	fmt.Println("scan [be, +∞):")
	st.ScanFrom("be", true, true, untilD)

	fmt.Println()
	fmt.Println("scan (ab, be):")
	st.ScanFromTo(
		"ab", false,
		"be", false,
		true, untilD)

	// Output:
	//
	// scan (ab, +∞):
	// abc 4
	// abca 5
	// abcd 6
	// abcd1 7
	// abce 8
	// be 9
	// c 10
	// cde0 11
	//
	// scan [be, +∞):
	// be 9
	// c 10
	// cde0 11
	//
	// scan (ab, be):
	// abc 4
	// abca 5
	// abcd 6
	// abcd1 7
	// abce 8
}
</details>

Filter mode and KV mode.

Slim can be built into either a filter(like bloom filter but with key order preserved.) or a real kv-store(like btree) There is an option in NewSlimTrie(..., option) to control the building behavior. Ref: Opt

The memory consumption in filter mode and kv mode differs significantly. The following chart shows memory consumption by 1 million var-length string, 10 to 20 byte in different mode:

-sizegzip-size
sample data size15.0M14.0M
Complete:true14.0M10.0M
InnerPrefix:ture1.3M0.9M
all false1.3M0.8M
<!-- ## FAQ -->

Try it

Install

go get github.com/openacid/slim/trie

Change-log: Change-log

Versions

A newer version y being compatible with an older version x means y can load data serialized by x. But x should never try to load data serialized by a newer version y.

<!-- TODO add FAQ --> <!-- TODO add serialization explanation, on-disk data structure etc. -->

Who are using slim

<span> <span> </span> <span> baishancloud </span> </span>

<!-- ## Slim internal --> <!-- ### Built With --> <!-- - [protobuf][] - Define on-disk data-structure and serialization engine. --> <!-- - [dep][] - Dependency Management. --> <!-- - [semver][] - For versioning data-structure. --> <!-- ### Directory Layout --> <!-- We follow the: [golang-standards-project-layout][]. --> <!-- [> TODO read the doc and add more standards <] --> <!-- - `vendor/`: dependency packages. --> <!-- - `prototype/`: on-disk data-structure. --> <!-- - `docs/`: documents about design, trade-off, etc --> <!-- - `tools/`: documents about design, trade-off, etc --> <!-- - `expamples/`: documents about design, trade-off, etc --> <!-- Other directories are sub-package. --> <!-- ### Versioning --> <!-- We use [SemVer](http://semver.org/) for versioning. --> <!-- For the versions available, see the [tags on this repository](https://github.com/your/project/tags). --> <!-- ### Data structure explained --> <!-- [> TODO <] --> <!-- ## Limitation --> <!-- [> TODO <] --> <!-- - [ ] bitrie: 1 byte-per-key implementation. --> <!-- - [ ] balanced bitrie: which gives better worst-case performance. --> <!-- - [ ] generalised API as a drop-in replacement for map etc. -->

Feedback and contributions

Feedback and Contributions are greatly appreciated.

At this stage, the maintainers are most interested in feedback centered on:

Let us know by filing an issue, describing what you did or wanted to do, what you expected to happen, and what actually happened:

Or other type of issue.

<!-- ## Contributing --> <!-- The maintainers actively manage the issues list, and try to highlight issues --> <!-- suitable for newcomers. --> <!-- [> TODO dep CONTRIBUTING <] --> <!-- The project follows the typical GitHub pull request model. See CONTRIBUTING.md for more details. --> <!-- Before starting any work, please either comment on an existing issue, --> <!-- or file a new one. --> <!-- [> TODO <] --> <!-- Please read [CONTRIBUTING.md][] --> <!-- for details on our code of conduct, and the process for submitting pull requests to us. --> <!-- https://gist.github.com/PurpleBooth/b24679402957c63ec426 --> <!-- ### Code style --> <!-- ### Tool chain --> <!-- ### Customized install --> <!-- Alternatively, if you have a customized go develop environment, you could also --> <!-- clone it: --> <!-- ```sh --> <!-- git clone git@github.com:openacid/slim.git --> <!-- ``` --> <!-- As a final step you'd like have a test to see if everything goes well: --> <!-- ```sh --> <!-- cd path/to/slim/build/pseudo-gopath --> <!-- export GOPATH=$(pwd) --> <!-- go test github.com/openacid/slim/array --> <!-- ``` --> <!-- Another reason to have a `pseudo-gopath` in it is that some tool have their --> <!-- own way conducting source code tree. --> <!-- E.g. [git-worktree](https://git-scm.com/docs/git-worktree) --> <!-- checkouts source code into another dir other than the GOPATH work space. --> <!-- ## Update dependency --> <!-- Dependencies are tracked by [dep](https://github.com/golang/dep). --> <!-- All dependencies are kept in `vendor/` dir thus you do not need to do anything --> <!-- to run it. --> <!-- You need to update dependency only when you bring in new feature with other dependency. --> <!-- - Install `dep` --> <!-- ``` --> <!-- curl https://raw.githubusercontent.com/golang/dep/master/install.sh | sh --> <!-- ``` --> <!-- - Download dependency --> <!-- ``` --> <!-- dep ensure --> <!-- ``` --> <!-- > dep uses Gopkg.toml Gopkg.lock to track dependency info. --> <!-- > --> <!-- > Gopkg.toml Gopkg.lock is created with `dep init`. --> <!-- > --> <!-- > dep creates a `vendor` dir to have all dependency package there. --> <!-- See more: [dep-install][] -->

Authors

<!-- ordered by unicode of author's name --> <!-- leave 3 to 5 major jobs you have done in this project -->

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE file for details.

<!-- ## Acknowledgments --> <!-- [> TODO <] --> <!-- - Hat tip to anyone whose code was used --> <!-- - Inspiration --> <!-- patricial tree --> <!-- fusion tree --> <!-- critic trie --> <!-- - etc --> <!-- links --> <!-- Bio --> <!-- avatar --> <!-- issue links --> <!-- benchmark --> <!-- links to other resource --> <!-- reference -->