minz: A minimal compressor

minz is a minimal string compressor based on the paper FSST: Fast Random Access String Compression.

The compressed format is very simple: it uses a pre-computed dictionary of 255 entries, each entry being a word of at most 8 bytes. Each byte from 0x00 to 0xFE emits the corresponding word from the dictionary, while byte 0xFF is an escape code which emits the next byte as-is.

Example: If the dictionary contains 0x00 = hello and 0x01 = world, then 0x00 0xFF 0x20 0x01 0xFF 0x21 (six bytes) decompresses into hello world!.
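The format described above can be sketched in a few lines of Python. This is only an illustration of the byte layout, not the minz Zig API: the names `decompress`, `compress` and `ESCAPE` are made up for this example, and the greedy longest-match encoder is an assumption about how one might pick codes, not necessarily what minz does.

```python
ESCAPE = 0xFF  # escape byte: the byte that follows is copied as-is


def decompress(data: bytes, dictionary: list[bytes]) -> bytes:
    """Expand codes 0x00..0xFE via the dictionary; 0xFF escapes one literal byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        code = data[i]
        if code == ESCAPE:
            if i + 1 >= len(data):
                raise ValueError("truncated escape sequence")
            out.append(data[i + 1])
            i += 2
        else:
            out += dictionary[code]
            i += 1
    return bytes(out)


def compress(text: bytes, dictionary: list[bytes]) -> bytes:
    """Greedy longest-match encoder (an assumption, not necessarily minz's strategy)."""
    out = bytearray()
    i = 0
    while i < len(text):
        best_code, best_len = None, 0
        for code, word in enumerate(dictionary):
            if len(word) > best_len and text.startswith(word, i):
                best_code, best_len = code, len(word)
        if best_code is None:
            # No dictionary word matches here: escape a single literal byte.
            out += bytes([ESCAPE, text[i]])
            i += 1
        else:
            out.append(best_code)
            i += best_len
    return bytes(out)


# The example from the text: 0x00 = hello, 0x01 = world.
dictionary = [b"hello", b"world"]
encoded = bytes([0x00, 0xFF, 0x20, 0x01, 0xFF, 0x21])
decoded = decompress(encoded, dictionary)
```

Note that an escaped byte costs two bytes in the output, so text that the dictionary covers poorly can grow to at most twice its original size.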

This has the following characteristics:

Usage

minz is currently provided as a library in Zig. There's no documentation, so you'll have to look at the public functions and test cases.

There's also a small command-line tool which reads in a file, trains a dictionary (from 1% of the lines), compresses each line separately, and then reports the total ratio:

$ zig build
$ ./zig-out/bin/line-compressor access.log
Reading file: access.log
Read 689253 lines.
Training...
Compressing...
Uncompressed: 135114557
Compressed:   46209436
Ratio: 2.9239603140795745
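The reported ratio is simply the uncompressed size divided by the compressed size. Here is a small Python sketch of how such a per-line total could be computed. The function `compression_ratio` and its `compress_line` parameter are hypothetical names for this example, and whether the real tool counts line terminators is not something this sketch tries to reproduce.

```python
from typing import Callable, Iterable


def compression_ratio(
    lines: Iterable[bytes],
    compress_line: Callable[[bytes], bytes],  # hypothetical per-line compressor
) -> tuple[int, int, float]:
    """Compress each line separately and return (uncompressed, compressed, ratio)."""
    uncompressed = 0
    compressed = 0
    for line in lines:
        uncompressed += len(line)
        compressed += len(compress_line(line))
    return uncompressed, compressed, uncompressed / compressed


# Sanity check against the totals printed in the session above.
reported_ratio = 135114557 / 46209436

# Toy usage: a stand-in "compressor" that keeps only the first half of each line.
totals = compression_ratio([b"abcd", b"efghij"], lambda line: line[: len(line) // 2])
```

Compressing each line separately is what keeps individual lines randomly accessible; the dictionary is the only state shared between them.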

Current status

This is just a learning project that I wrote to understand the algorithm in the paper. It's not being used in any production systems, and I'm not actively developing it.

In addition, the paper is a bit vague on the exact details of the dictionary-training algorithm. There is some choice in how you combine symbols, and right now the dictionaries minz produces don't look "optimal" on manual inspection. If you intend to use this for a "real" project, you'll probably have to invest some more time.

Roadmap / pending work