Home

Awesome

pugz

Parallel decompression of gzipped text files.

Decompresses text files with a truly parallel algorithm in two passes. (paper for details)

Getting Started

A Linux system on a recent x86_64 CPU is required.

Installing

Type:

make

For maximal performance, disable assertions with:

make asserts=0

Usage

./gunzip -t 8 file.gz

Counting lines is incredibly faster, because there is no thread synchronization:

./gunzip -l -t 8 file.gz

Test

We provide a small example:

cd example
bash test.sh

Decompression speed benchmark

Speed is measured in input compressed bytes processed per second.

Threadsgunzippugz, full decompressionpugz, only counting lines
155 MB/s147 MB/s145 MB/s
3-205 MB/s291 MB/s
6-228 MB/s515 MB/s
12-248 MB/s769 MB/s
24-251 MB/s1052 MB/s

Specs: dual Xeon X5675 (2x 6C/12T), 32 GB RAM NUMA, SSD

On a more recent desktop computer (i7-4770 4C/8T, 16 GB RAM, SSD):

Threadsgunzippugz, full decompressionpugz, only counting lines
163 MB/s172 MB/s170 MB/s
8-376 MB/s565 MB/s

Tested on a 2.7 GB compressed file, with similar results on a 24 GB file.

Script: https://github.com/Piezoid/pugz/blob/master/example/bigger_benchmark.sh

Algorithm overview

Contrary to the pigz program which does single-threaded decompression (see https://github.com/madler/pigz/blob/master/pigz.c#L232), pugz found a way to do truly parallel decompression. In a nutshell: the compressed file is splitted into consecutive sections, processed one after the other. Sections are in turn splitted into chunks (one chunk per thread) and will be decompressed in parallel. A first pass decompresses chunks and keeps track of back-references (see e.g. our paper for the definition of that term), but is unable to resolve them. Then, a quick sequential pass is done to resolve the contexts of all chunks. A final parallel pass translates all unresolved back-references and outputs the file.

Roadmap/TODOs

This is a prototype for proof of concept, so expect some rough corners.

If pugz chokes on some of your large files that you are willing to share, please fill a issue !

License

This project is licensed under the MIT License.

Citation

@inproceedings{pugz,
author = {Kerbiriou, Mael and Chikhi, Rayan},
year = {2019},
month = {05},
pages = {209-217},
title = {Parallel Decompression of Gzip-Compressed Files and Random Access to DNA Sequences},
doi = {10.1109/IPDPSW.2019.00042},
booktitle={2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)}
}

Acknowledgements

ebiggers for writing libdeflate