<div align="center">

Parallel Random Access to bzip2 and gzip

</div>

This repository contains the code for the indexed_bzip2 and rapidgzip Python modules. Both are built upon the same basic architecture to enable block-parallel decoding based on prefetching and caching.
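The idea of block-parallel decoding with prefetching and caching can be illustrated with a toy model. The sketch below is not the code of either module: it assumes data that was split into fixed-size chunks and compressed chunk by chunk with `zlib`, which sidesteps the hard part of locating block boundaries inside a real gzip or bzip2 stream. The names `BLOCK_SIZE`, `compress_in_blocks`, and `BlockParallelReader` are hypothetical and exist only for this illustration.

```python
import zlib
from collections import OrderedDict
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block (toy assumption)


def compress_in_blocks(data):
    """Compress each fixed-size chunk independently so blocks decode in any order."""
    return [zlib.compress(data[i:i + BLOCK_SIZE]) for i in range(0, len(data), BLOCK_SIZE)]


class BlockParallelReader:
    """Toy reader: decodes blocks on a thread pool, prefetches ahead, caches results."""

    def __init__(self, blocks, parallelization=4, cache_size=16):
        self._blocks = blocks
        self._prefetch_count = parallelization
        self._cache_size = cache_size
        self._cache = OrderedDict()  # block index -> Future, least recently used first
        self._pool = ThreadPoolExecutor(max_workers=parallelization)

    def _schedule(self, index):
        """Return the (possibly still running) decode job for a block."""
        future = self._cache.get(index)
        if future is None:
            future = self._pool.submit(zlib.decompress, self._blocks[index])
            self._cache[index] = future
        self._cache.move_to_end(index)
        while len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict the least recently used block
        return future

    def get_block(self, index):
        """Decode the requested block and start decoding the blocks after it."""
        future = self._schedule(index)
        last = min(index + 1 + self._prefetch_count, len(self._blocks))
        for next_index in range(index + 1, last):
            self._schedule(next_index)
        return future.result()

    def read_at(self, offset, size):
        """Random access: return up to `size` bytes starting at an uncompressed offset."""
        result = bytearray()
        while size > 0:
            index, skip = divmod(offset, BLOCK_SIZE)
            if index >= len(self._blocks):
                break
            chunk = self.get_block(index)[skip:skip + size]
            if not chunk:
                break
            result += chunk
            offset += len(chunk)
            size -= len(chunk)
        return bytes(result)

    def close(self):
        self._pool.shutdown()


data = bytes(range(256)) * 4096  # 1 MiB of sample data
reader = BlockParallelReader(compress_in_blocks(data), parallelization=4)
print(reader.read_at(100_000, 16) == data[100_000:100_016])
reader.close()
```

A sequential reader benefits because the blocks following the current one are already being decoded when they are requested, and a reader that seeks back and forth benefits from the cache.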

<div align="center">

rapidgzip

</div>

This module provides:

The random seeking support is similar to that of indexed_gzip. The parallel decompression is effectively a working version of pugz, which is only a proof of concept and only handles a limited subset of files, namely those whose contents are non-binary (ASCII characters 0 to 127).

| Module | Bandwidth / (MB/s) | Speedup |
|---|---|---|
| gzip | 250 | 1 |
| rapidgzip with parallelization = 1 | 488 | 1.9 |
| rapidgzip with parallelization = 2 | 902 | 3.6 |
| rapidgzip with parallelization = 12 | 4463 | 17.7 |
| rapidgzip with parallelization = 24 | 5240 | 20.8 |
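As a usage sketch of the seeking and parallel decompression this module offers, the following reads a byte range from the middle of a gzip file. It assumes that `rapidgzip.open` accepts a `parallelization` argument, presumably the setting varied in the table above, and returns a file object with `seek` and `read`. The helpers `open_gzip` and `read_range` are hypothetical names for this example, and the code falls back to the serial standard-library `gzip` module when rapidgzip is not installed.

```python
import gzip
import os
import tempfile

try:
    import rapidgzip  # third-party, e.g. installed with: pip install rapidgzip
except ImportError:
    rapidgzip = None  # fall back to the standard library below


def open_gzip(path):
    """Open a gzip file for reading, preferring the parallel decoder if present."""
    if rapidgzip is not None:
        return rapidgzip.open(path, parallelization=os.cpu_count() or 1)
    return gzip.open(path, "rb")  # serial, but offers the same seek/read calls


def read_range(path, offset, size):
    """Seek to an offset in the decompressed stream and read `size` bytes."""
    with open_gzip(path) as file:
        file.seek(offset)
        return file.read(size)


# Demo on a small temporary file written with the standard gzip module.
payload = b"".join(b"line %d\n" % i for i in range(100_000))
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.gz")
    with gzip.open(path, "wb") as file:
        file.write(payload)
    assert read_range(path, 500_000, 64) == payload[500_000:500_064]
```

A file this small will not show the bandwidths from the table; the point is only that the calling code looks the same as with the `gzip` module.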

See here for the extended Readme.

There also exists a dedicated repository for rapidgzip here. It was created for visibility reasons and in order to keep indexed_bzip2 and rapidgzip releases separate. The main development will take place in this repository while the rapidgzip repository will be updated at least for each release. Issues regarding rapidgzip should be opened at its repository.

A paper describing the implementation details and showing the scaling behavior with up to 128 cores was accepted at ACM HPDC'23, the 32nd International Symposium on High-Performance Parallel and Distributed Computing. If you use this software for a scientific publication, please cite it as stated here. The author's version can be found here and the accompanying presentation here.

<div align="center">

indexed_bzip2

</div>

This module provides:

The parallel decompression capabilities are similar to those of lbzip2, but with a more permissive license and usable as a library, with random seeking capabilities similar to seek-bzip2.

| Module | Runtime / s | Bandwidth / (MB/s) | Speedup |
|---|---|---|---|
| bz2 | 386 | 5.2 | 1 |
| indexed_bzip2 with parallelization = 1 | 472 | 4.2 | 0.8 |
| indexed_bzip2 with parallelization = 2 | 265 | 7.6 | 1.5 |
| indexed_bzip2 with parallelization = 12 | 64 | 31.4 | 6.1 |
| indexed_bzip2 with parallelization = 24 | 63 | 31.8 | 6.1 |
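The library use with random seeking mentioned above might look like the following sketch, which reads several ranges in arbitrary order from one open file. It assumes that `indexed_bzip2.open` accepts a `parallelization` argument and returns a file object with `seek` and `read`. The helpers `open_bzip2` and `read_ranges` are hypothetical names for this example, and the code falls back to the standard-library `bz2` module when indexed_bzip2 is not installed.

```python
import bz2
import os
import tempfile

try:
    import indexed_bzip2  # third-party, e.g. installed with: pip install indexed_bzip2
except ImportError:
    indexed_bzip2 = None  # fall back to the standard library below


def open_bzip2(path):
    """Open a bzip2 file for reading, preferring the parallel decoder if present."""
    if indexed_bzip2 is not None:
        return indexed_bzip2.open(path, parallelization=os.cpu_count() or 1)
    return bz2.open(path, "rb")  # serial, but offers the same seek/read calls


def read_ranges(path, ranges):
    """Read several (offset, size) ranges, in any order, from one open file."""
    with open_bzip2(path) as file:
        results = []
        for offset, size in ranges:
            file.seek(offset)
            results.append(file.read(size))
        return results


# Demo on a small temporary file written with the standard bz2 module.
payload = b"".join(b"record %d\n" % i for i in range(100_000))
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.bz2")
    with bz2.open(path, "wb") as file:
        file.write(payload)
    tail, head = read_ranges(path, [(len(payload) - 32, 32), (0, 32)])
    assert tail == payload[-32:] and head == payload[:32]
```

The second range lies before the first one, so the demo includes a backward seek, the access pattern where seeking support matters most.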

See here for the extended Readme.

License

Licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.