<div align="center">Parallel Random Access to bzip2 and gzip
</div>

This repository contains the code for the indexed_bzip2 and rapidgzip Python modules.
Both are built upon the same basic architecture to enable block-parallel decoding based on prefetching and caching.
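
To make the idea of block-parallel decoding with prefetching and caching more concrete, here is a minimal sketch using only the Python standard library. It is not the code of either module: `BlockReader` and `compress_blocks` are hypothetical names, and the sketch assumes the data was already split into independently compressed zlib blocks of known uncompressed size. That assumption sidesteps the much harder problem of locating decodable block boundaries inside a real gzip or bzip2 stream.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(data, block_size):
    """Hypothetical helper: compress fixed-size slices of `data` independently."""
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]


class BlockReader:
    """Toy random-access reader over independently compressed blocks."""

    def __init__(self, blocks, block_size, prefetch=2, workers=4):
        self.blocks = blocks          # compressed bytes, one entry per block
        self.block_size = block_size  # uncompressed size of all but the last block
        self.prefetch = prefetch      # number of blocks to decode ahead of a read
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.cache = {}               # block index -> Future with decoded bytes

    def _request(self, index):
        # Submit a decode job only once; later requests reuse the cached future.
        # The cache is unbounded here; a real reader would evict old blocks.
        if index not in self.cache:
            self.cache[index] = self.pool.submit(zlib.decompress, self.blocks[index])
        return self.cache[index]

    def read(self, offset, size):
        """Return up to `size` decompressed bytes starting at `offset`."""
        out = bytearray()
        while size > 0:
            index, start = divmod(offset, self.block_size)
            if index >= len(self.blocks):
                break
            # Queue the following blocks before waiting on the current one.
            last = min(index + 1 + self.prefetch, len(self.blocks))
            for ahead in range(index + 1, last):
                self._request(ahead)
            chunk = self._request(index).result()[start:start + size]
            if not chunk:
                break
            out += chunk
            offset += len(chunk)
            size -= len(chunk)
        return bytes(out)
```

A read that lands in block `i` waits only for that block, while the next `prefetch` blocks are already being decoded on other threads; repeated or sequential reads are then served from the cache.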
<div align="center">rapidgzip
</div>

This module provides:
- a rapidgzip command line tool for parallel decompression of gzip files, with a command line interface similar to gzip so that it can be used as a replacement.
- a rapidgzip.open Python method for reading and seeking inside gzip files using multiple threads, for a speedup of about 21 over the built-in gzip module on a 12-core processor.

The random seeking support is similar to that of indexed_gzip. The parallel capabilities are effectively a working version of pugz, which is only a proof of concept and works only with a limited subset of file contents, namely compressed files whose contents are non-binary (ASCII characters 0 to 127).

| Module | Bandwidth / (MB/s) | Speedup |
|---|---|---|
| gzip | 250 | 1 |
| rapidgzip with parallelization = 1 | 488 | 1.9 |
| rapidgzip with parallelization = 2 | 902 | 3.6 |
| rapidgzip with parallelization = 12 | 4463 | 17.7 |
| rapidgzip with parallelization = 24 | 5240 | 20.8 |

See here for the extended Readme.
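
As a rough illustration of the `rapidgzip.open` interface described above, the following sketch reads a byte range from the decompressed stream of a gzip file. It assumes the `rapidgzip` package is installed and that `rapidgzip.open` accepts a `parallelization` argument; if the import fails, it falls back to the built-in `gzip` module. The helper name `read_range` is made up for this example.

```python
import gzip
import os

try:
    import rapidgzip  # third-party package, assumed installed with: pip install rapidgzip
except ImportError:
    rapidgzip = None  # the sketch then falls back to the slower built-in gzip module


def read_range(path, offset, size, threads=None):
    """Hypothetical helper: return `size` decompressed bytes starting at `offset`."""
    threads = threads or os.cpu_count() or 1
    if rapidgzip is not None:
        file = rapidgzip.open(path, parallelization=threads)
    else:
        file = gzip.open(path, "rb")
    with file:
        file.seek(offset)  # offsets refer to the decompressed stream
        return file.read(size)
```

Both branches return a file-like object with `seek` and `read`, so calling code does not need to know which decoder is in use.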
There also exists a dedicated repository for rapidgzip here. It was created for visibility reasons and in order to keep indexed_bzip2 and rapidgzip releases separate. The main development will take place in this repository while the rapidgzip repository will be updated at least for each release. Issues regarding rapidgzip should be opened at its repository.
A paper describing the implementation details and showing the scaling behavior with up to 128 cores has been accepted at ACM HPDC'23, The 32nd International Symposium on High-Performance Parallel and Distributed Computing. If you use this software for a scientific publication, please cite it as stated here. The author's version can be found here and the accompanying presentation here.
<div align="center">indexed_bzip2
</div>

This module provides:
- an ibzip2 command line tool to decompress bzip2 files in parallel, with a command line interface similar to bzip2 so that it can be used as a replacement.
- an ibzip2.open Python method for reading and seeking inside bzip2 files using multiple threads, for a speedup of 6 over the built-in bz2 module on a 12-core processor.

The parallel decompression capabilities are similar to lbzip2, but with a more permissive license. The module can also be used as a library, with random seeking capabilities similar to seek-bzip2.

| Module | Runtime / s | Bandwidth / (MB/s) | Speedup |
|---|---|---|---|
| bz2 | 386 | 5.2 | 1 |
| indexed_bzip2 with parallelization = 1 | 472 | 4.2 | 0.8 |
| indexed_bzip2 with parallelization = 2 | 265 | 7.6 | 1.5 |
| indexed_bzip2 with parallelization = 12 | 64 | 31.4 | 6.1 |
| indexed_bzip2 with parallelization = 24 | 63 | 31.8 | 6.1 |

See here for the extended Readme.
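
Sequential streaming is the other common use of the Python interface. The sketch below hashes the decompressed contents of a bzip2 file chunk by chunk. It assumes the `indexed_bzip2` package is installed and that `indexed_bzip2.open` accepts a `parallelization` argument; without the package it falls back to the built-in `bz2` module. The helper name `sha256_of_bz2` and the chunk size are arbitrary choices for this example.

```python
import bz2
import hashlib
import os

try:
    import indexed_bzip2  # third-party package, assumed installed with: pip install indexed_bzip2
except ImportError:
    indexed_bzip2 = None  # the sketch then falls back to the built-in bz2 module


def sha256_of_bz2(path, threads=None, chunk_size=4 * 1024 * 1024):
    """Hypothetical helper: SHA-256 hex digest of the decompressed contents."""
    threads = threads or os.cpu_count() or 1
    if indexed_bzip2 is not None:
        file = indexed_bzip2.open(path, parallelization=threads)
    else:
        file = bz2.open(path, "rb")
    digest = hashlib.sha256()
    with file:
        while True:
            chunk = file.read(chunk_size)  # an empty result marks end of stream
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()
```

Reading in fixed-size chunks keeps memory use bounded regardless of the decompressed size.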
License
Licensed under either of
- Apache License, Version 2.0, (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Contribution
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.