<div style="text-align:center"> <img alt="Pco logo: a pico-scale, compressed version of the Pyramid of Khafre in the palm of your hand" src="images/logo.svg" width="160px"> </div>

Pcodec

<div style="text-align:center"> <img alt="bar charts showing better compression for Pco than zstd parquet or blosc" src="images/real_world_compression_ratio.svg" width="700px" > </div>

Pcodec (or Pco, pronounced "pico") losslessly compresses and decompresses numerical sequences with a high compression ratio and moderately fast speed.

Use cases include:

Data types: u16, u32, u64, i16, i32, i64, f16, f32, f64

Get Started

Use the CLI (also supports benchmarking)

Use the Rust API

Use the Python API

How is Pco so much better than alternatives?

Pco is designed specifically for numerical data, whereas alternatives rely on general-purpose (LZ) compressors that were designed for string or binary data. Pco uses a holistic, 3-step approach:

These 3 steps cohesively capture most entropy of numerical data without waste.

In contrast, LZ compressors are only effective for patterns like repeating exact sequences of numbers. Such patterns constitute just a small fraction of most numerical data's entropy.
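To make the point about LZ compressors concrete, here is a minimal sketch using only the Python standard library. It does not use Pco or reproduce its algorithm. It compares zlib (an LZ-based compressor) on a smooth sequence with no exactly repeating values against zlib on the same sequence after a simple numerical transform. Delta encoding is chosen purely for illustration, and the helper names `pack_i64` and `deltas` are hypothetical.

```python
import struct
import zlib


def pack_i64(nums):
    """Serialize integers as little-endian signed 64-bit values."""
    return struct.pack(f"<{len(nums)}q", *nums)


def deltas(nums):
    """Keep the first value, then store each consecutive difference."""
    return [nums[0]] + [b - a for a, b in zip(nums, nums[1:])]


# A smooth sequence in which no value ever repeats exactly.
nums = [1000 + 7 * i for i in range(10_000)]

# zlib sees the raw bytes; it has no notion that these are numbers.
raw_size = len(zlib.compress(pack_i64(nums), 9))

# After the (illustrative) delta transform, the bytes repeat exactly,
# which is the kind of pattern an LZ compressor can exploit.
delta_size = len(zlib.compress(pack_i64(deltas(nums)), 9))

print(f"zlib on raw values: {raw_size} bytes")
print(f"zlib on deltas:     {delta_size} bytes")
```

The structure in the data is numerical (a constant step), so a byte-oriented compressor only benefits once a number-aware step has exposed it.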

Usage Details

Wrapped or Standalone

Pco is designed to be easily wrapped into another format. It provides a powerful wrapped API with the building blocks to interleave it with the wrapping format. This is useful if the wrapping format needs to support things like nullability, multiple columns, random access or seeking.

The standalone format is a minimal implementation of a wrapped format. It supports batched decompression only with no other niceties. It is mainly recommended for quick proofs of concept and benchmarking.

Granularity

Pco has a hierarchy of multiple batches per page; multiple pages per chunk; and multiple chunks per file.

| | unit of ___ | size for good compression |
|---|---|---|
| chunk | compression | >10k numbers |
| page | interleaving w/ wrapping format | >1k numbers |
| batch | decompression | 256 numbers (fixed) |
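The hierarchy can be pictured with a short sketch in plain Python. This is not Pco's API: `split` is a hypothetical helper, and the page size of 5,000 is an arbitrary choice for illustration. Only the batch size of 256 comes from the granularity listed above.

```python
BATCH_SIZE = 256  # fixed batch size, per the granularity listed above
PAGE_SIZE = 5_000  # hypothetical; chosen only to be well above 1k numbers


def split(seq, size):
    """Cut a sequence into consecutive pieces of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]


# One chunk of 20,000 numbers (above the ~10k suggested for good compression).
chunk = list(range(20_000))

# A chunk holds multiple pages; each page holds multiple batches.
pages = split(chunk, PAGE_SIZE)
batches_per_page = [len(split(page, BATCH_SIZE)) for page in pages]
last_batch_len = len(split(pages[0], BATCH_SIZE)[-1])

print(f"pages in chunk:       {len(pages)}")
print(f"batches in each page: {batches_per_page}")
print(f"last batch of page 0: {last_batch_len} numbers")
```

Only the final batch of a page may hold fewer than 256 numbers, since the batch size is fixed.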

Mistakes to Avoid

You will get disappointing results from Pco if your data:

Example: the NYC taxi dataset has f64 columns for passenger_base_fare and tolls. Suppose we assign these as fare[0...n] and tolls[0...n] respectively, where n=50,000.

Similarly, we could compress images by making a separate chunk for each flattened channel (red, green, blue), though dedicated formats like WebP likely compress natural images better.
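As a rough sketch of the per-channel idea, the snippet below de-interleaves a flat pixel buffer into one sequence per channel. Each resulting sequence would then be handed to the compressor as its own chunk. `split_channels` is a hypothetical helper written for this illustration, not part of Pco, and the pixel values are made up.

```python
def split_channels(pixels, n_channels=3):
    """De-interleave a flat [r, g, b, r, g, b, ...] list into one list per channel."""
    if len(pixels) % n_channels != 0:
        raise ValueError("pixel buffer length must be a multiple of n_channels")
    # Slice with a stride so each output holds a single channel's values.
    return [pixels[c::n_channels] for c in range(n_channels)]


# Three made-up RGB pixels, stored interleaved.
pixels = [10, 200, 30, 11, 201, 31, 12, 202, 32]

red, green, blue = split_channels(pixels)

print("red:  ", red)
print("green:", green)
print("blue: ", blue)
```

Within each channel, neighboring values are semantically alike, whereas the interleaved buffer mixes three different sequences into one.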

Extra

Docs

benchmarks: see the results

format specification

terminology

Quantile Compression: Pcodec's predecessor

contributing guide

Community

join the Discord