zarrs-python

⚠️ The version of `zarr-python` we currently depend on is still in pre-release and this
package is accordingly extremely experimental.
We cannot guarantee any stability or correctness at the moment, although we have
tested extensively and have tried to make clear what we do and do not support.

This project serves as a bridge between zarrs and zarr-python via PyO3. The main goal of the project is to speed up I/O.

To use the project, simply install our package (which depends on zarr-python>3.0.0b0), and run:

import zarr
import zarrs
zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

You can then use your zarr as normal (with some caveats)!
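
For example, here is a minimal sketch of reading and writing once the pipeline is configured. The store path, shape, and dtype below are made up for illustration, and the exact array-creation API may differ in the pre-release zarr-python this package targets:

import numpy as np
import zarr
import zarrs

zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})

# Hypothetical local store path, purely for illustration.
arr = zarr.open("example.zarr", mode="w", shape=(1024, 1024), chunks=(256, 256), dtype="float32")
arr[:] = np.random.rand(1024, 1024)  # writes are encoded via the zarrs-backed pipeline
sub = arr[0:256, 0:256]              # reads are decoded via it too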

API

We export a ZarrsCodecPipeline class so that zarr-python can use it, but it is not meant to be instantiated directly, and we do not guarantee the stability of its API beyond what zarr-python requires; therefore, it is not documented here. We also export two errors, DiscontiguousArrayError and CollapsedDimensionError, which can be raised while converting indexers into a form zarrs can understand (see below for more details).
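
For example, if you want to fall back to a supported pattern when an indexing operation cannot be translated, you could catch these errors. This is only a sketch: whether a given operation actually raises them depends on the cases listed under Supported Indexing Methods below, and the array here is made up for illustration.

import numpy as np
import zarr
from zarrs import CollapsedDimensionError, DiscontiguousArrayError

zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})
arr = zarr.open("errors_example.zarr", mode="w", shape=(10, 10), chunks=(5, 5), dtype="int32")

try:
    arr[np.array([0, 5]), :] = 42  # a discontinuous integer-array write in 2D (see below)
except (DiscontiguousArrayError, CollapsedDimensionError):
    # Fall back to contiguous writes, one row at a time.
    for row in (0, 5):
        arr[row, :] = 42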

Configuration

ZarrsCodecPipeline options are exposed through zarr.config.

Standard zarr.config options, such as threading.max_workers and array.write_empty_chunks, control some functionality (see the defaults in the config.py of zarr-python).

The ZarrsCodecPipeline-specific options, set under the codec_pipeline key, are validate_checksums, store_empty_chunks, chunk_concurrent_maximum, and chunk_concurrent_minimum.

For example:

zarr.config.set({
    # Maximum number of worker threads used by the pipeline; None defers to the default.
    "threading.max_workers": None,
    # zarr-python option: whether chunks that are entirely the fill value are written.
    "array.write_empty_chunks": False,
    "codec_pipeline": {
        "path": "zarrs.ZarrsCodecPipeline",
        # Validate stored checksums when decoding chunks.
        "validate_checksums": True,
        # Whether chunks that are entirely the fill value are stored.
        "store_empty_chunks": False,
        # Bounds on the number of chunks encoded/decoded concurrently (see Concurrency below).
        "chunk_concurrent_maximum": None,
        "chunk_concurrent_minimum": 4,
    }
})

Concurrency

Concurrency can be classified into two types: chunk (outer) concurrency, where multiple chunks are encoded or decoded in parallel, and codec (inner) concurrency, where multiple threads cooperate on a single chunk.

The product of the chunk and codec concurrency will approximately match threading.max_workers.

Chunk concurrency is typically favored, but zarrs-python will often prefer codec concurrency for sharded arrays, as they are well suited to it.
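
For example, following the product relationship above (a sketch; the numbers are illustrative and the actual split between chunk and codec concurrency is decided internally):

import zarr

# With 16 worker threads and a minimum chunk concurrency of 4, roughly
# 16 / 4 = 4 threads remain available for codec (within-chunk) concurrency.
zarr.config.set({
    "threading.max_workers": 16,
    "codec_pipeline": {
        "path": "zarrs.ZarrsCodecPipeline",
        "chunk_concurrent_minimum": 4,
    },
})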

Supported Indexing Methods

We do not officially support the following indexing methods. Some of these methods may error out, others may not:

  1. Any oindex or vindex integer np.ndarray indexing with dimensionality >=3 i.e.,

    arr[np.array([...]), :, np.array([...])]
    arr[np.array([...]), np.array([...]), np.array([...])]
    arr[np.array([...]), np.array([...]), np.array([...])] = ...
    arr.oindex[np.array([...]), np.array([...]), np.array([...])] = ...
    
  2. Any vindex or oindex discontinuous integer np.ndarray indexing for writes in 2D

    arr[np.array([0, 5]), :] = ...
    arr.oindex[np.array([0, 5]), :] = ...
    
  3. vindex writes in 2D where both indexers are integer np.ndarray indices i.e.,

    arr[np.array([...]), np.array([...])] = ...
    
  4. Ellipsis indexing. We have tested some, but others fail even with zarr-python's default codec pipeline. Thus for now we advise proceeding with caution here.

    arr[0:10, ..., 0:5]
    

Otherwise, we believe we support your indexing case: slices, ints, all integer np.ndarray indices in 2D for reading, contiguous integer np.ndarray indices along one axis for writing, and so on. Please file an issue if you believe our coverage has more holes than we are aware of, or if you wish to contribute! For example, we have an open issue in zarrs for integer-array indexing that would unblock a lot of these cases.

That being said, non-contiguous integer np.ndarray indexing for reads may not be as fast as the other supported methods. Until zarrs supports integer indexing, only chunk fetching is done in Rust, while the indexing itself then occurs in Python.
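
As an illustration of patterns we expect to work (a sketch; the store path, shape, and dtype are made up):

import numpy as np
import zarr
import zarrs

zarr.config.set({"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"})
arr = zarr.open("indexing_example.zarr", mode="w", shape=(100, 100), chunks=(10, 10), dtype="float64")

arr[...] = 1.0                                          # full-array write
arr[3, 10:20] = 2.0                                     # ints and slices
vals = arr[np.array([0, 5, 7]), np.array([1, 2, 3])]    # integer np.ndarray read in 2D
arr[np.array([0, 1, 2]), :] = 3.0                       # contiguous integer indices along one axis for writing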