<!-- THIS FILE IS AUTOGENERATED BY README.ipynb. Ideally you should edit README.ipynb and use 'generate-readme' to produce README.md. But it's okay to edit README.md too directly if you want to fix something -- I can run generate-readme myself later. -->

What is Cachew?

TLDR: cachew lets you cache function calls into an sqlite database on your disk with a single decorator (similar to functools.lru_cache). The difference from functools.lru_cache is that the cached data persists between program runs, so the next time you call your function it only has to read from the cache. The cache is invalidated automatically if your function's arguments change, so you don't have to think about maintaining it.

In order to be cacheable, your function needs to return a simple data type, or an Iterator over such types.

Roughly, a simple type is a primitive (str, int, float, bool, datetime and the like), or a NamedTuple/dataclass whose fields are themselves simple types.

This allows cachew to automatically infer the schema from type hints (PEP 526), so you don't have to think about serializing/deserializing. Thanks to type hints, you don't need to annotate your classes with special decorators, inherit from special base classes, etc., as is often the case with serialization libraries.
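For instance, a plain NamedTuple or dataclass with annotated fields is already cacheable. The Measurement class below is a hypothetical example (its fields are made up for illustration):

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Measurement:
    dt: datetime
    temperature: float  # Celsius
    humidity: float     # percent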

Motivation

I often find myself processing big chunks of data: merging data together, computing aggregates on it, or extracting the few bits I'm interested in. While I try to use the REPL as much as I can, some things are still fragile, and in the course of development you often just have to rerun the whole thing. This can be frustrating if parsing and processing the data takes seconds, let alone minutes in some cases.

The conventional way of dealing with this is serializing the results along with some sort of hash (e.g. md5) of the input files, comparing it on the next run, and returning the cached data if nothing has changed.

Simple as it sounds, it is pretty tedious to do every time you need to memoize some data; it clutters your code with boilerplate and distracts you from your main task.

Examples

Processing Wikipedia

Imagine you're working on a data analysis pipeline for some huge dataset, say, extracting urls and their titles from a Wikipedia archive. Parsing it (the extract_links function) takes hours; however, as long as the archive is the same you will always get the same results. So it would be nice to be able to cache the results somehow.

With this library, you can achieve that with a single @cachew decorator.

>>> from cachew import cachew
>>> from typing import NamedTuple, Iterator
>>> class Link(NamedTuple):
...     url : str
...     text: str
...
>>> @cachew
... def extract_links(archive_path: str) -> Iterator[Link]:
...     for i in range(5):
...         # simulate slow IO
...         # this function runs for five seconds for the purpose of demonstration, but realistically it might take hours
...         import time; time.sleep(1)
...         yield Link(url=f'http://link{i}.org', text=f'text {i}')
...
>>> list(extract_links(archive_path='wikipedia_20190830.zip')) # that would take about 5 seconds on first run
[Link(url='http://link0.org', text='text 0'), Link(url='http://link1.org', text='text 1'), Link(url='http://link2.org', text='text 2'), Link(url='http://link3.org', text='text 3'), Link(url='http://link4.org', text='text 4')]

>>> from timeit import Timer
>>> # the second run is cached, so it should take much less time
>>> res = Timer(lambda: list(extract_links(archive_path='wikipedia_20190830.zip'))).timeit(number=1)
>>> print(f"call took {int(res)} seconds")
call took 0 seconds

>>> # now the file has changed, so the cache will be discarded and the data recomputed
>>> res = Timer(lambda: list(extract_links(archive_path='wikipedia_20200101.zip'))).timeit(number=1)
>>> print(f"call took {int(res)} seconds")
call took 5 seconds

When you call extract_links with the same archive, you start getting results in a matter of milliseconds, limited only by how fast sqlite can read them.

When you use a newer archive, archive_path changes, which makes cachew invalidate the old cache and recompute the data, so you don't need to think about maintaining the cache yourself.

Incremental data exports

This is my most common use case for cachew, which I'll illustrate with an example.

I'm using an environment sensor to log stats about temperature and humidity. The data is synchronized over Bluetooth into an sqlite database, which is easy to access. However, the sensor has limited memory (e.g. the 1000 latest measurements). That means I end up with a new database every few days, each containing only a slice of the data I need, e.g.:

...
20190715100026.db
20190716100138.db
20190717101651.db
20190718100118.db
20190719100701.db
...

To access all of the historic temperature data, I have two options: either read and merge all of the databases every time I need the data (easy, but slow), or maintain a separate "master" database and incrementally merge new chunks into it (fast to read, but tedious and error prone to maintain).

Cachew gives you the best of both worlds and makes it both easy and efficient. The only thing you have to do is decorate your function:

from pathlib import Path
from typing import Iterator, List
from cachew import cachew

@cachew
def measurements(chunks: List[Path]) -> Iterator[Measurement]:
    # ... read each database chunk and yield Measurement objects

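A hypothetical way to call it (the backup directory and glob pattern are made up for illustration):

from pathlib import Path

# gather all the sensor databases synced so far
chunks = sorted(Path('/data/sensor-backups').glob('*.db'))
points = list(measurements(chunks))

As long as the set of databases stays the same, subsequent calls read straight from the cache; once a new database is synced, the argument changes, the cache is invalidated, and the data is recomputed, as explained in the next section.
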
How it works

When the function is called, cachew computes the hash of your function's arguments and compares it against the previously stored hash value. If they match, it deserializes the previously cached rows from sqlite and yields them; if they don't (or there is no cache yet), it calls the original function, stores its results in the cache along with the new hash, and yields them as they are computed.
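A minimal toy sketch of that flow (this is not cachew's actual implementation; it assumes the function's results are JSON-serializable and simply uses repr of the arguments as the "hash"):

import json
import sqlite3

def toy_cache(db_path: str):
    def decorator(func):
        def wrapper(*args):
            new_hash = repr(args)  # stand-in for cachew's argument hash
            with sqlite3.connect(db_path) as conn:
                conn.execute('CREATE TABLE IF NOT EXISTS hash (value TEXT)')
                conn.execute('CREATE TABLE IF NOT EXISTS data (blob TEXT)')
                row = conn.execute('SELECT value FROM hash').fetchone()
                if row is not None and row[0] == new_hash:
                    # hash matches: read the rows back from the cache instead of calling func
                    for (blob,) in conn.execute('SELECT blob FROM data'):
                        yield json.loads(blob)
                    return
                # hash differs (or there is no cache yet): recompute, store results + new hash
                conn.execute('DELETE FROM data')
                conn.execute('DELETE FROM hash')
                for item in func(*args):
                    conn.execute('INSERT INTO data VALUES (?)', (json.dumps(item),))
                    yield item
                conn.execute('INSERT INTO hash VALUES (?)', (new_hash,))
        return wrapper
    return decorator

Cachew additionally infers the schema from your type hints and maps the decoded rows back onto your datatype, so you get actual objects out rather than raw JSON.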

Features

Performance

Updating the cache has some overhead, but how much depends on how complicated your datatype is in the first place, so I'd suggest measuring if you're not sure.

When reading from the cache, all that happens is reading blobs from sqlite, decoding them as JSON, and mapping them onto your target datatype, so the overhead depends on each of these steps.

It would almost certainly make your program faster if your computations take more than several seconds.
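If you're unsure whether caching pays off for a particular function, a quick measurement along these lines (reusing extract_links from the Wikipedia example above) settles it:

from timeit import timeit

# the first call may populate the cache; the second one reads from it
first  = timeit(lambda: list(extract_links(archive_path='wikipedia_20190830.zip')), number=1)
second = timeit(lambda: list(extract_links(archive_path='wikipedia_20190830.zip')), number=1)
print(f'first call: {first:.2f}s, cached call: {second:.2f}s')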

You can find some of my performance benchmarks in the benchmarks/ dir, and the tests themselves in src/cachew/tests/marshall.py.

Using

See the docstring for up-to-date documentation on parameters and return types. You can also use the extensive unit tests as a reference.

Some useful (but optional) arguments of the @cachew decorator:
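For example, cache_path controls where the cache is stored, and depends_on lets you declare extra inputs whose change should invalidate the cache; the exact signatures may differ between versions, so check the docstring. A sketch, reusing Link from the Wikipedia example:

from pathlib import Path
from typing import Iterator
from cachew import cachew

@cachew(
    cache_path='/tmp/cachew/links.sqlite',  # where to keep the cache (illustrative path)
    # also invalidate the cache whenever the archive file itself is modified
    depends_on=lambda archive_path: Path(archive_path).stat().st_mtime,
)
def extract_links(archive_path: str) -> Iterator[Link]:  # Link as defined above
    ...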

Installing

The package is available on PyPI:

pip3 install --user cachew

Developing

I'm using tox to run the tests, and GitHub Actions for CI.

Implementation

Tips and tricks

Optional dependency

You can benefit from cachew even if you don't want to bloat your app's dependencies. Just use the following snippet:

def mcachew(*args, **kwargs):
    """
    Stands for 'Maybe cachew'.
    Defensive wrapper around @cachew to make it an optional dependency.
    """
    try:
        import cachew
    except ModuleNotFoundError:
        import warnings

        warnings.warn('cachew library not found. You might want to install it to speed things up. See https://github.com/karlicoss/cachew')
        return lambda orig_func: orig_func
    else:
        return cachew.cachew(*args, **kwargs)

Now you can use @mcachew in place of @cachew, and be certain things don't break if cachew is missing.
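For example (the cache_path value here is purely illustrative):

@mcachew(cache_path='/tmp/cachew/measurements.sqlite')
def measurements(chunks: List[Path]) -> Iterator[Measurement]:
    ...

Note that mcachew is called with arguments (as a decorator factory), which is exactly what the wrapper above expects: if cachew is missing, it returns an identity decorator and your function is left untouched.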

Settings

cachew.settings exposes some parameters that allow you to control cachew behaviour:
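For example, assuming your version exposes the ENABLE flag, something like this disables caching globally (handy while debugging or in tests); see the settings module for the full list of knobs:

import cachew

# turn caching off globally
cachew.settings.ENABLE = False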

Updating this readme

This is a literate readme, implemented as a Jupyter notebook: README.ipynb. To update the (autogenerated) README.md, use the generate-readme script.