Kedro Auto Catalog

<img src="https://user-images.githubusercontent.com/22648375/219141193-22fdf6c4-a633-4f64-b7ee-01474a0f7dfb.png" width="250" align=right>

A configurable version of the built-in kedro catalog create CLI command. Default dataset types can be configured in the project's settings.py, so that generated catalog entries use those types rather than MemoryDataSets.

Installation

pip install kedro-auto-catalog

Configuration

Configure the project defaults in src/<project_name>/settings.py with this dict.

AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "parquet",
    "default_type": "pandas.ParquetDataSet",
}
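
As an illustration of how these keys might be adapted, a project that prefers CSV files could change the last two values. This is only a sketch: it assumes the plugin accepts any Kedro dataset type string for default_type, and it uses pandas.CSVDataSet, Kedro's CSV counterpart to pandas.ParquetDataSet. The other keys are copied unchanged from the example above.

```python
# src/<project_name>/settings.py
# Hypothetical variant of the example above; the CSV values are an
# assumption for illustration, not taken from the plugin's documentation.
AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "csv",
    "default_type": "pandas.CSVDataSet",
}
```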

Usage

To auto-create catalog entries for the __default__ pipeline, run this from the command line.

kedro auto-catalog -p __default__

If you want a reminder of the available options, use --help.

❯ kedro auto-catalog --help
Usage: kedro auto-catalog [OPTIONS]

  Create Data Catalog YAML configuration with missing datasets.

  Add configurable datasets to Data Catalog YAML configuration file for each
  dataset in a registered pipeline if it is missing from the `DataCatalog`.

  The catalog configuration will be saved to
  `<conf_source>/<env>/catalog/<pipeline_name>.yml` file.

  Configure the project defaults in `src/<project_name>/settings.py` with this
  dict.

Options:
  -e, --env TEXT       Environment to create Data Catalog YAML file in.
                       Defaults to `base`.
  -p, --pipeline TEXT  Name of a pipeline.  [required]
  -h, --help           Show this message and exit.

Example

Using the kedro-spaceflights example, running kedro auto-catalog -p __default__ yields the following catalog in conf/base/catalog/__default__.yml:

X_test:
  filepath: data/X_test.parquet
  type: pandas.ParquetDataSet
X_train:
  filepath: data/X_train.parquet
  type: pandas.ParquetDataSet
y_test:
  filepath: data/y_test.parquet
  type: pandas.ParquetDataSet
y_train:
  filepath: data/y_train.parquet
  type: pandas.ParquetDataSet

subdirs and layers

With the example configuration, "subdirs": ["raw", "intermediate", "primary"] and "layers": ["raw", "intermediate", "primary"], any leading subdir or layer name in a dataset name is converted into a directory and a layer. For example, if we rename y_test to raw_y_test, the file y_test.parquet is placed in the raw directory and the entry is assigned to the raw layer.

raw_y_test:
  filepath: data/raw/y_test.parquet
  layer: raw
  type: pandas.ParquetDataSet
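
The mapping described above can be sketched in a few lines of Python. This is not the plugin's actual implementation: sketch_catalog_entry is a hypothetical helper, and the rule that a prefix only counts when followed by an underscore is an assumption inferred from the raw_y_test example.

```python
# Mirrors the example AUTO_CATALOG dict from the Configuration section.
AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "parquet",
    "default_type": "pandas.ParquetDataSet",
}


def sketch_catalog_entry(dataset_name, config=AUTO_CATALOG):
    """Hypothetical helper: build one catalog entry from a dataset name."""
    path_parts = [config["directory"]]
    file_stem = dataset_name
    entry = {}

    # Assumption: a subdir prefix counts only when followed by "_".
    for subdir in config["subdirs"]:
        if dataset_name.startswith(subdir + "_"):
            path_parts.append(subdir)
            # Drop the prefix and its underscore from the file name.
            file_stem = dataset_name[len(subdir) + 1:]
            break

    # Layers are matched the same way, but only add a "layer" key.
    for layer in config["layers"]:
        if dataset_name.startswith(layer + "_"):
            entry["layer"] = layer
            break

    path_parts.append(f"{file_stem}.{config['default_extension']}")
    entry["filepath"] = "/".join(path_parts)
    entry["type"] = config["default_type"]
    return entry


print(sketch_catalog_entry("raw_y_test"))
print(sketch_catalog_entry("y_test"))
```

Under these assumptions, raw_y_test reproduces the entry shown above, while a name without a recognised prefix, such as y_test, stays directly under data with no layer key.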

License

kedro-auto-catalog is distributed under the terms of the MIT license.