Kedro Auto Catalog
<img src="https://user-images.githubusercontent.com/22648375/219141193-22fdf6c4-a633-4f64-b7ee-01474a0f7dfb.png" width="250" align=right>A configurable version of the built-in kedro catalog create CLI. Default types can be configured in the project's settings.py, so that new catalog entries use these types rather than MemoryDataSets.
Installation
pip install kedro-auto-catalog
Configuration
Configure the project defaults in src/<project_name>/settings.py with this dict.
AUTO_CATALOG = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "parquet",
    "default_type": "pandas.ParquetDataSet",
}
Usage
To automatically create catalog entries for the __default__ pipeline, run this from the command line.
kedro auto-catalog -p __default__
If you want a reminder of the available options, use the --help flag.
❯ kedro auto-catalog --help
Usage: kedro auto-catalog [OPTIONS]

  Create Data Catalog YAML configuration with missing datasets.

  Add configurable datasets to Data Catalog YAML configuration file for each
  dataset in a registered pipeline if it is missing from the `DataCatalog`.

  The catalog configuration will be saved to
  `<conf_source>/<env>/catalog/<pipeline_name>.yml` file.

  Configure the project defaults in `src/<project_name>/settings.py` with this
  dict.

Options:
  -e, --env TEXT       Environment to create Data Catalog YAML file in.
                       Defaults to `base`.
  -p, --pipeline TEXT  Name of a pipeline.  [required]
  -h, --help           Show this message and exit.
Example
Using the kedro-spaceflights example, running kedro auto-catalog -p __default__ yields the following catalog in conf/base/catalog/__default__.yml.
X_test:
  filepath: data/X_test.pq
  type: pandas.ParquetDataSet
X_train:
  filepath: data/X_train.pq
  type: pandas.ParquetDataSet
y_test:
  filepath: data/y_test.parquet
  type: pandas.ParquetDataSet
y_train:
  filepath: data/y_train.parquet
  type: pandas.ParquetDataSet
subdirs and layers
If we use the example configuration with "subdirs": ["raw", "intermediate", "primary"] and "layers": ["raw", "intermediate", "primary"], any leading subdir or layer name in a dataset name is converted into a directory. If we rename y_test to raw_y_test, y_test.parquet is placed in the raw directory and assigned to the raw layer.
raw_y_test:
  filepath: data/raw/y_test.parquet
  layer: raw
  type: pandas.ParquetDataSet
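To make the naming rule concrete, here is a minimal Python sketch of how a dataset name could be turned into a catalog entry. It is not the plugin's actual code: `sketch_entry` is a hypothetical helper written for illustration, and it assumes the prefix is whatever precedes the first underscore and that the prefix is only stripped from the file name when it matches a configured subdir.

```python
from typing import Any, Dict

# Mirrors the AUTO_CATALOG example from the Configuration section.
AUTO_CATALOG: Dict[str, Any] = {
    "directory": "data",
    "subdirs": ["raw", "intermediate", "primary"],
    "layers": ["raw", "intermediate", "primary"],
    "default_extension": "parquet",
    "default_type": "pandas.ParquetDataSet",
}


def sketch_entry(name: str, config: Dict[str, Any]) -> Dict[str, str]:
    """Hypothetical helper: build one catalog entry from a dataset name."""
    # Assumption: the candidate prefix is the text before the first underscore.
    prefix, sep, rest = name.partition("_")
    has_prefix = bool(sep) and bool(rest)

    parts = [config["directory"]]
    stem = name
    if has_prefix and prefix in config["subdirs"]:
        # A matching prefix becomes a subdirectory and is dropped from the file name.
        parts.append(prefix)
        stem = rest

    filename = f"{stem}.{config['default_extension']}"
    entry = {"filepath": "/".join(parts + [filename])}
    if has_prefix and prefix in config["layers"]:
        # A matching prefix also sets the layer.
        entry["layer"] = prefix
    entry["type"] = config["default_type"]
    return entry


print(sketch_entry("raw_y_test", AUTO_CATALOG))
# {'filepath': 'data/raw/y_test.parquet', 'layer': 'raw', 'type': 'pandas.ParquetDataSet'}
print(sketch_entry("y_test", AUTO_CATALOG))
# {'filepath': 'data/y_test.parquet', 'type': 'pandas.ParquetDataSet'}
```

Under these assumptions, a name such as y_test has no configured prefix, so it stays directly under the data directory and gets no layer.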
License
kedro-auto-catalog is distributed under the terms of the MIT license.