
Pypi Duck Flow : Get insights of your python project 🐍

This project is a collection of pipelines that give you insights into your Python project. It also serves an educational purpose (YouTube videos and blogs): learning how to build data pipelines with Python, SQL & DuckDB. You can see the final result of the project in this live dashboard.


The project is a monorepo organized as a series in three parts: ingestion, transformation, and visualization.

Please refer to the CHANGELOG.md for the latest updates.

High level architecture


Development

Setup

The project requires :

There are also two devcontainer definitions for VSCode: one for Python and one for NodeJS. Finally, a Makefile is available to run common tasks.

Env & credentials

A .env file is required to run the project. You can copy the .env.example file and fill the required values.

DATABASE_NAME=duckdb_stats # duckdb database name
TABLE_NAME=pypi_file_downloads # output table name
S3_PATH=s3://my-s3-bucket # output s3 path
AWS_PROFILE=default # aws profile to use
GCP_PROJECT=my-gcp-project # GCP project to use
START_DATE=2023-04-01 # start date of the data to ingest
END_DATE=2023-04-03 # end date of the data to ingest
PYPI_PROJECT=duckdb # pypi project to ingest
GOOGLE_APPLICATION_CREDENTIALS=/path/to/my/creds # path to GCP credentials
motherduck_token=123123 # MotherDuck token
TIMESTAMP_COLUMN=timestamp # timestamp column name, used for partitions on S3
DESTINATION=local # destination to push data to: local (local DuckDB), md (MotherDuck) or s3 (S3)
TRANSFORM_S3_PATH_INPUT=s3://my-input-bucket/pypi_file_downloads/*/*/*.parquet # For transform pipeline, input source data
TRANSFORM_S3_PATH_OUTPUT=s3://my-output-bucket/ # For transform pipeline, output destination when writing data to S3
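
Before running a pipeline, it can help to sanity-check the .env values. The sketch below is not the project's own loader: `parse_env` and `validate` are hypothetical helpers written for illustration, and the rule that START_DATE must not come after END_DATE is an assumption rather than something the project documents. It uses only the Python standard library.

```python
from datetime import date
from typing import Dict

# Destination values taken from the DESTINATION comment in the sample .env.
VALID_DESTINATIONS = {"local", "md", "s3"}


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and dropping inline ' # ...' comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Treat " #" as the start of an inline comment, as in the sample file.
        values[key.strip()] = value.split(" #", 1)[0].strip()
    return values


def validate(env: Dict[str, str]) -> None:
    """Raise ValueError if the destination or the date range looks wrong."""
    dest = env.get("DESTINATION", "")
    if dest not in VALID_DESTINATIONS:
        raise ValueError(
            f"DESTINATION must be one of {sorted(VALID_DESTINATIONS)}, got {dest!r}"
        )
    # Assumption: dates are ISO formatted and START_DATE <= END_DATE.
    start = date.fromisoformat(env["START_DATE"])
    end = date.fromisoformat(env["END_DATE"])
    if start > end:
        raise ValueError("START_DATE must not be after END_DATE")


if __name__ == "__main__":
    with open(".env", encoding="utf-8") as handle:
        settings = parse_env(handle.read())
    validate(settings)
    print("Ingesting", settings["PYPI_PROJECT"], "to", settings["DESTINATION"])
```

If you already use a library such as python-dotenv, you would likely only need the `validate` step.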

Ingestion

Requirements

Run

Once you have filled in your .env file, do the following:

Transformation

Requirements

You can push the output of the transform pipeline either to AWS S3 or to MotherDuck. Both pipelines rely on source data stored on AWS S3 (see the Ingestion section for more details). For this part of the tutorial, you can use a public sample dataset located at s3://us-prd-motherduck-open-datasets/pypi/sample_tutorial/pypi_file_downloads/*/*/*.parquet

For AWS S3, you need:

Run

Fill in your .env file with the following variables. For the tutorial, you can use the TRANSFORM_S3_PATH_INPUT value below, which points to a public bucket containing sample data:

motherduck_token=123123 
TRANSFORM_S3_PATH_INPUT=s3://us-prd-motherduck-open-datasets/pypi/sample_tutorial/pypi_file_downloads/*/*/*.parquet 
TRANSFORM_S3_PATH_OUTPUT=s3://my-output-bucket/ 

You can then run the following commands :

Visualization - Dashboard

The visualization part uses the Evidence framework to build a dashboard. It's a NodeJS project that reads the data produced by the transformation pipeline and stored on MotherDuck. You can also use the available MotherDuck shared database, which includes data from the duckdb PyPI project.

Accessing the shared MotherDuck database

To access the dataset, you only need to create a free MotherDuck account. You can then access the shared database with the following ATTACH statement, run from your DuckDB client (Python, CLI, etc.):

ATTACH 'md:_share/duckdb_stats/507a3c5f-e611-4899-b858-043ce733b57c'
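
From Python, the same ATTACH statement could be run roughly as follows. This is a sketch under assumptions: `attach_statement` and `list_shared_tables` are hypothetical helpers, the `duckdb` package is installed, and a valid `motherduck_token` is set in your environment so that `duckdb.connect("md:")` can authenticate.

```python
SHARE_URL = "md:_share/duckdb_stats/507a3c5f-e611-4899-b858-043ce733b57c"


def attach_statement(share_url: str) -> str:
    """Build the ATTACH statement for a MotherDuck share URL."""
    escaped = share_url.replace("'", "''")  # escape single quotes for the SQL literal
    return f"ATTACH '{escaped}'"


def list_shared_tables(share_url: str = SHARE_URL):
    """Attach the shared database and return the tables visible to the connection."""
    import duckdb  # third-party dependency, imported lazily

    # "md:" connects to MotherDuck; the token is read from the environment.
    con = duckdb.connect("md:")
    con.execute(attach_statement(share_url))
    return con.execute("SHOW ALL TABLES").fetchall()


if __name__ == "__main__":
    for table in list_shared_tables():
        print(table)
```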

Running the dashboard

To run the dashboard, you need NodeJS installed on your machine. You can then: