Comparison of Python pipeline packages: Airflow, Luigi, Gokart, Metaflow, Kedro, PipelineX

This article compares open-source Python packages for pipeline/workflow development: Airflow, Luigi, Gokart, Metaflow, Kedro, PipelineX.

In this article, the terms "pipeline", "workflow", and "DAG" are used almost interchangeably.
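To make the term "DAG" concrete, here is a minimal sketch that uses only the Python standard library (graphlib, Python 3.9+) and none of the packages compared below. The task names are hypothetical. Each task is listed with the tasks it depends on, and a valid execution order is derived from those dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each key maps to the set of tasks it depends on.
dag = {
    "load": set(),
    "clean": {"load"},
    "train": {"clean"},
    "report": {"clean", "train"},
}

# static_order() yields every task only after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Every package in this article does some form of this ordering for you; they differ mainly in how the dependencies are declared and in what they offer around the execution.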

Summary

Package | Airflow | Luigi | Gokart | Metaflow | Kedro | PipelineX
Developer, Maintainer | Airbnb, Apache | Spotify | M3 | Netflix | QuantumBlack (McKinsey) | Yusuke Minami
Wrapped packages |  |  | Luigi |  |  | Kedro, MLflow
Easiness/flexibility to define DAG 👍👍👍👍👍
Modularity of DAG definition 👍👍👍👍👍👍
Unstructured data can be passed between tasks 👍👍👍👍👍👍👍👍👍👍
Built-in various data (file/database) existence check wrappers 👍👍👍👍👍👍👍👍
Built-in various data (file/database) operation (read/write) wrappers 👍👍👍👍👍
Modularity, reusability, testability of data operation 👍👍👍👍👍
Automatic resuming option by detecting the intermediate data 👍👍👍👍👍👍
Force rerun of tasks by detecting parameter change 👍👍
Save parameters for experiments 👍👍👍👍
Parallel execution 👍👍👍👍👍👍
Distributed parallel execution with Celery 👍👍
Visualization of DAG 👍👍👍👍👍👍
Execution status monitoring in GUI 👍👍👍👍
Scheduling, Triggering in GUI 👍
Notification to Slack 👍👍

Airflow

https://github.com/apache/airflow

Released in 2015 by Airbnb.

Airflow enables you to define your DAG (workflow) of tasks in Python code (an independent Python module).

(Optionally, unofficial plugins such as dag-factory enable you to define a DAG in YAML.)

Pros:

Cons:

Luigi

https://github.com/spotify/luigi

Released in 2012 by Spotify.

Luigi enables you to define your pipeline in Python code as subclasses of Task, each with 3 methods (requires, output, run).
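The requires/output/run contract can be illustrated without installing anything. The sketch below is a library-free imitation, not Luigi itself: the Task base class, the build function, and the Extract and Sort tasks are all hypothetical stand-ins. It assumes the default Luigi behavior that a task counts as complete once its output exists, which is what makes resuming from intermediate data possible.

```python
import tempfile
from pathlib import Path


class Task:
    """Hypothetical stand-in for a Luigi-style task; not Luigi's real base class."""

    def requires(self):
        return []  # upstream tasks this task depends on

    def output(self) -> Path:
        raise NotImplementedError  # where the task writes its result

    def run(self):
        raise NotImplementedError  # the actual work


def build(task, log):
    """Run upstream tasks first; skip any task whose output already exists."""
    if task.output().exists():
        return
    for upstream in task.requires():
        build(upstream, log)
    task.run()
    log.append(type(task).__name__)


workdir = Path(tempfile.mkdtemp())


class Extract(Task):
    def output(self):
        return workdir / "raw.txt"

    def run(self):
        self.output().write_text("3\n1\n2\n")


class Sort(Task):
    def requires(self):
        return [Extract()]

    def output(self):
        return workdir / "sorted.txt"

    def run(self):
        lines = self.requires()[0].output().read_text().split()
        self.output().write_text("\n".join(sorted(lines)))


first_run, second_run = [], []
build(Sort(), first_run)   # runs Extract, then Sort
build(Sort(), second_run)  # nothing to do: sorted.txt already exists
```

In real Luigi code the classes would inherit from luigi.Task and output would return a target object such as luigi.LocalTarget rather than a bare path.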

Pros:

Cons:

Gokart

https://github.com/m3dev/gokart

Released in Dec 2018 by M3.

Gokart works on top of Luigi.

Pros:

In addition to Luigi's advantages:

Cons:

Metaflow

https://github.com/Netflix/metaflow

Released in Dec 2019 by Netflix.

Metaflow enables you to define your pipeline in Python code as a subclass of FlowSpec whose methods are marked with the step decorator.
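The shape of such a flow can be sketched without Metaflow installed. Everything below is a hypothetical imitation: MiniFlow and this step decorator are stand-ins written for illustration, not Metaflow's FlowSpec or its real decorator. The sketch assumes the Metaflow conventions that a flow begins at a step named start, finishes at a step named end, names each successor through self.next, and passes data between steps as attributes on self.

```python
def step(func):
    """Mark a method as a pipeline step (stand-in for a step decorator)."""
    func.is_step = True
    return func


class MiniFlow:
    """Hypothetical base class: runs from `start` until a step names no successor."""

    def next(self, method):
        self._next = method

    def run(self):
        self.history = []
        current = self.start
        while current is not None:
            if not getattr(current, "is_step", False):
                raise TypeError(f"{current.__name__} is not marked with @step")
            self._next = None
            current()
            self.history.append(current.__name__)
            current = self._next
        return self


class SquareFlow(MiniFlow):
    @step
    def start(self):
        self.numbers = [1, 2, 3]  # state is shared via attributes on self
        self.next(self.square)

    @step
    def square(self):
        self.squares = [n * n for n in self.numbers]
        self.next(self.end)

    @step
    def end(self):
        self.total = sum(self.squares)


flow = SquareFlow().run()
print(flow.history, flow.total)
```

An actual Metaflow flow is normally launched from the command line rather than by calling a run method on the instance, so treat the runner above purely as a reading aid.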

Pros:

Cons:

Kedro

https://github.com/quantumblacklabs/kedro

Released in May 2019 by QuantumBlack, part of McKinsey & Company.

Kedro enables you to define pipelines in Python code (an independent Python module) as a list of nodes. Each node is created by the node function with 3 arguments: func (the task processing function), inputs (the input data name, or a list or dict if multiple), and outputs (the output data name, or a list or dict if multiple).
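The idea of wiring functions together by data names can be sketched as follows. This is a library-free imitation under stated assumptions, not Kedro's implementation: the Node tuple, this node helper, and the run function are hypothetical. For brevity the sketch runs nodes in list order, supports only str and list names (not dict), and uses a plain dict where Kedro would use its data catalog.

```python
from typing import Callable, NamedTuple, Union

Names = Union[str, list]


class Node(NamedTuple):
    func: Callable   # task processing function
    inputs: Names    # input data name(s)
    outputs: Names   # output data name(s)


def node(func, inputs, outputs):
    """Hypothetical stand-in for a node factory with the 3 arguments above."""
    return Node(func, inputs, outputs)


def as_list(names):
    return [names] if isinstance(names, str) else list(names)


def run(nodes, catalog):
    """Run nodes in list order, reading inputs from and writing outputs to catalog."""
    for n in nodes:
        args = [catalog[name] for name in as_list(n.inputs)]
        result = n.func(*args)
        out_names = as_list(n.outputs)
        results = [result] if len(out_names) == 1 else list(result)
        catalog.update(zip(out_names, results))
    return catalog


def split(numbers):
    return numbers[:2], numbers[2:]


def total(head, tail):
    return sum(head) + sum(tail)


pipeline = [
    node(split, inputs="numbers", outputs=["head", "tail"]),
    node(total, inputs=["head", "tail"], outputs="total"),
]
catalog = run(pipeline, {"numbers": [1, 2, 3, 4]})
print(catalog["total"])
```

Because the processing functions never read or write data themselves, they can be unit-tested with plain Python values, which is the modularity and testability point rated in the summary.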

Pros:

Cons:

PipelineX

https://github.com/Minyus/pipelinex

Released in Nov 2019 by a Kedro user (me).

PipelineX works on top of Kedro and MLflow.

PipelineX enables you to define your pipeline in YAML (an independent YAML file).

Pros:

In addition to Kedro's advantages:

Cons:

Platform-specific options

Argo

https://github.com/argoproj/argo

Uses Kubernetes to run pipelines.

Kubeflow Pipelines

https://github.com/kubeflow/pipelines

Works on top of Argo.

Oozie

https://github.com/apache/oozie

Manages Hadoop jobs.

Azkaban

https://github.com/azkaban/azkaban

Manages Hadoop jobs.

GitLab CI/CD

https://docs.gitlab.com/ee/ci/

Inaccuracies

Please let me know if you find anything inaccurate.

Pull requests for https://github.com/Minyus/Python_Packages_for_Pipeline_Workflow/blob/master/README.md are welcome.