My Awesome Data Ops Resources
A curated list of data operations resources, focused on use by cultural heritage organizations.
Books
- The DataOps Cookbook a 135-page book that describes the step-by-step implementation of Data Ops.
Papers and Blogs
ETL
Data Quality
Metadata
Pipeline Engineering
Data Ops Software
Data Pipeline Orchestration
- Airflow an open-source platform to programmatically author, schedule and monitor data pipelines.
- Apache Oozie an open-source workflow scheduler system to manage Apache Hadoop jobs.
- DBT (Data Build Tool) a command-line tool that enables data analysts and engineers to transform data in their warehouse more effectively.
- BMC Control-M a digital business automation solution that simplifies and automates diverse batch application workloads.
- DataKitchen a DataOps Platform that reduces analytics cycle time by monitoring data quality and providing automated support for the deployment of data and new analytics.
- Reflow Reflow is a system for incremental data processing in the cloud. Reflow enables scientists and engineers to compose existing tools (packaged in Docker images) using ordinary programming constructs.
- ElementL A company currently in stealth mode, founded by ex-Facebook director and GraphQL co-creator Nick Schrock. Dagster Open Source.
- Astronomer.io Astronomer recently re-focused on Airflow support. They make it easy to deploy and manage your own Apache Airflow webserver, so you can get straight to writing workflows.
- Piperr.io Use Piperr’s pre-built data pipelines across enterprise stakeholders: from IT to analytics, and from tech and data science to lines of business (LoBs).
- Prefect Technologies Open-source data engineering platform that builds, tests, and runs data workflows.
- Genie Distributed Big Data Orchestration Service by Netflix
Testing and Production Quality
- ICEDQ software used to automate the testing of ETL/Data Warehouse and Data Migration.
- Naveego A simple, cloud-based platform that allows you to deliver accurate dashboards by taking a bottom-up approach to data quality and exception management.
- DataKitchen a DataOps Platform that improves data quality by providing lean manufacturing controls to test and monitor data.
- FirstEigen Automatic Data Quality Rule Discovery and Continuous Data Monitoring
- Great Expectations Great Expectations is a framework that helps teams save time and promote analytic integrity with a new twist on automated testing: pipeline tests. Pipeline tests are applied to data (instead of code) and at batch time (instead of at compile or deploy time).
- Enterprise Data Foundation Open-source enterprise data toolkit providing efficient unit testing, automated refreshes, and automated deployment.
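The idea of a pipeline test mentioned in the Great Expectations entry above (a check applied to each batch of data rather than to code) can be illustrated with a minimal sketch in plain Python. This is not the Great Expectations API: the helper names `not_null`, `values_between`, and `run_pipeline_tests`, as well as the sample collection records, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[List[Row]], bool]


def not_null(column: str) -> Check:
    """Hypothetical check: every row has a non-null value in `column`."""
    def check(rows: List[Row]) -> bool:
        return all(row.get(column) is not None for row in rows)
    return check


def values_between(column: str, low: int, high: int) -> Check:
    """Hypothetical check: non-null values in `column` lie in [low, high]."""
    def check(rows: List[Row]) -> bool:
        return all(
            low <= row[column] <= high
            for row in rows
            if row.get(column) is not None
        )
    return check


def run_pipeline_tests(rows: List[Row], checks: Dict[str, Check]) -> List[str]:
    """Run every check against one batch and return the names that failed."""
    return [name for name, check in checks.items() if not check(rows)]


# Made-up batch of collection records, as it might arrive from an ETL step.
batch = [
    {"object_id": "A-1", "year_created": 1889},
    {"object_id": None, "year_created": 1923},
]

checks = {
    "object_id not null": not_null("object_id"),
    "year_created in range": values_between("year_created", 1000, 2100),
}

failures = run_pipeline_tests(batch, checks)
print(failures)
```

In a real pipeline, a non-empty list of failures would typically stop the batch from being loaded or trigger an alert. The tools listed above provide much richer versions of this pattern.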
Deployment Automation and Development Sandbox Creation
- Jenkins a ‘CI/CD’ tool used by software development teams to deploy code from development into production
- DataKitchen a DataOps Platform that supports the deployment of all data analytics code and configuration.
- Amaterasu is a deployment tool for data pipelines. Amaterasu allows developers to write and easily deploy data pipelines, and clusters manage their configuration and dependencies.
- Meltano aims to be a complete solution for data teams — the name stands for model, extract, load, transform, analyze, notebook, orchestrate — in other words, the data science lifecycle.
Data Science Model Deployment
- Domino accelerates the development and delivery of models with infrastructure automation, seamless collaboration, and automated reproducibility.
- Hydrosphere.io deploys batch Spark functions and machine-learning models, and assures the quality of end-to-end pipelines.
- Open Data Group a software solution that facilitates the deployment of analytics using models.
- ParallelM moves machine learning into production, automates orchestration, and manages the ML pipeline.
- Seldon streamlines the data science workflow, with audit trails, advanced experiments, continuous integration, and deployment.
- Metis Machine Enterprise-scale Machine Learning and Deep Learning deployment and automation platform for rapid deployment of models into existing infrastructure and applications.
- Datatron Automate deployment and monitoring of AI Models.
- DSFlow Go from data extraction to business value in days, not months. Built on top of open source tech, using Silicon Valley’s best practices.
- Datmo Datmo tools help you seamlessly deploy and manage models in a scalable, reliable, and cost-optimized way.
- MLflow An open source platform for the complete machine learning lifecycle, from Databricks.
- Studio.ML Studio is a model management framework written in Python to help simplify and expedite your model building experience.
- Comet.ML Comet.ml allows data science teams and individuals to automagically track their datasets, code changes, experimentation history and production models creating efficiency, transparency, and reproducibility.
- Polyaxon An open source platform for reproducible machine learning at scale.
- Missinglink.ai MissingLink helps data engineers streamline and automate the entire deep learning lifecycle.
- Kubeflow The Machine Learning Toolkit for Kubernetes
- Verta.ai Models are the new code!