The Post-Modern Stack

Joining the modern data stack with the modern ML stack

Overview

As part of our TDS series on MLOps, our blog post shows how a post-modern stack works, by deconstructing (see the pun?) our original YDNABB repo into the few fundamental pieces that own the actual compute: a data warehouse for DataOps, and Metaflow on AWS for MLOps. A quick, high-level walk-through of the stack can be found in our intro video:

YouTube intro video

As a use case, we pick a popular RecSys challenge, session-based recommendation: given the interactions between a shopper and some products in a browsing session, can we train a model to predict what the next interaction will be? The flow is powered by our open-source Coveo Data Challenge dataset; as our model, we train a vanilla LSTM, just complex enough to make good use of cloud computing. At a quick glance, this is what we are building:

The post-modern stack

As usual, we show a working, end-to-end, real-world flow: while you can run it locally with a few thousand sessions to get the basics, we suggest you use the MAX_SESSIONS variable to appreciate how well the stack scales - with no code changes - as millions of events are pushed to the warehouse.

For an in-depth explanation of the philosophy behind the approach, please check the companion blog post, and the previous episodes / repos in the series.

Pre-requisites

The code is a self-contained recommender project; however, since we leverage best-in-class tools, some preliminary setup is required. Please make sure the requirements are satisfied, depending on what you wish to run - roughly in order of ascending complexity:

The basics: Metaflow, Snowflake and dbt

A Snowflake account is needed to host the data, and a working Metaflow + dbt setup is needed to run the flow; we strongly suggest running Metaflow on AWS (as that is the intended setup), but with some minor modifications you should be able to run the flow with a local datastore as well.

Adding experiment tracking

Adding PaaS deployment

Adding dbt cloud

Setup

Virtual env

Set up a virtual environment with the project dependencies:
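For example, something along these lines should work (a minimal sketch, assuming the standard venv workflow and a requirements.txt file at the project root - adjust to your tool of choice):

```bash
# create and activate a virtual environment, then install the project dependencies
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```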

NOTE: the current version of RecList has some old dependencies which may result in some (harmless) pip conflicts - these will disappear with the new version, coming out soon.

Create a local copy of the local.env file, rename it to .env (do not commit it!), and make sure its values are filled in properly:

| VARIABLE | TYPE | MEANING |
| --- | --- | --- |
| SF_USER | string | Snowflake user name |
| SF_PWD | string | Snowflake password |
| SF_ACCOUNT | string | Snowflake account |
| SF_DB | string | Snowflake database |
| SF_SCHEMA | string (suggested: POST_MODERN_DATA_STACK) | Snowflake schema for raw and transformed data |
| SF_TABLE | string (COVEO_DATASET_RAW) | Snowflake table for raw data |
| SF_ROLE | string | Snowflake role to run SQL |
| APPLICATION_API_KEY | uuid (474d1224-e231-42ed-9fc9-058c2a8347a5) | Organization id to simulate a SaaS company |
| MAX_SESSIONS | int (1000) | Number of raw sessions to load into Snowflake (try first running the project locally with a small number) |
| EN_BATCH | 0-1 (0) | Enable/disable cloud computing for @batch steps in Metaflow (try first running the project locally) |
| COMET_API_KEY | string | Comet ML api key |
| DBT_CLOUD | 0-1 (0) | Enable/disable running dbt on the cloud |
| SAGEMAKER_DEPLOY | 0-1 (1) | Enable/disable deploying the model artifact to a Sagemaker endpoint |
| DBT_ACCOUNT_ID | int | dbt cloud account id (you can find it in the dbt cloud URL) |
| DBT_PROJECT_ID | int | dbt cloud project id (you can find it in the dbt cloud URL) |
| DBT_JOB_ID | int | dbt cloud job id (you can find it in the dbt cloud URL) |
| DBT_API_KEY | string | dbt cloud api key |
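For reference, a filled-in .env might look like the following sketch - every value below is a placeholder (apart from the suggested defaults in the table above) and should be replaced with your own credentials and ids:

```
SF_USER=my_snowflake_user
SF_PWD=my_snowflake_password
SF_ACCOUNT=my_account.my_region
SF_DB=MY_DATABASE
SF_SCHEMA=POST_MODERN_DATA_STACK
SF_TABLE=COVEO_DATASET_RAW
SF_ROLE=my_role
APPLICATION_API_KEY=474d1224-e231-42ed-9fc9-058c2a8347a5
MAX_SESSIONS=1000
EN_BATCH=0
COMET_API_KEY=my_comet_api_key
DBT_CLOUD=0
SAGEMAKER_DEPLOY=1
DBT_ACCOUNT_ID=11111
DBT_PROJECT_ID=22222
DBT_JOB_ID=33333
DBT_API_KEY=my_dbt_cloud_api_key
```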

Load data into Snowflake

Original datasets are from the Coveo SIGIR Data Challenge. To save you from downloading the original data dump and dealing with large text files, we re-used the abstraction over the data provided by RecList. If you run upload_to_snowflake.py in the upload folder from your laptop as a one-off script, the program will download the Data Challenge dataset and dump it to a Snowflake table that simulates the append-only log pattern. This allows us to use dbt and Metaflow to run realistic ELT and ML code over real-world data.
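In practice, the one-off upload boils down to something like this (a sketch - check the script itself for any arguments it accepts), run after your .env file has been filled in:

```bash
# run from the repo root, with the .env values in place
cd upload
python upload_to_snowflake.py
```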

Once you run the script, check your Snowflake for the new schema/table:

Raw table in Snowflake

If you wish to see how a data ingestion pipeline works (i.e. an endpoint streaming individual events into Snowflake in real time, instead of a bulk upload), we open-sourced a serverless pipeline as well.

dbt

While we will run dbt code as part of Metaflow, it is good practice to try and see if everything works from a stand-alone setup first. To run and test the dbt transformations, just cd into the dbt folder and run dbt run --vars '{SF_SCHEMA: POST_MODERN_DATA_STACK, SF_TABLE: COVEO_DATASET_RAW}', where the variables reflect the content of your .env file (you can also run dbt test, if you like).
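Spelled out, the stand-alone check looks like this (the variable values should mirror your .env file):

```bash
cd dbt
dbt run --vars '{SF_SCHEMA: POST_MODERN_DATA_STACK, SF_TABLE: COVEO_DATASET_RAW}'
# optionally, run the tests as well
dbt test --vars '{SF_SCHEMA: POST_MODERN_DATA_STACK, SF_TABLE: COVEO_DATASET_RAW}'
```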

Once you run dbt, check your Snowflake for the views:

Views in Snowflake

The DBT_CLOUD variable (see above) controls whether transformations are run from the local dbt folder as part of the flow, or from a dbt cloud account, using the dbt Cloud API to trigger the transformation on the platform (see the sketch at the end of this section). If you want to leverage dbt cloud, make sure to manually create a job on the platform, and then configure the relevant variables in the .env file. In our tests, we used the exact same .sql and .yml files that you find in this repository:

<img src="/images/dbt_cloud.png" height="250">

Please note that instead of having a local dbt folder, you could keep your dbt code in a GitHub repo and then either clone it at runtime using the GitHub APIs, or import it into dbt cloud and use the platform to run the code base.
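For the curious, triggering a job on dbt cloud amounts to one authenticated POST against the dbt Cloud v2 API, using the account / job ids and API key from your .env file. This is a rough sketch of what happens when DBT_CLOUD=1, not the flow's exact code:

```bash
# trigger a dbt cloud job run (rough sketch - ids and key come from your .env file)
curl -X POST \
  -H "Authorization: Token ${DBT_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"cause": "Triggered by the Metaflow run"}' \
  "https://cloud.getdbt.com/api/v2/accounts/${DBT_ACCOUNT_ID}/jobs/${DBT_JOB_ID}/run/"
```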

How to run (a.k.a. the whole enchilada)

Run the flow

Once the above setup steps are completed, you can run the flow:
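The exact command depends on your Metaflow and AWS configuration; with our profiles it looks roughly like this (replace the profile / region values with yours, and add any Metaflow run options you need):

```bash
METAFLOW_PROFILE=metaflow AWS_PROFILE=tooso AWS_DEFAULT_REGION=us-west-2 python my_dbt_flow.py run
```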

Results

If you run the fully-featured flow (i.e. SAGEMAKER_DEPLOY=1) with the recommended setup, you will end up with transformed views in Snowflake, a trained model tracked in Comet, and a model artifact deployed behind a SageMaker endpoint.

If you log into your AWS SageMaker interface, you should find the new endpoint for next-event prediction available for inference:

aws sagemaker UI

If you run the flow with dbt cloud, you will also find the dbt run in the history section on the cloud platform, easily identifiable through the flow id and user.

dbt run history

BONUS: RecList and Metaflow cards

The project includes a (stub of a) custom DAG card showing how the model performs according to RecList, our open-source framework for behavioral testing. We could devote an article / paper just to this (as we actually did recently!); you can visualize the card with `METAFLOW_PROFILE=metaflow AWS_PROFILE=tooso AWS_DEFAULT_REGION=us-west-2 python my_dbt_flow.py card view test_model --id recCard` at the end of your run. No matter how small, we wanted to include the card/test as a reminder of how important it is to understand model behavior before deployment. Cards are a natural UI to display some of the RecList information: since readable, shareable (self-)documentation is crucial for production, the next major release of RecList will include out-of-the-box support for visualization and reporting tools - reach out if you're interested!

As a bonus bonus feature (thanks Valay for the snippet!), only when running with the dbt core setup, the (not-production-ready) function get_dag_from_manifest will read the local manifest file and produce a dictionary compatible with the Metaflow Card API. If you type `METAFLOW_PROFILE=metaflow AWS_PROFILE=tooso AWS_DEFAULT_REGION=us-west-2 python my_dbt_flow.py card view run_transformation --id dbtCard` at the end of a successful run, you should see the dbt DAG displayed as a Metaflow card, as in the image below:

dbt card on Metaflow
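For convenience, here are both card commands in one place (as above, the profile / region values reflect our setup and should be replaced with yours):

```bash
# RecList behavioral-testing card, attached to the test_model step
METAFLOW_PROFILE=metaflow AWS_PROFILE=tooso AWS_DEFAULT_REGION=us-west-2 python my_dbt_flow.py card view test_model --id recCard

# dbt DAG card, attached to the run_transformation step (dbt core setup only)
METAFLOW_PROFILE=metaflow AWS_PROFILE=tooso AWS_DEFAULT_REGION=us-west-2 python my_dbt_flow.py card view run_transformation --id dbtCard
```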

We leave it to the reader (and/or to future iterations) to explore how to combine dbt, RecList and other info into a custom, well-designed card!

What's next?

Of course, the post-modern stack can be further expanded or improved in many ways. Without any claim to completeness, here are some ideas to get started:

Is this the only way to run dbt in Metaflow? Of course not - in particular, you could write a small wrapper around a flow and a dbt-core project that creates individual Metaflow steps corresponding to individual dbt steps, pretty much as suggested here for another orchestrator. But this is surely a story for another repo / time ;-)

Acknowledgements

Special thanks to Sung Won Chung from dbt Labs, Hugo Bowne-Anderson, Gaurav Bhushan, Savin Goyal, Valay Dave from Outerbounds, Luca Bigon, Andrea Polonioli and Ciro Greco from Coveo.

If you liked this project and the related article, please take a second to add a star to this and our RecList repository!

Contributors:

License

All the code in this repo is freely available under an MIT License, also included in the project.