
recs-at-resonable-scale

Recommendations at "Reasonable Scale": joining dataOps with deep learning recSys with Merlin and Metaflow (blog)

Overview

February 2023: aside from behavioral testing, the ML pipeline is now complete. A blog post was just published on NVIDIA's Medium!

This project is a collaboration with the Outerbounds, NVIDIA Merlin and Comet teams, in an effort to release as open-source code a realistic data and ML pipeline for cutting-edge recommender systems "that just works". Anyone can do great ML, not just Big Tech, if you know how to pick and choose your tools.

TL;DR: (after setup) a single ML person is able to train a cutting-edge deep learning model (actually, several versions of it in parallel), test it and deploy it without any explicit infrastructure work, without talking to any DevOps person, and without using anything that is not Python or SQL.

As a use case, we pick a popular RecSys challenge, user-item recommendations for the fashion industry: given the past purchases of a shopper, can we train a model to predict what they will buy next? In the current V1.0, we target a typical offline training, cached predictions setup: we prepare the top-k recommendations for our users in advance, and store them in a fast cache to be served when shoppers go online.

Our goal is to build a pipeline with all the necessary real-world ingredients:

At a quick glance, this is what we are building:

Recs at reasonable scale

For an in-depth explanation of the philosophy behind the approach, please check the companion blog post or watch our NVIDIA Summit keynote.

If you like this project, please add a star on GitHub here and check out / share / star the RecList package.

Quick Links

This project builds on our open roadmap for "MLOps at Reasonable Scale", automated documentation of pipelines, and rounded evaluation for RecSys:

Pre-requisites

The code is a self-contained, end-to-end recommender project; however, since we leverage best-in-class tools, some preliminary (one-time) setup is required. Please make sure the requirements are satisfied, depending on what you wish to run and on what you are already using - roughly in order of ascending complexity:

The basics: Metaflow, Snowflake and dbt

A Snowflake account is needed to host the data, and a working Metaflow setup is needed to run the flow on AWS GPUs if you wish to do so:
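As a rough sketch (assuming you go the AWS route), the Metaflow side amounts to installing the client and running the interactive configuration wizard, which will ask for your S3 datastore, AWS Batch queue, and so on:

```bash
# Install the Metaflow client and configure it for AWS (one-time setup).
# The interactive wizard prompts for your S3 bucket, Batch queue, etc.
pip install metaflow
metaflow configure aws
```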

Please note that while the current implementation focuses on Metaflow on AWS (with Batch), the exact same code (with a change of decorator!) would work on any Kubernetes-based infrastructure (or on Azure!). For the same reason, Snowflake can be replaced with other warehouses, leaving the main result unchanged: an end-to-end, scalable, production-ready pipeline for deep learning recommendations.

Adding experiment tracking

Adding PaaS deployment

Adding a Streamlit app for error analysis

A note on containers

At the time of writing, Merlin does not provide an official image on ECR, so we pulled nvcr.io/nvidia/merlin/merlin-tensorflow:22.11 and slightly changed the entry point to work with Metaflow / AWS Batch. The docker folder contains the relevant files. The current flow uses a public ECR repository (public.ecr.aws/outerbounds/merlin-reasonable-scale:22.11-latest) we prepared on our AWS account for running training on Batch; if you wish to use your own ECR repository, or if the repo above becomes unavailable for whatever reason, you can just change the relevant image parameter in the flow (note: you need to register for a free NVIDIA account first to be able to pull from nvcr.io).
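For reference, the public image can be pulled locally with Docker if you want to inspect or customize it before pointing the flow at your own repository (this step is optional and assumes Docker is installed):

```bash
# Optional: pull the pre-built public image to inspect or customize it.
docker pull public.ecr.aws/outerbounds/merlin-reasonable-scale:22.11-latest
```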

Setup

We recommend using Python 3.9 for this project.

Virtual env

Set up a virtual environment with the project dependencies:
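A minimal sketch, assuming you use venv and that the project dependencies are listed in a requirements.txt file at the root of the repo:

```bash
# Create and activate a virtual environment (Python 3.9 recommended),
# then install the project dependencies.
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```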

Note that if you never plan on running Merlin's training locally, but only through Metaflow + AWS Batch, you can skip installing the Merlin and TensorFlow libraries.

NOTE: if you plan on using the Streamlit app (above), make sure to also pip install the requirements_app.txt file in the app folder.

Inside src, create a version of the local.env file named only .env (do not commit it!), and fill in its values:

| VARIABLE | TYPE (DEFAULT) | MEANING |
| --- | --- | --- |
| SF_USER | string | Snowflake user name |
| SF_PWD | string | Snowflake password |
| SF_ACCOUNT | string | Snowflake account |
| SF_DB | string | Snowflake database |
| SF_ROLE | string | Snowflake role to run SQL |
| SF_WAREHOUSE | string | Snowflake warehouse to run SQL |
| EN_BATCH | 0-1 (0) | Enable cloud computing for Metaflow |
| COMET_API_KEY | string | Comet ML api key |
| EXPORT_TO_APP | 0-1 (0) | Enable exporting predictions for inspection through Streamlit |
| SAVE_TO_CACHE | 0-1 (0) | Enable storing predictions to an external cache for serving. If 1, you need to deploy the AWS Lambda (see above) before running the flow |
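For reference, a filled-in src/.env might look like the sketch below - all values are placeholders and should be replaced with your own credentials and settings:

```bash
# src/.env - placeholder values, do not commit this file!
SF_USER=my_snowflake_user
SF_PWD=my_snowflake_password
SF_ACCOUNT=my_snowflake_account
SF_DB=my_snowflake_database
SF_ROLE=my_snowflake_role
SF_WAREHOUSE=my_snowflake_warehouse
EN_BATCH=0
COMET_API_KEY=my_comet_api_key
EXPORT_TO_APP=0
SAVE_TO_CACHE=0
```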

Load data into Snowflake

The original dataset is from the H&M data challenge.
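The exact loading command depends on the script shipped in the repo; as a purely hypothetical sketch, assuming the Kaggle CSVs have been downloaded locally and the upload script lives in src (the script name below is illustrative, not the actual file name):

```bash
# Hypothetical example - replace the script name with the actual loader in src.
cd src
python upload_dataset_to_snowflake.py
```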

Once you run the script, check your Snowflake for the new tables:

Raw tables in Snowflake

dbt

After the data is loaded, we use dbt as our transformation tool of choice. While you can run dbt code as part of a Metaflow pipeline, we keep the dbt part separate in this project to simplify the runtime component: it will be trivial (as shown here for example) to orchestrate the SQL code within Metaflow if you wish to do so. After the data is loaded in Snowflake:
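A minimal sketch of the dbt step, assuming the dbt project lives in a dbt folder and your Snowflake connection is already configured in profiles.yml:

```bash
# Run the dbt models that build the joined tables in Snowflake.
cd dbt
dbt run
```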

Check your Snowflake for the new tables created by dbt:

Dbt tables

In particular, the table "EXPLORATION_DB"."HM_POST"."FILTERED_DATAFRAME" represents a dataframe in which user, article and transaction data are all joined together - the Metaflow pipeline will read from this table, leveraging the pre-processing done at scale through dbt and Snowflake.

How to run the entire project

Run the flow

Once the above setup steps are completed, you can run the flow:
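A minimal sketch, assuming the .env file above is in place and the virtual environment is active:

```bash
# From the src folder, launch the Metaflow pipeline end-to-end.
cd src
python my_merlin_flow.py run
```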

At the end of the flow, you can inspect the default DAG Card with python my_merlin_flow.py card view get_dataset:

Metaflow card

For an intro to DAG cards, please check our NeurIPS 2021 paper.

Results

If you run the flow with the full setup, you will end up with:

Experiment dashboard

If you have set EXPORT_TO_APP=1 (and completed the setup), you can also visualize predictions using a Streamlit app that:

Cd into the app folder and run streamlit run pred_inspector.py (make sure the Metaflow environment variables are set, as usual). You can filter by product type of the target item and use text-to-image search to sort items (try, for example, "jeans" or "short sleeves").
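Putting it all together, launching the app looks roughly like this (assuming the Metaflow environment variables are already set):

```bash
# Install the app-specific dependencies and start the prediction inspector.
cd app
pip install -r requirements_app.txt
streamlit run pred_inspector.py
```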

Debugging app

Where to go from here (either with us or by yourself)?

Q&A

Acknowledgements

Main Contributors:

Special thanks:

License

All the code in this repo is freely available under an MIT License, also included in the project.