Recipe Cuisine Type and Rating Prediction with Recipes from food52.com and epicurious.com

Goal

Our overarching goal for this project is to determine whether the name, ingredients/recipe text, or general keywords of recipes from popular American food websites can help us predict a recipe's rating (on a scale of 0 to 4) or its cuisine type (American, European, Asian, South American, Middle Eastern/African, or Unknown).

ETL

Web Scraping

Our data were scraped from all available recipes on two popular recipe websites (food52.com and epicurious.com). From each recipe, we pulled the name, ingredients, keywords, cuisine type, and rating.

Cleaning/Vectorization

Non-alphanumeric symbols and common stop words were removed from the text (recipe name, keywords, ingredients), and the cleaned text was then transformed into vectors with count vectorization.
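The cleaning and vectorization step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `clean_text` helper, the sample recipes, and the choice of scikit-learn's built-in English stop-word list are all assumptions.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def clean_text(text):
    """Lowercase the text and replace every non-alphanumeric character with a space."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical sample records; the real corpus came from the scraped recipes.
recipes = [
    "Grandma's Apple Pie: apples, sugar, butter & cinnamon!",
    "Spicy Thai Basil Chicken (with garlic, chili and basil)",
]

cleaned = [clean_text(r) for r in recipes]

# stop_words="english" drops common English stop words such as "with" and "and".
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(cleaned)  # sparse matrix, one row per recipe

# vocabulary_ maps each kept term to its column index in `counts`.
vocabulary = sorted(vectorizer.vocabulary_)
```

Each row of `counts` holds the term counts for one recipe, so a term that appears twice in a recipe (here, "basil") gets a count of 2 in that row.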

<p align="center"> <img src="project3_visuals/Screen Shot 2018-10-12 at 2.42.02 PM.png" title="recipe corpus for each of the cuisine types"> </p>

The graph showed a clear class imbalance: most of the recipes were American or Unknown. We therefore had to oversample the minority classes.

<p align="center"> <img src="project3_visuals/similarityingredients.png" title="compare cosine similarity of the cuisine corpus"> </p>

The heat map showed that the corpora of most cuisine types are distinct from one another. Each cuisine corpus was vectorized with TF-IDF, and the similarity score between corpora was calculated with cosine similarity.
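The two ideas in this section, oversampling and corpus similarity, can be illustrated with a short sketch. The oversampling method used in the project is not stated, so plain random oversampling (duplicating minority-class rows) is assumed here. The `random_oversample` helper and all sample texts are hypothetical.

```python
import random
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen rows until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(rows) for rows in by_class.values())
    out_texts, out_labels = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        out_texts.extend(rows + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Hypothetical toy data: three "American" recipes and one "Asian" recipe.
texts = ["beef burger bun", "apple pie butter", "mac cheese", "soy ginger noodle"]
labels = ["American", "American", "American", "Asian"]
balanced_texts, balanced_labels = random_oversample(texts, labels)
class_counts = Counter(balanced_labels)

# One document per cuisine: all of that cuisine's text joined together (invented here).
corpora = {
    "American": "beef burger bun apple pie butter cheese",
    "Asian": "soy ginger noodle rice sesame",
    "European": "butter cheese pasta olive tomato",
}
tfidf = TfidfVectorizer().fit_transform(list(corpora.values()))

# Square matrix of pairwise similarities; rows and columns follow the order of `corpora`.
similarity = cosine_similarity(tfidf)
```

In the similarity matrix, values near 0 mean two cuisine corpora share almost no vocabulary, which is what "distinct" means in the heat map. Oversampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.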

Extensive grid search to fine-tune our model accuracy

We conducted an extensive grid search to find the best model and its best hyper-parameters. The algorithms searched included Random Forest, K-Nearest Neighbors, Support Vector Classifier, and XGBoost Classifier. Our target was a multi-class, vector-valued output (each prediction maps an input vector to an output vector), which made this a challenging task given the high dimensionality of our matrix.
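A grid search over several model families can be sketched with scikit-learn's `GridSearchCV`. The parameter grids searched in the project are not listed, so the grids below are small and hypothetical, and the data is synthetic. XGBoost is left out to keep the sketch dependent on scikit-learn only, and a single-label target is used for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized recipe matrix.
X, y = make_classification(
    n_samples=120, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Hypothetical, deliberately small grids for each candidate algorithm.
candidates = {
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [20, 50], "max_depth": [5, None]},
    ),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
    "svc": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # 3-fold cross-validation over every parameter combination in the grid.
    search = GridSearchCV(estimator, grid, cv=3, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

# Name of the candidate with the highest cross-validated accuracy.
best_model = max(results, key=lambda name: results[name][0])
```

Accuracy is used as the scoring metric only because the section title mentions it. With imbalanced classes, a metric such as macro-averaged F1 may be more informative.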

<p align="left"> <img src="project3_visuals/Screen Shot 2018-10-12 at 2.43.11 PM.png" title="snapshot gridsearch "> </p>

Prediction Model 1: Using name and ingredients to predict cuisine type

Category keys: American = 0, European = 1, Asian = 2, Middle Eastern/African = 3, Unknown = 4, South American = 5

BaggingClassifier(DecisionTreeClassifier(criterion='gini', max_depth=100), n_estimators=150)
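As a rough illustration of how this classifier could be wired to text input, the sketch below places it behind a count vectorizer in a scikit-learn pipeline. The training texts, labels, and the `CUISINE_KEYS` dictionary name are hypothetical, and `random_state` is added only to make the sketch reproducible.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Category keys as listed above.
CUISINE_KEYS = {
    0: "American", 1: "European", 2: "Asian",
    3: "Middle Eastern/African", 4: "Unknown", 5: "South American",
}

# Hypothetical toy training data (recipe name plus ingredients, already cleaned).
train_texts = [
    "cheeseburger beef bun cheddar", "apple pie apple butter cinnamon",
    "pad thai rice noodle fish sauce", "miso soup tofu seaweed miso",
    "harira lentil chickpea cumin", "tagine lamb apricot cumin",
]
train_labels = [0, 0, 2, 2, 3, 3]

model = make_pipeline(
    CountVectorizer(),
    BaggingClassifier(
        DecisionTreeClassifier(criterion="gini", max_depth=100),
        n_estimators=150,
        random_state=0,  # not part of the original configuration
    ),
)
model.fit(train_texts, train_labels)

# Predict the numeric key for a new recipe, then map it back to a cuisine name.
predicted_key = int(model.predict(["lentil soup chickpea cumin"])[0])
predicted_cuisine = CUISINE_KEYS[predicted_key]
```

Bundling the vectorizer and the classifier into one pipeline ensures that new recipes are vectorized with the same vocabulary that was learned during training.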

<p align="left"> <img src="project3_visuals/Screen Shot 2018-10-12 at 2.42.51 PM.png" title="confusion matrix "> </p>

Our model predicted that most recipes belong in the 'Unknown' category.

Prediction Model 2: Using name, keywords and recipe to predict rating

Rating Scale: 0-4

RandomForestClassifier(n_estimators=100, max_depth=100)

<p align="left"> <img src="project3_visuals/Screen Shot 2018-10-12 at 2.43.59 PM.png" title="confusion matrix"> </p>

Most of the model's correct predictions were for recipes with a rating of 0 (predicted 0, actual 0). On the other hand, the model also predicted a rating of 0 for a relatively large proportion of recipes whose actual rating was 3.
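To make the reading of such a confusion matrix concrete, the sketch below builds one from a handful of invented ratings. The numbers are hypothetical and are not the project's results. They only show where the "actual 3, predicted 0" cell sits.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ratings, invented only to show how to read the matrix.
actual = [0, 0, 0, 3, 3, 3, 4, 4]
predicted = [0, 0, 0, 0, 0, 3, 4, 0]

labels = [0, 1, 2, 3, 4]
# Rows are actual ratings, columns are predicted ratings.
cm = confusion_matrix(actual, predicted, labels=labels)

correct_zero = cm[0, 0]   # actual 0, predicted 0
three_as_zero = cm[3, 0]  # actual 3, predicted 0

# Share of actual-3 recipes that were predicted as 0.
share_three_as_zero = three_as_zero / cm[3].sum()
```

Dividing a cell by its row total, as in the last line, gives the proportion of recipes with a given actual rating that received a given prediction.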

Model Performance Demonstration

Example 1: Predict recipe cuisine type

recipe link: https://www.allrecipes.com/recipe/13443/harira/

Predicted cuisine type: African / Actual cuisine type: Middle Eastern / African

Input and prediction snapshot:

<p align="left"> <img src="project3_visuals/Screen Shot 2018-10-12 at 2.43.29 PM.png" title="cusine type prediction demo 1"> </p>

Example 2: Predict recipe rating

recipe link: http://allrecipes.asia/recipe/4911/saag-masoor-dal--indian-dhal-with-spinach-.aspx

Predicted rating: 4 / Actual rating: 4

Input and prediction snapshot:

<p align="left"> <img src="project3_visuals/Screen Shot 2018-10-12 at 2.44.25 PM.png" title="cusine type prediction demo 1"> </p>

Conclusion:

Our model was able to make predictions about cuisine type and rating. However, a better understanding of the limitations and unique challenges of our dataset would help us explore further ways to find the best algorithm and the best vectorization method.

Limitation:

Next Step: