charity_recommender

A Kaggle Data Science for Good challenge

NYC Charity Recommender Business Objective:

Data Acquisition:

Data Preparation

Address cold start for recommender systems

Localize the dataset to NYC - addressing the cold start.

Frequency:

Comparing Similarity Matrices on an Implicit Dataset

Modeling

The intuition behind using a collaborative filtering recommender system on an implicit dataset

http://datameetsmedia.com/an-overview-of-recommendation-systems/

By measuring the distance between donors' donation histories, we can assume that a donor may want to give to a charity that someone with a similar donation history has already supported. Donor B's filled-out donation history then serves as a reference for projects that could appear on donor A's list of recommendations. For example, if A and B both choose to eat pizza and ride a bike, and B also likes sports drinks, then chances are A will like sports drinks too, so why not recommend sports drinks to A as well?
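To make that intuition concrete, here is a minimal sketch of the "similar donors" idea using cosine similarity on toy donation-count vectors (the donors, projects, and counts are hypothetical):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two donation-history vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Columns are projects; values are how many times each donor gave to them.
donor_a = np.array([3, 0, 1, 0])  # A supports projects 0 and 2
donor_b = np.array([3, 0, 1, 1])  # B has the same history, plus project 3

# A and B are nearly identical (~0.95), so project 3, which only B has
# supported, is a natural candidate recommendation for A.
print(cosine_similarity(donor_a, donor_b))
```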

Evaluation

In order to evaluate this project, it needs to be deployed.

Deploy

In the real world

Donor preferences do not always match exactly; instead, we measure how similar they are and work with those distances. By applying the Surprise library, we don't have to do all the matrix factorization mathematics that goes into generating the distances between donor preference matrices ourselves, and can instead return a list of projects a donor may choose to donate to.
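As a sketch of what that looks like in practice, here is a minimal Surprise example, assuming a pandas DataFrame of donations with hypothetical donor_id, project_id, and implicit rating columns:

```python
import pandas as pd
from surprise import SVD, Dataset, Reader

# Hypothetical implicit data: the "rating" is a strength-of-preference
# signal derived from donation behavior, not an explicit score.
donations = pd.DataFrame({
    "donor_id":   ["A", "A", "B", "B", "B"],
    "project_id": ["p1", "p3", "p1", "p3", "p4"],
    "rating":     [3, 1, 3, 1, 2],
})

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(donations[["donor_id", "project_id", "rating"]], reader)

# Surprise handles the factorization internally; no hand-built distance matrices.
algo = SVD()
algo.fit(data.build_full_trainset())

# Estimate how strongly donor A might prefer project p4 (so far only B's).
print(algo.predict("A", "p4").est)
```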

In this project, five algorithms were tested to see which recommendations would best suit the donor based on implicit donation history. Each algorithm was tuned with grid search cross-validation (GridSearchCV), comparing Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the time needed to compute recommendations.
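A minimal sketch of that tuning step with Surprise's GridSearchCV, reusing the data object from the snippet above (the algorithm list and parameter grid here are illustrative, not the exact ones used in the project):

```python
from surprise import SVD, SVDpp, NMF
from surprise.model_selection import GridSearchCV

# Illustrative grid; all three algorithms accept these parameters.
param_grid = {"n_factors": [2, 4], "n_epochs": [10, 20]}

for algo_class in (SVD, SVDpp, NMF):
    gs = GridSearchCV(algo_class, param_grid, measures=["rmse", "mae"], cv=3)
    gs.fit(data)
    print(algo_class.__name__, gs.best_score["mae"], gs.best_score["rmse"])
```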

The three algorithms with the best results are listed here:

GridSearchCV was used on matrix factorization techniques to calculate the distance metrics. It turns out SVD++ has the lowest Mean Absolute Error. A boxplot visualization is below:
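For reference, here is a minimal sketch of how such a fold-by-fold MAE boxplot can be generated with Surprise's cross_validate, again reusing the data object from earlier (algorithm list illustrative):

```python
import matplotlib.pyplot as plt
from surprise import SVD, SVDpp, NMF
from surprise.model_selection import cross_validate

algos = {"SVD": SVD(), "SVD++": SVDpp(), "NMF": NMF()}

# Collect per-fold MAE scores for each algorithm.
mae_scores = [
    cross_validate(algo, data, measures=["MAE"], cv=3, verbose=False)["test_mae"]
    for algo in algos.values()
]

plt.boxplot(mae_scores)
plt.xticks(range(1, len(algos) + 1), algos.keys())
plt.ylabel("Mean Absolute Error per fold")
plt.show()
```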

Because the matrix was sparse, the CoClustering and NormalPredictor algorithms were not used. CoClustering works best when comparing a more heavily populated matrix, so the recommendation error metrics it produced here didn't make sense. And because the implicit dataset is not normally distributed, NormalPredictor scored poorly here as well.

Error metrics don't quite fit recommender systems because error metrics are designed for predictions. When you give someone a list of recommendations and they choose one of them, it does not necessarily mean they didn't like the others: they may like them all equally, or they may never have seen the item or project they would most have wanted to donate to. The challenge in interpreting error metrics is that the algorithms project a person's history into the future to estimate what they may like, but without many more data points and real feedback, a recommendation list ultimately has to be tested in live trials.

According to the error metrics, SVD++ yielded the best scores, and it is the algorithm most geared toward an implicit-feedback dataset. NMF is a close second in my opinion. The Google Colab GUI returns a list of 10 projects that a user may like to choose from next. It also keeps track of the new user's preferences and compares the new-user dataset to the rest of the New York donor dataset.
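A minimal sketch of how such a top-10 list can be assembled with Surprise, reusing the data object from the earlier snippets and SVD++ as the best-scoring algorithm (this follows Surprise's standard anti-testset pattern, not necessarily the exact Colab implementation):

```python
from collections import defaultdict
from surprise import SVDpp

trainset = data.build_full_trainset()
algo = SVDpp()
algo.fit(trainset)

# Score every (donor, project) pair the donor has not already supported.
predictions = algo.test(trainset.build_anti_testset())

# Keep the 10 highest-estimated projects per donor.
top_n = defaultdict(list)
for pred in predictions:
    top_n[pred.uid].append((pred.iid, pred.est))
for uid in top_n:
    top_n[uid].sort(key=lambda pair: pair[1], reverse=True)
    top_n[uid] = top_n[uid][:10]

print(top_n["A"])  # candidate next projects for donor A
```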

My next steps