# DaisyRec

## Overview

DaisyRec is a Python toolkit for rating prediction and item ranking tasks.

The name DAISY (roughly :) ) stands for Multi-Dimension fAIrly compArIson for recommender SYstem. The overall framework of DaisyRec is shown below:

<img src="pics/DiasyRec.png" align="center" width="75%" style="margin: 0 auto">

Make sure you have a CUDA environment for acceleration, since the deep-learning models can make use of it.
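As a quick sanity check (assuming the deep-learning models run on PyTorch, as the `--gpu` flags below suggest), you can verify that a CUDA device is visible before training; this snippet is illustrative and not part of DaisyRec itself:

```python
import torch

# Prints False if no CUDA device is visible; models then fall back to CPU.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```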

We will consistently update this repo.


## Datasets

You can download the experiment data and put it into the `data` folder. All data are available via the links below:

## How to run

  1. Make sure to run `python setup.py build_ext --inplace` to compile the dependent extensions before running any other code. Afterwards, you will find the generated `*.so` or `*.pyd` files in `daisy/model/`.

  2. To reproduce the results, run `python data_generator.py` to create the `experiment_data` folder with the public datasets listed in our paper. If you only want to study one particular dataset, modify the code in `data_generator.py` accordingly so that it yields the train and test sets you need. By default, `data_generator.py` generates all variants of the datasets (raw data, 5-core data and 10-core data) with different data splitting methods, including `tloo`, `loo`, `tfo` and `fo`. The meaning of these split methods is explained in the Important Commands section of this README; a conceptual sketch follows this list.

  3. The validation and test codes are stored separately, in the `nested_tune_kit` and `test_kit` folders, respectively. Each script in these folders should be moved into the root path, the same directory as `data_generator.py`, in order to run successfully. Alternatively, if you use an IDE, you can simply set your working path and run from any folder.
  4. The validation dataset is used for parameter tuning, so we provide a `split_validation` interface inside the code in the `nested_tune_kit` folder. More detailed parameter settings for the validation split method are described in `daisy/utils/loader.py`. After validation finishes, the results will be stored in the automatically generated `tune_log/` folder.

  5. Based on the best parameters determined by validation, run the test code that you moved into the root path before; the results will be stored in the automatically generated `res/` folder.

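To make the split methods above concrete, here is a minimal, hypothetical sketch of how the `tfo` and `loo` splits could be implemented with pandas; the real logic lives in `daisy/utils/loader.py`, and column names like `user` and `timestamp` are assumptions for this example:

```python
import pandas as pd

def tfo_split(df: pd.DataFrame, test_size: float = 0.1):
    """Time-aware split by ratio: the newest interactions form the test set."""
    df = df.sort_values('timestamp')
    cut = int(len(df) * (1 - test_size))
    return df.iloc[:cut], df.iloc[cut:]

def loo_split(df: pd.DataFrame):
    """Leave-one-out: hold out one random interaction per user for testing."""
    test = df.groupby('user').sample(n=1, random_state=2022)
    train = df.drop(test.index)
    return train, test
```

The time-aware variants (`tfo`, `tloo`) order interactions by timestamp before splitting, while `fo` and `loo` sample randomly; `ufo` applies the ratio split per user.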

## Examples to run


Take the following case as an example: suppose we want to reproduce the top-20 results for BPR-MF on the ML-1M 10-core dataset (when tuning, we fix the sampling method as uniform).

  1. Assume we have already run `data_generator.py` and obtained the training and test datasets via `tfo` (i.e., the time-aware split-by-ratio method). We should then find the files `train_ml-1m_10core_tfo.dat` and `test_ml-1m_10core_tfo.dat` in `./experiment_data/`. This step is essential!

  2. The whole procedure consists of validation and test. First, we need to run `hp_tune_pair_mf.py` to obtain the best parameter settings; we can also change the parameter search space in `hp_tune_pair_mf.py` if needed. Command to run:

```
python hp_tune_pair_mf.py --dataset=ml-1m --prepro=10core --val_method=tfo --test_method=tfo --topk=20 --loss_type=BPR --sample_method=uniform --gpu=0
```
  3. After finishing step 2, we will get the best parameter settings from `tune_log/` (or we can simply reuse the settings reported in the paper). Then we can run the test code with the command below:
```
python run_pair_mf.py --dataset=ml-1m --prepro=10core --test_method=tfo --topk=20 --loss_type=BPR --num_ng=2 --factors=34 --epochs=50 --lr=0.0005 --lamda=0.0016 --sample_method=uniform --gpu=0
```

More details about the arguments are available in the help message; try:

```
python run_pair_mf.py --help
```

  4. Once step 3 terminates, we can obtain the top-20 results from the dynamically generated result file `./res/ml-1m/10core_tfo_pairmf_BPR_uniform.csv` (a loading sketch follows this list).
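As a quick illustration, assuming the result file is a plain CSV (the actual column names depend on DaisyRec's output format and are not shown here), it can be inspected with pandas:

```python
import pandas as pd

# Result file produced in step 4; adjust the path for other datasets/splits.
res = pd.read_csv('./res/ml-1m/10core_tfo_pairmf_BPR_uniform.csv')

# Print the first rows to inspect the recorded top-20 metrics.
print(res.head())
```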

## More Ranking Results

More ranking results for different methods on different datasets across various settings of top-N (N = 1, 5, 10, 20, 30) are available in `ranking_results.md`.

## Important Commands

The common parameter settings used by the example code are described below:

| Commands | Description on Commands | Choices | Description on Choices |
| --- | --- | --- | --- |
| dataset | the selected datasets | ml-100k;<br>ml-1m;<br>ml-10m;<br>ml-20m;<br>lastfm;<br>bx;<br>amazon-cloth;<br>amazon-electronic;<br>amazon-book;<br>amazon-music;<br>epinions;<br>yelp;<br>citeulike;<br>netflix | all choices are the names of datasets |
| prepro | the data pre-processing method | origin;<br>Ncore | 'origin' means using the raw data;<br>'Ncore' means only preserving users and items that have more than N interactions, where N can be any integer value |
| val_method<br>test_method | train-validation splitting;<br>train-test splitting | ufo<br>fo<br>tfo<br>loo<br>tloo<br>cv | split-by-ratio at user level<br>split-by-ratio<br>time-aware split-by-ratio<br>leave-one-out<br>time-aware leave-one-out<br>cross validation (only applies to val_method) |
| topk | the length of the recommendation list | | |
| test_size | the ratio of the test set | | |
| fold_num | the number of folds used for validation (only applies to 'cv' and 'fo') | | |
| cand_num | the number of candidate items used for ranking | | |
| sample_method | the negative sampling method | uniform<br>item-ascd<br>item-desc | uniform sampling;<br>sampling popular items with low rank;<br>sampling popular items with high rank |
| num_ng | the number of negative samples | | |
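To illustrate the three `sample_method` choices in the table above, here is a hedged sketch of popularity-aware negative sampling; this is not DaisyRec's actual implementation, and `item_pop` (per-item interaction counts) is an assumed input:

```python
import numpy as np

def sample_negatives(item_pop: np.ndarray, num_ng: int, method: str = 'uniform') -> np.ndarray:
    """Draw `num_ng` negative item ids from `len(item_pop)` items.

    'uniform'   -> every item equally likely;
    'item-desc' -> popular items sampled more often (high rank);
    'item-ascd' -> unpopular items sampled more often (low rank).
    """
    n_items = len(item_pop)
    if method == 'uniform':
        prob = np.full(n_items, 1.0 / n_items)
    elif method == 'item-desc':
        prob = item_pop / item_pop.sum()
    elif method == 'item-ascd':
        inv = 1.0 / (item_pop + 1.0)  # +1 avoids division by zero
        prob = inv / inv.sum()
    else:
        raise ValueError(f'unknown sample_method: {method}')
    return np.random.choice(n_items, size=num_ng, p=prob)
```

In a pairwise loss such as BPR, each observed (user, positive item) pair would be matched with `num_ng` items drawn this way.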