## Overview
DaisyRec is a Python toolkit that deals with rating prediction and item ranking issues.
The name DAISY (roughly :) ) stands for Multi-Dimension fAIr compArIson for recommender SYstem. The whole framework of Daisy is shown below:

<img src="pics/DiasyRec.png" align="center" width="75%" style="margin: 0 auto">

Make sure you have a CUDA environment for acceleration, since the deep-learning models can be built on it.
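If you are unsure whether a usable GPU is present, a quick check (assuming a PyTorch installation, which the deep-learning models rely on) is:

```python
import torch

# Report whether PyTorch can see a CUDA device; models fall back to CPU otherwise.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # name of the first visible GPU
```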
We will consistently update this repo.
<!-- A more professional daisyRec could be checked in `dev` branch -->

## Datasets
You can download the experiment data and put it into the `data` folder.
All data are available at the links below:
- MovieLens 100K
- MovieLens 1M
- MovieLens 10M
- MovieLens 20M
- Netflix Prize Data
- Last.fm
- Book Crossing
- Epinions
- CiteULike
- Amazon-Book
- Amazon-Electronic
- Amazon-Cloth
- Amazon-Music
- Yelp Challenge
## How to run

- Make sure to run the command

  ```
  python setup.py build_ext --inplace
  ```

  to compile the dependent extensions before running any other code. Afterwards, you will find generated `*.so` or `*.pyd` files in `daisy/model/`.
- In order to reproduce the results, you need to run

  ```
  python data_generator.py
  ```

  to create the `experiment_data` folder with the public datasets listed in our paper. If you only want to study one particular dataset, modify the code in `data_generator.py` to suit your needs and let it generate the train and test datasets you want. By default, `data_generator.py` generates all kinds of datasets (raw data, 5-core data and 10-core data) with different data splitting methods, including `tloo`, `loo`, `tfo` and `fo`. The meaning of these split methods is explained in the `Important Commands` section of this `README`; a minimal sketch of the `tfo` split is also shown right after this list.
- There is separate code for validation and testing, stored in the `nested_tune_kit` and `test_kit` folders, respectively. Each script in these folders should be moved into the root path (the same directory as `data_generator.py`) in order to run successfully. Alternatively, if you use an IDE, you can simply set your working path and run from any folder.
- The validation dataset is used for parameter tuning, so we provide a `split_validation` interface inside the code in the `nested_tune_kit` folder. Further and more detailed information about the validation split methods is given in `daisy/utils/loader.py`. After validation finishes, the results will be stored in the automatically generated `tune_log/` folder.

- Based on the best parameters determined by validation, run the test code that you moved into the root path before; the results will be stored in the automatically generated `res/` folder.
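For intuition, here is a minimal sketch of what the time-aware split-by-ratio (`tfo`) could look like. This is an illustrative pandas version under assumed column names (`timestamp` in particular) and an assumed default ratio, not the exact implementation in `daisy/utils/loader.py`:

```python
import pandas as pd

def tfo_split(df: pd.DataFrame, test_size: float = 0.2):
    """Time-aware split-by-ratio: the newest test_size fraction of
    interactions goes to the test set, the rest to the train set."""
    df = df.sort_values('timestamp')            # oldest interactions first
    split_idx = int(len(df) * (1 - test_size))
    return df.iloc[:split_idx], df.iloc[split_idx:]

# Hypothetical usage with a ratings frame loaded elsewhere:
# train, test = tfo_split(ratings, test_size=0.2)
```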
## Examples to run

Take the following case as an example: reproducing the top-20 results of BPR-MF on the ML-1M 10-core dataset (when tuning, we fix the sampling method as uniform).
- Assume we have already run `data_generator.py` and obtained the training and test datasets via `tfo` (i.e., the time-aware split-by-ratio method). We should then have files named `train_ml-1m_10core_tfo.dat` and `test_ml-1m_10core_tfo.dat` in `./experiment_data/`. **This step is essential!**

- The whole procedure contains validation and test. Therefore, we first need to run `hp_tune_pair_mf.py` to get the best parameter settings. We may also change the parameter search space in `hp_tune_pair_mf.py`. Command to run:

  ```
  python hp_tune_pair_mf.py --dataset=ml-1m --prepro=10core --val_method=tfo --test_method=tfo --topk=20 --loss_type=BPR --sample_method=uniform --gpu=0
  ```
- After finishing step 2, we will get the best parameter settings from `tune_log/`. Then we can run the test code with the command below:

  ```
  python run_pair_mf.py --dataset=ml-1m --prepro=10core --test_method=tfo --topk=20 --loss_type=BPR --num_ng=2 --factors=34 --epochs=50 --lr=0.0005 --lamda=0.0016 --sample_method=uniform --gpu=0
  ```

  More details of the arguments are available in the help message; try:

  ```
  python run_pair_mf.py --help
  ```
- Once step 3 terminates, we can obtain the top-20 results from the dynamically generated result file `./res/ml-1m/10core_tfo_pairmf_BPR_uniform.csv`. A minimal sketch of the BPR loss used throughout this example follows below.
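As background for the `--loss_type=BPR` flag used above, here is a minimal sketch of the BPR pairwise loss. This is an illustrative PyTorch version, not necessarily the exact implementation shipped in `daisy/model/`:

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Bayesian Personalized Ranking loss: for each (user, positive item,
    negative item) triple, push the positive score above the negative one."""
    return -F.logsigmoid(pos_scores - neg_scores).mean()

# Hypothetical usage with a pairwise MF model scoring a batch of triples:
# loss = bpr_loss(model(users, pos_items), model(users, neg_items))
```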
## More Ranking Results
More ranking results for different methods on different datasets across various settings of top-N (N = 1, 5, 10, 20, 30) are available in `ranking_results.md`.
## Important Commands
The descriptions of all common parameter settings used by the example code are listed below:
Command | Description | Choices | Description of Choices |
---|---|---|---|
dataset | the selected dataset | ml-100k;<br>ml-1m;<br>ml-10m;<br>ml-20m;<br>lastfm;<br>bx;<br>amazon-cloth;<br>amazon-electronic;<br>amazon-book;<br>amazon-music;<br>epinions;<br>yelp;<br>citeulike;<br>netflix | all choices are names of datasets |
prepro | the data pre-processing method | origin;<br>Ncore | 'origin' means using the raw data;<br>'Ncore' means only preserving users and items that have more than N interactions. Note that N can be any integer value |
val_method<br>test_method | train-validation splitting;<br>train-test splitting | ufo;<br>fo;<br>tfo;<br>loo;<br>tloo;<br>cv | split-by-ratio at user level;<br>split-by-ratio;<br>time-aware split-by-ratio;<br>leave-one-out;<br>time-aware leave-one-out;<br>cross validation (only applies to val_method) |
topk | the length of the recommendation list | | |
test_size | the ratio of the test set size | | |
fold_num | the number of folds used for validation (only applies to 'cv' and 'fo') | | |
cand_num | the number of candidate items used for ranking | | |
sample_method | the negative sampling method | uniform;<br>item-ascd;<br>item-desc | uniform sampling;<br>sampling popular items with low rank;<br>sampling popular items with high rank |
num_ng | the number of negative samples | | |
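To make the `Ncore` pre-processing option concrete, here is an illustrative single-pass filter in pandas. The toolkit's own preprocessing may differ (e.g., it may filter repeatedly until the core condition stabilizes), and the column names `user` and `item` are assumptions:

```python
import pandas as pd

def ncore_filter(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Keep only users and items with more than n interactions (single pass;
    repeat until stable to obtain a strict N-core)."""
    user_counts = df['user'].value_counts()
    item_counts = df['item'].value_counts()
    df = df[df['user'].isin(user_counts[user_counts > n].index)]
    df = df[df['item'].isin(item_counts[item_counts > n].index)]
    return df
```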