Bundle Recommendation

This project provides new data sources for product bundling on real e-commerce platforms, covering three domains: Electronic, Clothing and Food. We construct three high-quality bundle datasets with rich meta information, in particular bundle intents, through a carefully designed crowd-sourcing task.

1. Worker Basic Information

Figure 1 shows the distribution of workers' age, education, country, occupation, gender and shopping frequency for the two batches. In particular, 'Others' in the country distribution includes Argentina, Australia, Anguilla, Netherlands, Albania, Georgia, Tunisia, Belgium, Armenia, Guinea, Austria, Switzerland, Iceland, Lithuania, Egypt, Venezuela, Bangladesh, American Samoa, Vanuatu, Colombia, United Arab Emirates, Ashmore and Cartier Island, Estados Unidos, Wales, Turkey, Angola, Scotland, Philippines, Iran and Bahamas.


<p align="center">Figure 1: Worker basic information in the first and second batches.</p>

2. Feedback from Workers

Figure 2 visualizes worker feedback on our task, where (a) shows the feedback regarding the difficulty of identifying bundles and naming their corresponding intents for the two batches; and (b-c) depict the general feedback for the two batches.

<p align="center"> <img src="img/worker_feedback.png" width="90%" height="90%"> </p> <p align="center">Figure 2: Feedback from workers for the two batches.</p>

3. Bundle Detection

Source code

Parameter Settings

A grid search over {0.0001, 0.001, 0.01} is applied to find the optimal settings for support and confidence; both are set to 0.001 across the three domains.
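Support and confidence here play their standard association-rule roles (pair frequency among sessions, and pair frequency relative to the antecedent item). A minimal sketch of filtering co-occurring item pairs by these two thresholds; the `mine_pairs` helper and the toy sessions are illustrative, not the project's actual implementation:

```python
from itertools import combinations
from collections import Counter

def mine_pairs(sessions, min_support=0.001, min_confidence=0.001):
    """Keep item pairs whose support and confidence clear the thresholds."""
    n = len(sessions)
    item_count = Counter()
    pair_count = Counter()
    for items in sessions:
        unique = set(items)
        item_count.update(unique)
        pair_count.update(combinations(sorted(unique), 2))
    rules = []
    for (a, b), c in pair_count.items():
        support = c / n  # fraction of sessions containing both items
        if support < min_support:
            continue
        # confidence of the rule a -> b; the rule b -> a is analogous
        confidence = c / item_count[a]
        if confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules
```

With realistic session data the thresholds of 0.001 admit many candidate pairs, which is why they are tuned by grid search rather than fixed a priori.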

4. Bundle Completion

Source code

Parameter Settings

The dimension d of item and bundle representations is set to 20 for all methods. Grid search is adopted to find the best settings of the other key parameters. In particular, the learning rate and regularization coefficient are searched in {0.0001, 0.001, 0.01}; the number of neighbors K in ItemKNN is searched in {10, 20, 30, 50}; the weight of the KL divergence in VAE is searched in {0.001, 0.01, 0.1}; the hidden layer sizes are searched in {50, 100, 200}; the batch size is searched in {64, 128, 256}; the number of heads (i.e., n_heads) for TSF is searched in {1, 2, 4}; and the number of layers for TSF is searched in {1, 2, 3}. The optimal parameter settings are shown in Table 1.

<p align="center">Table 1: Parameter settings for bundle completion (d=20).</p>

| Method | Electronic | Clothing | Food |
|--------|------------|----------|------|
| ItemKNN | equation | equation | equation |
| BPRMF | equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation |
| mean-VAE | equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation |
| concat-VAE | equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation |
| concat-CVAE | equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation |
| TSF | equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation |
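The grid search used throughout these experiments is an exhaustive sweep over the Cartesian product of candidate values. A minimal sketch, where `evaluate` is a hypothetical stand-in for training a model and measuring its validation performance:

```python
from itertools import product

# Candidate values taken from the text; other parameters are searched the same way.
grid = {
    "lr": [0.0001, 0.001, 0.01],
    "reg": [0.0001, 0.001, 0.01],
    "batch_size": [64, 128, 256],
}

def grid_search(grid, evaluate):
    """Return the parameter combination with the highest validation score."""
    best_score, best_params = float("-inf"), None
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

For the grid above this trains 3 × 3 × 3 = 27 configurations per model and domain; the per-model grids in the text add further axes (e.g., K for ItemKNN, n_heads for TSF).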

5. Bundle Ranking

Source code

Parameter Settings

The dimension d of representations is set to 20. We apply the same grid search for the learning rate, regularization coefficient and batch size as in bundle completion. Besides, the size of the predictive layer D for AttList is searched in {20, 50, 100}; the node and message dropout rates for GCN and BGCN are searched in {0, 0.1, 0.3, 0.5}. As the training complexity of GCN and BGCN is quite high, we set their batch size to 2048 as suggested by the original paper. The optimal parameter settings are presented in Table 2. Note that the parameter settings for BGCN are for the version without pre-training.

<p align="center">Table 2: Parameter settings for bundle ranking (d=20).</p>

| Method | Electronic | Clothing | Food |
|--------|------------|----------|------|
| ItemKNN | equation | equation | equation |
| BPRMF | equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation |
| DAM | equation<br>equation<br>equation | equation<br>equation<br>equation | equation<br>equation<br>equation |
| AttList | equation<br>equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation<br>equation |
| GCN | equation<br>equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation<br>equation |
| BGCN | equation<br>equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation<br>equation |
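Several of the baselines above (e.g., BPRMF) optimize a pairwise BPR objective on user-bundle interactions: an observed bundle should score higher for a user than a randomly sampled one. A minimal pure-Python sketch; the function name, toy embeddings and regularization handling are illustrative, not any baseline's exact implementation:

```python
import math

def bpr_loss(user_emb, pos_emb, neg_emb, reg=0.001):
    """Mean pairwise BPR loss over a batch of (user, positive, negative) triples."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    loss = 0.0
    for u, p, n in zip(user_emb, pos_emb, neg_emb):
        diff = dot(u, p) - dot(u, n)          # score margin for this triple
        loss += -math.log(1.0 / (1.0 + math.exp(-diff)))  # -log sigmoid(margin)
        loss += reg * (dot(u, u) + dot(p, p) + dot(n, n))  # L2 regularization
    return loss / len(user_emb)
```

The regularization coefficient `reg` and the learning rate of the optimizer wrapping this loss are exactly the quantities tuned by the grid search described above.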

6. Bundle Generation Explanation

Source code

Please refer to the code.

Parameter Settings

For LSTM, BiLSTM and Transformer, the dimension of word embeddings is 300; the learning rate is searched in {0.0001, 0.001, 0.01}; the batch size is searched in {16, 32, 64}; the hidden size is searched in {128, 256, 512}; the number of heads (i.e., nhead) for Transformer is searched in the range [1, 8] with a step of 1; and the number of encoder/decoder layers is searched in {1, 2, 3, 4}. For the pre-trained models, i.e., BertGeneration, BART-base and T5-base, the maximum input length of the encoder is set to 512 and the maximum output length of the decoder is set to 64; the learning rate is searched in {0.00002, 0.00005, 0.00007, 0.0001}; and the number of epochs is searched in {3, 4, 5}. The optimal parameter settings are shown in Table 3.

<p align="center">Table 3: Parameter settings for bundle generation explanation.</p>

| Method | Optimal Parameter Settings |
|--------|----------------------------|
| LSTM | equation<br>equation |
| BiLSTM | equation<br>equation |
| Transformer | equation<br>equation |
| BertGeneration | equation |
| BART-base | equation |
| T5-base | equation |
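The encoder/decoder length limits above are enforced by truncating token sequences before they reach the model. A small sketch of that budget; the constant names and `truncate` helper are illustrative, not the project's code:

```python
MAX_SOURCE_LEN = 512  # maximum encoder input length from the text
MAX_TARGET_LEN = 64   # maximum decoder output length from the text

def truncate(token_ids, limit):
    """Clip a token-id sequence to the model's maximum length."""
    return token_ids[:limit]
```

In a Hugging Face pipeline these limits would typically be passed to the tokenizer (for the source side) and to generation (for the target side), so that explanations longer than 64 tokens are cut off rather than crashing the decoder.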

7. Bundle Ranking Explanation

Source code

Parameter Settings

For RM, we apply a grid search in {0.0001, 0.001, 0.01, 0.1} for support and confidence, and in {1, 2, 3, 4} for lift, to find their optimal settings. For EFM, the regularization coefficients are searched in the range (0, 1] with a step of 0.1, while the remaining coefficients are searched in {0.0001, 0.001, 0.01, 0.1}; the total number of factors r is searched in {20, 50, 100}; the ratio of explicit factors is searched in the range [0, 1] with a step of 0.1; and the number of most-cared features k is searched in [10, 100] with a step of 10. For PGPR and KGAT, we apply the same grid search for the learning rate, batch size, and the node and message dropout rates as in bundle ranking; the dimension of representations d is searched in {20, 50, 100}; the action space and the weight of the entropy loss for PGPR are searched in {100, 200, 300} and {0.0001, 0.001, 0.01}, respectively; the sampling sizes at the three steps of PGPR are searched in {20, 25, 30}, {5, 10, 15} and {1}, respectively; and the regularization coefficient for KGAT is searched in {0.0001, 0.001, 0.01, 0.1}. The optimal parameter settings are shown in Table 4.

<p align="center">Table 4: Parameter settings for bundle ranking explanation.</p>

| Method | Electronic | Clothing | Food |
|--------|------------|----------|------|
| RM | equation<br>equation<br>equation | equation<br>equation<br>equation | equation<br>equation<br>equation |
| EFM | equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation |
| PGPR | equation<br>equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation<br>equation |
| KGAT | equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation | equation<br>equation<br>equation<br>equation<br>equation<br>equation |
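For RM, lift complements support and confidence by normalizing for the popularity of the consequent item (lift > 1 means the pair co-occurs more often than chance). A small sketch of all three metrics for one rule; the `rule_metrics` helper and toy transactions are illustrative:

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent in t and consequent in t)
    a_count = sum(1 for t in transactions if antecedent in t)
    c_count = sum(1 for t in transactions if consequent in t)
    support = both / n
    confidence = both / a_count if a_count else 0.0
    # lift = confidence / P(consequent): how much the antecedent raises the odds
    lift = confidence / (c_count / n) if c_count else 0.0
    return support, confidence, lift
```

The grid search in the text then thresholds these three quantities jointly (support and confidence in {0.0001, ..., 0.1}, lift in {1, ..., 4}).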

8. Bundle Auto-Naming

Source code

Please refer to the code.

Parameter Settings

For the ImageCap model, the maximum output length of the decoder is set to 64; the learning rate is searched in {0.00002, 0.00005, 0.00007, 0.0001}; and the number of epochs is searched in {3, 4, 5}. The optimal parameter settings are shown in Table 5.

<p align="center">Table 5: Parameter settings for bundle auto-naming.</p>

| Method | Optimal Parameter Settings |
|--------|----------------------------|
| ImageCap | equation |

9. Statistics of Datasets

<p align="center">Table 6: Statistics of datasets.</p>

| | Electronic | Clothing | Food |
|---|---|---|---|
| #Users | 888 | 965 | 879 |
| #Items | 3499 | 4487 | 3767 |
| #Sessions | 1145 | 1181 | 1161 |
| #Bundles | 1750 | 1910 | 1784 |
| #Intents | 1422 | 1466 | 1156 |
| Average Bundle Size | 3.52 | 3.31 | 3.58 |
| #User-Item Interactions | 6165 | 6326 | 6395 |
| #User-Bundle Interactions | 1753 | 1912 | 1785 |
| Density of User-Item Interactions | 0.20% | 0.15% | 0.19% |
| Density of User-Bundle Interactions | 0.11% | 0.10% | 0.11% |
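The density rows in Table 6 follow directly from the counts: density = #interactions / (#users × #items or #bundles). A quick check for the Electronic domain using the numbers above (the `density` helper is just for illustration):

```python
def density(num_interactions, num_rows, num_cols):
    """Fraction of the interaction matrix that is observed."""
    return num_interactions / (num_rows * num_cols)

# Electronic domain, numbers from Table 6
ui = density(6165, 888, 3499)  # user-item: ~0.20%
ub = density(1753, 888, 1750)  # user-bundle: ~0.11%
```

Both values round to the percentages reported in the table, which also confirms the three datasets are highly sparse.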

10. Descriptions of Data Files

Under the `dataset` folder, there are three domains: clothing, electronic and food. Each domain contains the data files described below.

<p align="center">Table 7: The descriptions of the data files.</p>

| File Name | Description |
|-----------|-------------|
| user_item_pretrain.csv | This file contains the user-item interactions used to obtain pre-trained item representations via BPRMF for model initialization.<br>A tab-separated list with 3 columns: user ID \| item ID \| timestamp |
| user_item.csv | This file contains the user-item interactions.<br>A tab-separated list with 3 columns: user ID \| item ID \| timestamp |
| session_item.csv | This file contains sessions and their associated items. Each session has at least 2 items.<br>A tab-separated list with 2 columns: session ID \| item ID |
| user_session.csv | This file contains users and their associated sessions.<br>A tab-separated list with 3 columns: user ID \| session ID \| timestamp |
| session_bundle.csv | This file contains sessions and their detected bundles. Each session has at least 1 bundle.<br>A tab-separated list with 2 columns: session ID \| bundle ID<br>A session ID that appears in session_item.csv but not in session_bundle.csv indicates that no bundle was detected in that session. |
| bundle_intent.csv | This file contains bundles and their annotated intents.<br>A tab-separated list with 2 columns: bundle ID \| intent |
| bundle_item.csv | This file contains bundles and their associated items. Each bundle has at least 2 items.<br>A tab-separated list with 2 columns: bundle ID \| item ID |
| user_bundle.csv | This file contains the user-bundle interactions.<br>A tab-separated list with 3 columns: user ID \| bundle ID \| timestamp |
| item_categories.csv | This file contains items and their associated categories.<br>A tab-separated list with 2 columns: item ID \| categories<br>Values in the categories column are formatted as lists of strings. |
| item_titles.csv | This file contains items and their titles.<br>A tab-separated list with 2 columns: item ID \| titles |
| item_idx_mapping.csv | This file contains items and their source IDs in the Amazon datasets.<br>A tab-separated list with 2 columns: item ID \| source ID |
| user_idx_mapping.csv | This file contains users and their source IDs in the Amazon datasets.<br>A tab-separated list with 2 columns: user ID \| source ID |
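Since every file is tab-separated, each one can be read with the standard csv module. A minimal loader sketch; the `load_tsv` helper is illustrative, it assumes no header row, and the column names are the ones listed in Table 7:

```python
import csv

def load_tsv(lines, columns):
    """Parse a tab-separated dataset file into a list of row dicts.

    `lines` is any iterable of text lines (e.g. an open file);
    `columns` are the column names from Table 7.
    """
    reader = csv.reader(lines, delimiter="\t")
    return [dict(zip(columns, row)) for row in reader]
```

For example, rows of bundle_item.csv would be read with `columns=["bundle ID", "item ID"]`, and user_item.csv with `columns=["user ID", "item ID", "timestamp"]`.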

Cite

Please cite the following papers if you use our dataset in a research paper:

@inproceedings{sun2022revisiting,
  title={Revisiting Bundle Recommendation: Datasets, Tasks, Challenges and Opportunities for Intent-Aware Product Bundling},
  author={Sun, Zhu and Yang, Jie and Feng, Kaidong and Fang, Hui and Qu, Xinghua and Ong, Yew Soon},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}

@article{sun2024revisiting,
  title={Revisiting Bundle Recommendation for Intent-aware Product Bundling},
  author={Sun, Zhu and Feng, Kaidong and Yang, Jie and Fang, Hui and Qu, Xinghua and Ong, Yew-Soon and Liu, Wenyuan},
  journal={ACM Transactions on Recommender Systems},
  year={2024},
  publisher={ACM New York, NY}
}

Acknowledgements

Our datasets are constructed on the basis of Amazon datasets (http://jmcauley.ucsd.edu/data/amazon/links.html).

All pre-trained models in bundle generation explanation and bundle auto-naming are implemented based on Hugging Face (https://huggingface.co/).

All seq2seq models in bundle generation explanation are implemented based on PyTorch tutorials (https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html).