Bundle Recommendation
This project aims to provide new data sources for product bundling in real e-commerce platforms with the domains of Electronic, Clothing and Food. We construct three high-quality bundle datasets with rich meta information, particularly bundle intents, through a carefully designed crowd-sourcing task.
1. Worker Basic Information
Figure 1 shows the distribution of workers' age, education, country, occupation, gender and shopping frequency for the two batches. In particular, 'Others' in the country distribution includes Argentina, Australia, Anguilla, Netherlands, Albania, Georgia, Tunisia, Belgium, Armenia, Guinea, Austria, Switzerland, Iceland, Lithuania, Egypt, Venezuela, Bangladesh, American Samoa, Vanuatu, Colombia, United Arab Emirates, Ashmore and Cartier Islands, United States, Wales, Turkey, Angola, Scotland, Philippines, Iran and Bahamas.
<p align="center">Figure 1: Worker basic information in the first and second batches.</p>

2. Feedback from Workers
Figure 2 visualizes worker feedback on our task: (a) shows feedback regarding the difficulty of identifying bundles and naming the corresponding intents for the two batches, while (b-c) depict the general feedback for the two batches.
<p align="center"> <img src="img/worker_feedback.png" width="90%" height="90%"> </p> <p align="center">Figure 2: Feedback from workers for the two batches.</p>

3. Bundle Detection
Source code
- Pattern Mining (code)
Parameter Settings
A grid search over {0.0001, 0.001, 0.01} is applied to find the optimal settings for support and confidence; both are set to 0.001 across the three domains.
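The frequent-pattern mining behind bundle detection can be illustrated with a minimal sketch that counts item pairs and filters them by the support and confidence thresholds above. The session data and threshold values here are toy stand-ins, not our dataset:

```python
from itertools import combinations
from collections import Counter

def mine_rules(sessions, min_support=0.001, min_confidence=0.001):
    """Mine item pairs whose support and confidence exceed the thresholds."""
    n = len(sessions)
    item_count = Counter()
    pair_count = Counter()
    for items in sessions:
        unique = set(items)
        item_count.update(unique)
        pair_count.update(combinations(sorted(unique), 2))
    rules = []
    for (a, b), cnt in pair_count.items():
        support = cnt / n
        if support < min_support:
            continue
        # Check confidence for both directions: a -> b and b -> a.
        for head, tail, head_cnt in ((a, b, item_count[a]), (b, a, item_count[b])):
            confidence = cnt / head_cnt
            if confidence >= min_confidence:
                rules.append((head, tail, support, confidence))
    return rules

# Toy sessions; real thresholds in the paper are 0.001 for both metrics.
sessions = [["phone", "case"], ["phone", "case", "charger"], ["charger"]]
rules = mine_rules(sessions, min_support=0.5, min_confidence=0.8)
```

Here only the (phone, case) pair passes both thresholds, in both rule directions.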
4. Bundle Completion
Source code
Parameter Settings
The dimension d of item and bundle representations is set to 20 for all methods. Grid search is adopted to find the best settings for the other key parameters. In particular, the learning rate and the regularization coefficient are searched in {0.0001, 0.001, 0.01}; the number of neighbors K in ItemKNN is searched in {10, 20, 30, 50}; the weight of the KL divergence in VAE is searched in {0.001, 0.01, 0.1}; the hidden layer sizes are searched in {50, 100, 200}; the batch size is searched in {64, 128, 256}; the number of heads (i.e., n_heads) for TSF is searched in {1, 2, 4}; and the number of layers for TSF is searched in {1, 2, 3}. The optimal parameter settings are shown in Table 1.
Table 1: Parameter settings for bundle completion (d=20).
| | Electronic | Clothing | Food |
|---|---|---|---|
| ItemKNN | | | |
| BPRMF | | | |
| mean-VAE | | | |
| concat-VAE | | | |
| concat-CVAE | | | |
| TSF | | | |
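The grid search described above can be sketched generically with an exhaustive sweep over the Cartesian product of the search ranges. The search space mirrors the ranges reported above, but the scoring function is an illustrative stand-in, not our actual training loop:

```python
from itertools import product

# Search space mirroring the ranges reported above (illustrative only).
search_space = {
    "learning_rate": [0.0001, 0.001, 0.01],
    "reg_coefficient": [0.0001, 0.001, 0.01],
    "batch_size": [64, 128, 256],
}

def grid_search(space, evaluate):
    """Try every combination and return the one maximizing the metric."""
    best_config, best_score = None, float("-inf")
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        config = dict(zip(keys, values))
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_metric(config):
    # Stand-in for a validation metric: prefers lr=0.001 and small batches.
    return -abs(config["learning_rate"] - 0.001) - config["batch_size"] / 1e6

best, _ = grid_search(search_space, toy_metric)
```

In practice `evaluate` would train the model with the given configuration and return a validation metric such as Recall or NDCG.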
5. Bundle Ranking
Source code
Parameter Settings
The dimension d of representations is set to 20. We apply the same grid search for the learning rate, regularization coefficient and batch size as in bundle completion. Besides, the size of the predictive layer D for AttList is searched in {20, 50, 100}; the node and message dropout rates for GCN and BGCN are searched in {0, 0.1, 0.3, 0.5}. As the training complexity of GCN and BGCN is quite high, we set their batch size to 2048 as suggested by the original paper. The optimal parameter settings are presented in Table 2. Note that the parameter settings for BGCN are for the version without pre-training.
Table 2: Parameter settings for bundle ranking (d=20).
| | Electronic | Clothing | Food |
|---|---|---|---|
| ItemKNN | | | |
| BPRMF | | | |
| DAM | | | |
| AttList | | | |
| GCN | | | |
| BGCN | | | |
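BPRMF, used as a baseline in both completion and ranking, optimizes a pairwise objective: the score of an observed (positive) bundle should exceed that of an unobserved (negative) one. A minimal sketch of the BPR loss for a single (user, positive, negative) triple, with made-up low-dimensional embeddings (d=20 in our setup, shortened here):

```python
import math

def dot(u, v):
    """Inner product of two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def bpr_loss(user, pos, neg, reg=0.001):
    """Negative log-sigmoid of the score margin, plus L2 regularization."""
    margin = dot(user, pos) - dot(user, neg)
    l2 = sum(x * x for vec in (user, pos, neg) for x in vec)
    return -math.log(1.0 / (1.0 + math.exp(-margin))) + reg * l2

# Toy 4-dimensional embeddings (made up for illustration).
user = [0.1, 0.2, 0.0, 0.3]
pos_bundle = [0.2, 0.1, 0.1, 0.4]
neg_bundle = [-0.1, 0.0, 0.2, -0.2]
loss = bpr_loss(user, pos_bundle, neg_bundle)
```

Swapping the positive and negative bundles flips the sign of the margin and yields a larger loss, which is what drives the ranking behaviour.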
6. Bundle Generation Explanation
Source code
Please refer to code
Parameter Settings
For LSTM, BiLSTM and Transformer, the dimension of word embeddings is set to 300; the learning rate is searched in {0.0001, 0.001, 0.01}; the batch size is searched in {16, 32, 64}; the hidden size is searched in {128, 256, 512}; the number of heads (i.e., nhead) for Transformer is searched in [1, 8] with a step of 1; and the number of encoder/decoder layers is searched in {1, 2, 3, 4}. For the pre-trained models, i.e., BertGeneration, BART-base and T5-base, the maximum length of the encoder is set to 512 and the maximum length of the decoder to 64; the learning rate is searched in {0.00002, 0.00005, 0.00007, 0.0001}; and the number of epochs is searched in {3, 4, 5}. The optimal parameter settings are shown in Table 3.
Table 3: Parameter settings for bundle generation explanation.
| | Optimal Parameter Settings |
|---|---|
| LSTM | |
| BiLSTM | |
| Transformer | |
| BertGeneration | |
| BART-base | |
| T5-base | |
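The decoder-side maximum length of 64 caps autoregressive generation: decoding stops at an end-of-sequence token or at the length limit, whichever comes first. A minimal greedy-decoding sketch; the stub `next_token` model and the token IDs are made-up stand-ins for a trained decoder:

```python
def greedy_decode(next_token, bos=0, eos=1, max_len=64):
    """Generate token IDs greedily until EOS or the length cap is reached."""
    tokens = [bos]
    while len(tokens) < max_len:
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == eos:
            break
    return tokens

# Stub model: emits token 5 three times, then EOS (purely illustrative).
def next_token(prefix):
    return 5 if len(prefix) < 4 else 1

out = greedy_decode(next_token)
```

A model that never emits EOS is still cut off at 64 tokens, so an explanation can never grow without bound.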
7. Bundle Ranking Explanation
Source code
Parameter Settings
For RM, we apply a grid search in {0.0001, 0.001, 0.01, 0.1} for support and confidence, and in {1, 2, 3, 4} for lift, to find their optimal settings. For EFM, two of the regularization coefficients are searched in the range (0, 1] with a step of 0.1, while the remaining ones are searched in {0.0001, 0.001, 0.01, 0.1}; the total number of factors r is searched in {20, 50, 100}; the ratio of explicit factors is searched in [0, 1] with a step of 0.1; and the number of most-cared features k is searched in [10, 100] with a step of 10. For PGPR and KGAT, we apply the same grid search for the learning rate, batch size, and node and message dropout rates as in bundle ranking; the dimension d of representations is searched in {20, 50, 100}; the action space and the weight of the entropy loss for PGPR are searched in {100, 200, 300} and {0.0001, 0.001, 0.01}, respectively; the sampling sizes at the three steps for PGPR are searched in {20, 25, 30}, {5, 10, 15} and {1}, respectively; and the regularization coefficient for KGAT is searched in {0.0001, 0.001, 0.01, 0.1}. The optimal parameter settings are shown in Table 4.
Table 4: Parameter settings for bundle ranking explanation.
| | Electronic | Clothing | Food |
|---|---|---|---|
| RM | | | |
| EFM | | | |
| PGPR | | | |
| KGAT | | | |
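For RM, lift complements support and confidence by normalizing against the base popularity of the consequent: a lift above 1 indicates a positive association rather than mere co-popularity. A minimal sketch of the three rule metrics for an A → B rule over toy transactions (the data is made up):

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    n = len(transactions)
    has_a = sum(1 for t in transactions if antecedent in t)
    has_b = sum(1 for t in transactions if consequent in t)
    has_ab = sum(1 for t in transactions if antecedent in t and consequent in t)
    support = has_ab / n
    confidence = has_ab / has_a
    lift = confidence / (has_b / n)  # > 1 means positive association
    return support, confidence, lift

# Toy transaction sets, purely for illustration.
transactions = [{"bread", "butter"}, {"bread", "butter", "milk"},
                {"milk"}, {"bread"}]
support, confidence, lift = rule_metrics(transactions, "bread", "butter")
```

Here the rule bread → butter has support 0.5, confidence 2/3 and lift 4/3, so it would pass a lift threshold of 1 but not of 2.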
8. Bundle Auto-Naming
Source code
Please refer to code
Parameter Settings
For the ImageCap model, the maximum length of the decoder is set to 64; the learning rate is searched in {0.00002, 0.00005, 0.00007, 0.0001}; and the number of epochs is searched in {3, 4, 5}. The optimal parameter settings are shown in Table 5.
Table 5: Parameter settings for bundle auto-naming.
| | Optimal Parameter Settings |
|---|---|
| ImageCap | |
9. Statistics of Datasets
Table 6: Statistics of datasets.
| | Electronic | Clothing | Food |
|---|---|---|---|
| #Users | 888 | 965 | 879 |
| #Items | 3499 | 4487 | 3767 |
| #Sessions | 1145 | 1181 | 1161 |
| #Bundles | 1750 | 1910 | 1784 |
| #Intents | 1422 | 1466 | 1156 |
| Average Bundle Size | 3.52 | 3.31 | 3.58 |
| #User-Item Interactions | 6165 | 6326 | 6395 |
| #User-Bundle Interactions | 1753 | 1912 | 1785 |
| Density of User-Item Interactions | 0.20% | 0.15% | 0.19% |
| Density of User-Bundle Interactions | 0.11% | 0.10% | 0.11% |
10. Descriptions of Data Files
Under the 'dataset' folder, there are three domains: clothing, electronic and food. Each domain contains the following 12 data files.
<p align="center">Table 7: The descriptions of the data files.</p>

File Name | Descriptions |
---|---|
user_item_pretrain.csv | This file contains the user-item interactions aiming to obtain the pre-trained item representations via BPRMF for model initialization.<br> This is a tab separated list with 3 columns: user ID | item ID | timestamp | <!--<br>The user IDs are the ones used in the `user_bundle.csv` and `user_item.csv` data sets. The item IDs are the ones used in the `user_item.csv`, `session_item.csv` and `item_categories.csv` data sets.--> |
user_item.csv | This file contains the user-item interactions.<br> This is a tab separated list with 3 columns: user ID | item ID | timestamp | |
session_item.csv | This file contains sessions and their associated items. Each session has at least 2 items.<br> This is a tab separated list with 2 columns: session ID | item ID | <!--<br>The session IDs are the ones used in the `session_bundle.csv` and `user_session.csv` data sets.--> |
user_session.csv | This file contains users and their associated sessions.<br> This is a tab separated list with 3 columns: user ID | session ID | timestamp | |
session_bundle.csv | This file contains sessions and their detected bundles. Each session has at least 1 bundle.<br> This is a tab separated list with 2 columns: session ID | bundle ID | <!--<br>The bundle IDs are the ones used in the `bundle_item.csv`, `user_bundle.csv` and `bundle_intent.csv` data sets.--> <br>A session ID contained in session_item.csv but not in session_bundle.csv indicates that no bundle was detected in that session. |
bundle_intent.csv | This file contains bundles and their annotated intents.<br> This is a tab separated list with 2 columns: bundle ID | intent | |
bundle_item.csv | This file contains bundles and their associated items. Each bundle has at least 2 items.<br> This is a tab separated list with 2 columns: bundle ID | item ID | |
user_bundle.csv | This file contains the user-bundle interactions.<br> This is a tab separated list with 3 columns: user ID | bundle ID | timestamp | |
item_categories.csv | This file contains items and their affiliated categories.<br> This is a tab separated list with 2 columns: item ID | categories | <br> The data in the categories column is formatted as a list of strings. |
item_titles.csv | This file contains items and their affiliated titles.<br> This is a tab separated list with 2 columns: item ID | titles | |
item_idx_mapping.csv | This file contains items and their source IDs in the Amazon datasets.<br> This is a tab separated list with 2 columns: item ID | source ID | |
user_idx_mapping.csv | This file contains users and their source IDs in the Amazon datasets.<br> This is a tab separated list with 2 columns: user ID | source ID | |
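A minimal sketch of loading and joining these tab-separated files, for example attaching each bundle's items (bundle_item.csv) to its annotated intent (bundle_intent.csv). Inline strings stand in for the actual files, the toy IDs and intents are made up, and the header row assumed here follows the column descriptions above (whether the raw files carry a header is an assumption):

```python
import csv
import io

def read_tsv(f):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(f, delimiter="\t"))

# Inline stand-ins for dataset/<domain>/bundle_item.csv and bundle_intent.csv.
bundle_item = io.StringIO("bundle ID\titem ID\n0\t10\n0\t11\n1\t12\n1\t13\n")
bundle_intent = io.StringIO("bundle ID\tintent\n0\tcamping trip\n1\tdaily outfit\n")

# Group items by bundle ID.
items_by_bundle = {}
for row in read_tsv(bundle_item):
    items_by_bundle.setdefault(row["bundle ID"], []).append(row["item ID"])

# Join: attach each bundle's item list to its annotated intent.
bundles = {row["bundle ID"]: {"intent": row["intent"],
                              "items": items_by_bundle.get(row["bundle ID"], [])}
           for row in read_tsv(bundle_intent)}
```

The same pattern extends to the other files, e.g. joining session_bundle.csv with user_session.csv to recover which user produced which bundle.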
Cite
Please cite the following papers if you use our dataset in a research paper:
@inproceedings{sun2022revisiting,
title={Revisiting Bundle Recommendation: Datasets, Tasks, Challenges and Opportunities for Intent-Aware Product Bundling},
author={Sun, Zhu and Yang, Jie and Feng, Kaidong and Fang, Hui and Qu, Xinghua and Ong, Yew Soon},
booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
year={2022}
}
@article{sun2024revisiting,
title={Revisiting Bundle Recommendation for Intent-aware Product Bundling},
author={Sun, Zhu and Feng, Kaidong and Yang, Jie and Fang, Hui and Qu, Xinghua and Ong, Yew-Soon and Liu, Wenyuan},
journal={ACM Transactions on Recommender Systems},
year={2024},
publisher={ACM New York, NY}
}
Acknowledgements
Our datasets are constructed on the basis of Amazon datasets (http://jmcauley.ucsd.edu/data/amazon/links.html).
All pre-trained models in bundle generation explanation and bundle auto-naming are implemented based on Hugging Face (https://huggingface.co/).
All seq2seq models in bundle generation explanation are implemented based on PyTorch tutorials (https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html).