ONCE: Boosting Content-based Recommendation with Both Open- and Closed-source Large Language Models

ONCE

Notes

All recommendation experiments are conducted with our content-based recommendation repository, Legommenders. It provides a set of news recommenders and click-through-rate prediction models, and its modular design supports integration with pretrained language models (PLMs) and large language models (LLMs).

Updated on Nov. 27, 2023: We fixed minor bugs in the README and added the citation.

Updated on Nov. 5, 2023: We have released the majority of our datasets here, based on which you can reproduce the content summarizer of GENRE and all the open-source results of DIRE.

Updated on Oct. 22, 2023: We have established our training code and pipeline.

Updated on Oct. 20, 2023: Our paper has been accepted by WSDM 2024.

Abstract

Personalized content-based recommender systems have become indispensable tools for users to navigate through the vast amount of content available on platforms like daily news websites and book recommendation services. However, existing recommenders face significant challenges in understanding the content of items. Large language models (LLMs), which possess deep semantic comprehension and extensive knowledge from pretraining, have proven to be effective in various natural language processing tasks. In this study, we explore the potential of leveraging both open- and closed-source LLMs to enhance content-based recommendation. With open-source LLMs, we utilize their deep layers as content encoders, enriching the representation of content at the embedding level. For closed-source LLMs, we employ prompting techniques to enrich the training data at the token level. Through comprehensive experiments, we demonstrate the high effectiveness of both types of LLMs and show the synergistic relationship between them. Notably, we observe a significant relative improvement of up to 19.32% compared to existing state-of-the-art recommendation models. These findings highlight the immense potential of both open- and closed-source LLMs in enhancing content-based recommendation systems. We will make our code and LLM-generated data available for other researchers to reproduce our results.

GENRE: Prompting Closed-source LLMs for Content-based Recommendation

Overview

We use the GPT-3.5-turbo API provided by OpenAI as the closed-source LLM. All request code is included in this repository.
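
For reference, below is a minimal sketch of what such a request might look like with the openai Python package. The prompt wording, temperature, and function name are illustrative assumptions, not the exact prompts used in news_summarizer.py or book_summarizer.py.

```python
# Minimal sketch of prompting GPT-3.5-turbo as a content summarizer.
# The prompt and parameters are illustrative; see the request code in this
# repository for the exact prompts used by GENRE.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_news(title: str, abstract: str) -> str:
    """Ask GPT-3.5-turbo for a one-sentence summary of a news article."""
    prompt = (
        "Summarize the following news article in one concise sentence.\n"
        f"Title: {title}\nAbstract: {abstract}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # deterministic output for easier reproduction
    )
    return response.choices[0].message.content.strip()
```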

Codes and corresponding generated data

| Dataset | Scheme | Request Code | Generated Data |
|---|---|---|---|
| MIND | Content Summarizer | news_summarizer.py | data/mind/news_summarizer.log |
| MIND | User Profiler | user_profiler_mind.py | data/mind/user_profiler.log |
| MIND | Personalized Content Generator | personalized_news_generator.py | data/mind/generator_v1.log, data/mind/generator_v2.log |
| Goodreads | Content Summarizer | book_summarizer.py | data/goodreads/book_summarizer.log |
| Goodreads | User Profiler | user_profiler_goodreads.py | data/goodreads/user_profiler.log |
| Goodreads | Personalized Content Generator | personalized_book_generator.py | data/goodreads/generator_v1.log, data/goodreads/generator_v2.log |

DIRE: Finetuning Open-source LLMs for Content-based Recommendation

We use BERT-12L, LLaMA-7B and LLaMA-13B as open-source LLMs.
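
The actual encoders live in Legommenders; as a rough, hypothetical illustration of the embedding-level idea (using hidden states from a deep layer of an open-source LLM as the item representation), the sketch below uses Hugging Face transformers. The checkpoint name, layer index, and mean pooling are assumptions, not the exact DIRE configuration.

```python
# Rough illustration (not the Legommenders implementation): encode an item's
# text with an open-source LLM and pool hidden states from one deep layer
# into a content vector.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # stands in for BERT-12L; a LLaMA checkpoint works the same way
LAYER = -1                        # which hidden layer to treat as the content encoder output

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

@torch.no_grad()
def encode_item(text: str) -> torch.Tensor:
    """Return a mean-pooled vector built from one hidden layer of the LLM."""
    inputs = tokenizer(text, truncation=True, max_length=64, return_tensors="pt")
    hidden = model(**inputs).hidden_states[LAYER]   # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)   # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```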

Training Pipeline

Overview

Following the Legommenders framework, the training pipeline is organized as follows:

GENRE and DIRE will be integrated at different stages.

| Pipeline | ORIGINAL | GENRE | DIRE | Comments |
|---|:---:|:---:|:---:|---|
| Data Tokenization | ✓ | ✓ | × | New data generated by GENRE will be tokenized. |
| Config: Data Selection | ✓ | ✓ | × | DIRE can use the same data as the original one. |
| Config: Lego Selection | ✓ | × | ✓ | GENRE can use the same lego modules as the original one. |
| Config: Weight Init. | ✓ | × | ✓ | GENRE can use the same weight initialization as the original one. |
| Config: Hyperparameters | ✓ | × | ✓ | GENRE can use the same hyperparameters as the original one. |
| Training Prep. | × | × | ✓ | Only DIRE needs to cache the upper-layer hidden states (see the sketch below). |
| Training | ✓ | ✓ | ✓ | |
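
To illustrate the Training Prep. step for DIRE, here is a hypothetical sketch of caching hidden states: every item's text is pushed once through the frozen bottom layers of the LLM and the resulting hidden states are stored, so training only has to run the tunable top layers. The checkpoint, split layer, and cache format are assumptions, not the Legommenders implementation.

```python
# Hypothetical sketch of the caching step for DIRE training preparation.
# Checkpoint, split layer, and file format are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "huggyllama/llama-7b"  # assumed checkpoint; any LLaMA-7B weights would do
SPLIT_LAYER = 30                    # hidden states after this layer are cached (illustrative)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

@torch.no_grad()
def cache_hidden_states(items: dict, path: str = "hidden_cache.pt") -> None:
    """Precompute hidden states of the frozen bottom layers for every item."""
    cache = {}
    for item_id, text in items.items():
        inputs = tokenizer(text, truncation=True, max_length=64, return_tensors="pt")
        outputs = model(**inputs)
        # hidden_states[SPLIT_LAYER] is the input to the remaining (tunable) top layers
        cache[item_id] = outputs.hidden_states[SPLIT_LAYER].squeeze(0).cpu()
    torch.save(cache, path)  # training later loads this file and only runs the top layers
```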

Preparation

Data Tokenization (Optional)

Please refer to process/mind/processor_unitokv3.py and process/goodreads/ in the Legommenders repo for the preprocessing scripts. More detailed instructions can be found in the UnifiedTokenizer repository, which is the tokenization toolkit used by Legommenders. To integrate the GENRE-generated data, conduct similar operations, or directly use the tokenized data provided by us.
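
As a generic illustration of what this step produces (not the UnifiedTokenizer API), the sketch below converts item titles into token-id sequences with a Hugging Face tokenizer; the column name, vocabulary, and sequence length are assumptions.

```python
# Generic illustration of item tokenization (NOT the UnifiedTokenizer API):
# turn each item's text field into a token-id sequence for the recommenders.
import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # vocabulary choice is an assumption

def tokenize_titles(news: pd.DataFrame, max_len: int = 32) -> pd.DataFrame:
    """Convert each item's title into a truncated list of token ids."""
    news = news.copy()
    news["title_ids"] = news["title"].apply(
        lambda title: tokenizer.encode(title, truncation=True, max_length=max_len)
    )
    return news
```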

Configurations

Please refer to config/data/mind.yaml and config/data/goodreads.yaml for the data selection.

Please refer to config/model/lego_naml.yaml and other config files for the lego module selection.

Please refer to config/embed/null.yaml and other config files for the weight initialization.

Please refer to config/exp/tt-naml.yaml and other config files for the hyperparameter settings.

Hyperparameters for running worker.py:

Training Preparation (for tuning LLaMA)

python worker.py --embed config/embed/<embed>.yaml --model config/model/llama-naml.yaml --exp config/exp/llama-split.yaml --data config/data/mind-llama.yaml --version small --llm_ver <llm_ver> --hidden_size 64 --layer 0 --lora 0 --fast_eval 0 --embed_hidden_size <embed_hidden_size>

Training and Testing

python worker.py --data config/data/mind-llama.yaml --embed config/embed/<embed>.yaml --model config/model/bert-<basemodel>.yaml --exp config/exp/tt-llm.yaml --embed_hidden_size <embed_hidden_size> --llm_ver <llm_ver> --layer <layer> --version small --lr 0.0001 --item_lr 0.00001 --batch_size 32 --acc_batch 2 --epoch_batch -4

Citation

If you find our work useful in your research, please consider citing:

@inproceedings{liu2023once,
  title={ONCE: Boosting Content-based Recommendation with Both Open- and Closed-source Large Language Models},
  author={Qijiong Liu and Nuo Chen and Tetsuya Sakai and Xiao-Ming Wu},
  booktitle={Proceedings of the Seventeenth ACM International Conference on Web Search and Data Mining},
  year={2024}
}