LC-Rec

This is the official PyTorch implementation for the paper:

Adapting Large Language Models by Integrating Collaborative Semantics for Recommendation

Overview

We propose LC-Rec, a new approach that integrates Language and Collaborative semantics to improve LLMs in Recommender systems. To bridge the large gap between the language semantics modeled by LLMs and the collaborative semantics implied by recommender systems, we make contributions in two aspects. For item indexing, we design a learning-based vector quantization method with uniform semantic mapping, which assigns meaningful and non-conflicting IDs (called item indices) to items. For alignment tuning, we propose a series of specially designed tuning tasks that enhance the integration of collaborative semantics in LLMs. These fine-tuning tasks push LLMs to deeply integrate language and collaborative semantics (characterized by the learned item indices), so that they adapt effectively to recommender systems.
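To make the item-indexing idea more concrete, the following is a minimal sketch of multi-level (residual) quantization, one common form of vector quantization. It is not the code in this repository: the codebooks are fixed toy values rather than learned, the uniform semantic mapping step is omitted, and the names `nearest` and `residual_quantize` are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(vec: Vector, codebook: Sequence[Vector]) -> int:
    """Return the position of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def residual_quantize(vec: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize vec level by level; each level encodes what earlier levels left over."""
    residual = list(vec)
    index = []
    for codebook in codebooks:
        code = nearest(residual, codebook)
        index.append(code)
        # Subtract the chosen code vector so the next level sees only the remainder.
        residual = [r - c for r, c in zip(residual, codebook[code])]
    return index


# Toy 2-level codebooks (hypothetical values; real codebooks would be learned).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # level 1: coarse
    [[0.0, 0.0], [0.2, -0.2]],  # level 2: fine
]
# Toy item embeddings (hypothetical).
item_embeddings = {"item_a": [1.0, 1.0], "item_b": [0.1, 0.0], "item_c": [1.2, 0.8]}
item_indices = {name: residual_quantize(vec, codebooks)
                for name, vec in item_embeddings.items()}
```

In this toy setup, `item_a` and `item_c` share the first code because their embeddings are close, and differ only in the second code. A conflict would occur if two items received the same full code sequence, which is the situation the uniform semantic mapping is meant to avoid.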

Figure: overview of the LC-Rec model.

Requirements

torch==1.13.1+cu117
accelerate
bitsandbytes
deepspeed
evaluate
peft
sentencepiece
tqdm
transformers

Model Checkpoint

The delta weights for the three datasets can be downloaded from the Hugging Face Hub (Instruments, Arts, Games). After downloading, add our deltas to the original LLaMA weights to obtain the LC-Rec weights:

1. Get the original LLaMA weights.
2. Use the following script to obtain the LC-Rec weights by applying our delta:

python convert/merge_delta.py \
    --base-model-path /path/to/llama-7b \
    --target-model-path /path/output/lc-rec \
    --delta-path bwzheng0324/lc-rec-games-delta
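Conceptually, applying a delta means adding it to the base weights parameter by parameter. The sketch below illustrates only that arithmetic, using plain Python lists in place of tensors. It assumes the base and the delta contain identical parameter names and shapes; how `convert/merge_delta.py` handles any differences (for example in the tokenizer or in tensor shapes) is not covered here, and `apply_delta` is a hypothetical name.

```python
from typing import Dict, List

StateDict = Dict[str, List[float]]


def apply_delta(base: StateDict, delta: StateDict) -> StateDict:
    """Return base + delta for every parameter name, element by element."""
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same parameter names")
    merged = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


# Toy "weights" (hypothetical names and values).
base = {"layer.weight": [0.5, -1.0], "layer.bias": [0.0]}
delta = {"layer.weight": [0.25, 0.5], "layer.bias": [-0.125]}
merged = apply_delta(base, delta)
```

Distributing deltas instead of full weights means the original LLaMA weights still have to be obtained separately, which is why step 1 comes first.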

Dataset

We use three datasets in our paper, all of which have been uploaded to Google Drive.

Train

The detailed scripts for all three datasets are in run.sh:

DATASET=Games
BASE_MODEL=huggyllama/llama-7b
DATA_PATH=./data
OUTPUT_DIR=./ckpt/$DATASET/

torchrun --nproc_per_node=8 --master_port=23324 finetune.py \
    --base_model $BASE_MODEL \
    --output_dir $OUTPUT_DIR \
    --dataset $DATASET \
    --data_path $DATA_PATH \
    --per_device_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --learning_rate 5e-5 \
    --epochs 4 \
    --weight_decay 0.01 \
    --save_and_eval_strategy epoch \
    --deepspeed ./config/ds_z3_bf16.json \
    --bf16 \
    --only_train_response \
    --tasks seqrec,item2index,index2item,fusionseqrec,itemsearch,preferenceobtain \
    --train_prompt_sample_num 1,1,1,1,1,1 \
    --train_data_sample_num 0,0,0,100000,0,0 \
    --index_file .index.json


cd convert
nohup ./convert.sh $OUTPUT_DIR >convert.log 2>&1 &
cd ..
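As a quick sanity check on the training command, the sketch below computes the effective batch size and pairs each task with its sampling settings. It assumes a single node with data-parallel training and that the three comma-separated lists are positional (entry i belongs to task i). The meaning of a `0` in `--train_data_sample_num` is not stated in this README; see `finetune.py` for the actual behavior.

```python
# Values copied from the training command above.
nproc_per_node = 8
per_device_batch_size = 8
gradient_accumulation_steps = 2

# Sequences consumed per optimizer step across all GPUs (single node assumed).
effective_batch_size = nproc_per_node * per_device_batch_size * gradient_accumulation_steps

tasks = "seqrec,item2index,index2item,fusionseqrec,itemsearch,preferenceobtain".split(",")
prompt_sample_num = [int(x) for x in "1,1,1,1,1,1".split(",")]
data_sample_num = [int(x) for x in "0,0,0,100000,0,0".split(",")]

# Assumption: the lists line up by position, so they must have equal length.
assert len(tasks) == len(prompt_sample_num) == len(data_sample_num)
per_task = {
    task: {"prompt_sample_num": p, "data_sample_num": d}
    for task, p, d in zip(tasks, prompt_sample_num, data_sample_num)
}

print(effective_batch_size)
print(per_task["fusionseqrec"])
```

With the values above, each optimizer step covers 8 x 8 x 2 = 128 sequences. If you train on fewer GPUs, raising `--gradient_accumulation_steps` proportionally keeps this number unchanged.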

Test

Test with a single GPU:

DATASET=Games
DATA_PATH=./data
CKPT_PATH=./ckpt/$DATASET/
RESULTS_FILE=./results/$DATASET/result.json

python test.py \
    --gpu_id 0 \
    --ckpt_path $CKPT_PATH \
    --dataset $DATASET \
    --data_path $DATA_PATH \
    --results_file $RESULTS_FILE \
    --test_batch_size 1 \
    --num_beams 20 \
    --test_prompt_ids all \
    --index_file .index.json

Test with multiple GPUs:

DATASET=Games
DATA_PATH=./data
CKPT_PATH=./ckpt/$DATASET/
RESULTS_FILE=./results/$DATASET/result.json

torchrun --nproc_per_node=8 --master_port=23324 test_ddp.py \
    --ckpt_path $CKPT_PATH \
    --dataset $DATASET \
    --data_path $DATA_PATH \
    --results_file $RESULTS_FILE \
    --test_batch_size 1 \
    --num_beams 20 \
    --test_prompt_ids all \
    --index_file .index.json
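Multi-GPU testing generally works by giving each process its own slice of the test set and combining the per-process results afterwards. The sketch below shows one simple way to split the data, a strided split by rank. Whether `test_ddp.py` uses this exact scheme is an assumption, and `shard_for_rank` is a hypothetical helper.

```python
from typing import List, Sequence


def shard_for_rank(samples: Sequence, rank: int, world_size: int) -> List:
    """Return the samples handled by one process: every world_size-th item starting at rank."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    return list(samples[rank::world_size])


test_samples = list(range(20))  # stand-in for 20 test sequences
world_size = 8                  # matches --nproc_per_node=8
shards = [shard_for_rank(test_samples, r, world_size) for r in range(world_size)]
```

Here the shards are disjoint and together cover every sample, but they are not all the same size. PyTorch's `DistributedSampler` uses a similar strided split and, by default, pads the shards to equal length by repeating a few samples, so duplicates may need to be accounted for when aggregating metrics.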

Acknowledgement

The implementation is based on HuggingFace.

If you use our code or the processed datasets, please cite the following paper:

@inproceedings{zheng2024adapting,
  title={Adapting large language models by integrating collaborative semantics for recommendation},
  author={Zheng, Bowen and Hou, Yupeng and Lu, Hongyu and Chen, Yu and Zhao, Wayne Xin and Chen, Ming and Wen, Ji-Rong},
  booktitle={2024 IEEE 40th International Conference on Data Engineering (ICDE)},
  pages={1435--1448},
  year={2024},
  organization={IEEE}
}