Adversarial Preference Optimization

Code License Data License Python 3.8+

This repo contains the implementation of the ACL 2024 (Findings) paper: Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game.

In Adversarial Preference Optimization (APO), we let the reward model (RM) and LLM agent play a min-max game, through which both models can be further enhanced without additional preference annotation.

<p align="center"> <img src="figures/apo_framework_v.png" height="60%" width="60%"> </p>
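Schematically, APO solves a min-max problem of the following form (the notation is ours and omits the paper's regularization terms; see the paper for the exact objective):

$$\min_{\pi}\ \max_{r}\ \mathbb{E}_{x}\Big[\mathbb{E}_{y_{\mathrm{gold}}}\big[r(x, y_{\mathrm{gold}})\big] - \mathbb{E}_{y\sim\pi(\cdot\mid x)}\big[r(x, y)\big]\Big]$$

Here $r$ is the RM, $\pi$ is the LLM policy, $y_{\mathrm{gold}}$ is a golden (human-preferred) answer, and $y$ is an LLM sample: the RM is updated to enlarge the score gap between golden answers and current LLM samples, while the LLM is updated to close it.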

For an overview, the repo covers the Python environment, the split HH data and annotation files, base and APO RM training scripts, and RM scoring with rejection sampling.

Environment

We use Python 3.8 with the dependencies listed in requirements.txt. To build the environment, run:

pip3 install -r requirements.txt
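For example, a fresh environment can be set up as follows (conda is only an illustration here; any Python 3.8 environment works):

conda create -n apo python=3.8 -y
conda activate apo
pip3 install -r requirements.txt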

Data & Annotation

To update the RM and the LLM separately, we split the cleaned Helpful&Harmless (HH) dataset into an RM training set and an LLM training set.

| Data Type | HH-RM Train Set | HH-LLM Train Set | HH Test Set |
| --- | --- | --- | --- |
| Preference Pairs | RM training set | RM validation set (sampled 10K pairs) | RM testing set |
| Golden Answers | APO positive responses | | |
| LLM Samples | APO negative responses (alpaca_rm_samples) | LLM alignment samples (alpaca_llm_samples) | LLM testing queries |

On both the HH-RM and HH-LLM training sets, we infer four LLM responses for each query, stored as alpaca_rm_samples and alpaca_llm_samples, respectively. alpaca_rm_samples is combined with the golden responses on the HH-RM set to form the APO RM training pairs. alpaca_llm_samples is further scored by RMs and used for LLM alignment. To generate the LLM responses yourself, run:

bash tools/llm_response_gen.sh
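For reference, the commands below read and write the following files under data/hh-split (layout inferred from the paths used in this README; your checkout may differ slightly):

data/hh-split/
├── rm_data/      # hh_split_rm.train.json, hh_split_rm.golden.json, hh_split_rm_alpaca_v0.sample.json
├── llm_data/     # hh_split_llm_alpaca_v0.sample.json
├── eval_data/    # hh_cleaned_origin.test.json, hh_split_llm.valid.json
└── apo_data/     # generated by tools/apo_data_converter.py (see "APO RM Training")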

RM Training

Base RM Training

We build our RM on the pretrained LLaMA-7B (decapoda-research/llama-7b-hf). To train the base RM for rejection sampling, use the following command:

REPO_DIR=<path_to_this_repo>
DATA_DIR=${REPO_DIR}/data/hh-split
TRAIN_DATA_LIST="${DATA_DIR}/rm_data/hh_split_rm.train.json"
TEST_DATA_LIST="${DATA_DIR}/eval_data/hh_cleaned_origin.test.json \
		${DATA_DIR}/eval_data/hh_split_llm.valid.json"
		
NUM_GPUS=8
BATCH_SIZE=64
MICRO_BATCH_SIZE=1
LEARNING_RATE=1e-6
GRADIENT_ACCUMULATION_STEP=$((BATCH_SIZE / NUM_GPUS / MICRO_BATCH_SIZE))

torchrun --nproc_per_node=${NUM_GPUS} --master_port=6000 ${REPO_DIR}/train.py \
    --task_type hh_split \
    --do_train True \
    --eval_at_start False \
    --model_type reward \
    --model_name_or_path "decapoda-research/llama-7b-hf" \
    --data_type "comparison_pair" \
    --train_data_path ${TRAIN_DATA_LIST} \
    --eval_data_path ${TEST_DATA_LIST} \
    --rm_calibration True \
    --data_suffix rm_base \
    --add_sep_token True \
    --remove_unused_columns false \
    --output_dir <path_to_save_your_RM_checkpoint> \
    --num_train_epochs 1 \
    --per_device_train_batch_size ${MICRO_BATCH_SIZE} \
    --per_device_eval_batch_size ${MICRO_BATCH_SIZE} \
    --gradient_accumulation_steps ${GRADIENT_ACCUMULATION_STEP} \
    --evaluation_strategy steps \
    --padding_side right \
    --truncation_side left \
    --pooling_type last \
    --max_length 512 \
    --save_strategy steps \
    --learning_rate ${LEARNING_RATE} \
    --warmup_steps 100 \
    --deepspeed configs/default_offload_opt_param.json \
    --tf32 false --fp16 false
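As a quick sanity check on the hyperparameters above, the effective batch size is NUM_GPUS × MICRO_BATCH_SIZE × GRADIENT_ACCUMULATION_STEP:

# 8 GPUs x 1 sequence per device x 8 accumulation steps = 64 sequences per optimizer update
echo $((NUM_GPUS * MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION_STEP))   # prints 64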

We also train a testing RM to automatically evaluate LLM response quality on the test queries. To train the testing RM, change TRAIN_DATA_LIST=${DATA_DIR}/hh_cleaned_origin.train.json in the command above, so that the RM learns from all the HH training comparisons, as shown below.
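Concretely, the only change is the training data list (this assumes hh_cleaned_origin.train.json sits directly under ${DATA_DIR}, as written above):

# testing RM: learn from all HH training comparisons instead of only the HH-RM split
TRAIN_DATA_LIST="${DATA_DIR}/hh_cleaned_origin.train.json"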

The RM training data files (the values in TRAIN_DATA_LIST) are lists of dictionaries; with --data_type="comparison_pair", each dictionary is one RM training item containing the following keys:

APO RM Training

To train the APO RM, first merge LLM samples and golden annotations into APO comparison pairs:

REPO_DIR=<path_to_this_repo>
DATA_DIR="${REPO_DIR}/data/hh-split"

python3 ${REPO_DIR}/tools/apo_data_converter.py \
	--golden_data_path ${DATA_DIR}/rm_data/hh_split_rm.golden.json \
	--sample_data_path ${DATA_DIR}/rm_data/hh_split_rm_alpaca_v0.sample.json \
	--output_dir ${DATA_DIR}/apo_data \
	--apo_data_name "rm_apo_data_v0"
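Assuming the converter derives its output file name from --apo_data_name (this matches the TRAIN_DATA_LIST used in the APO RM command below), the merged pairs can be checked before training:

ls ${DATA_DIR}/apo_data/rm_apo_data_v0_text_scores.json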

Then use the following command to conduct APO RM finetuning:

REPO_DIR=<path_to_this_repo>
DATA_DIR=${REPO_DIR}/data/hh-split
TRAIN_DATA_LIST="${DATA_DIR}/rm_data/hh_split_rm.train.json \
		 ${DATA_DIR}/apo_data/rm_apo_data_v0_text_scores.json"
NUM_APO_SAMPLES=4

TEST_DATA_LIST="${DATA_DIR}/eval_data/hh_cleaned_origin.test.json \
		${DATA_DIR}/eval_data/hh_split_llm.valid.json"
		
NUM_GPUS=8
BATCH_SIZE=64
MICRO_BATCH_SIZE=1
LEARNING_RATE=1e-6
APO_COEFF=0.1
GRADIENT_ACCUMULATION_STEP=$((BATCH_SIZE / NUM_GPUS / MICRO_BATCH_SIZE))


torchrun --nproc_per_node=${NUM_GPUS} --master_port=6000 ${REPO_DIR}/train.py \
    --task_type apo \
    --do_train True \
    --eval_at_start False \
    --model_type reward \
    --model_name_or_path "decapoda-research/llama-7b-hf" \
    --data_type "comparison_pair" \
    --train_data_path ${TRAIN_DATA_LIST} \
    --eval_data_path ${TEST_DATA_LIST} \
    --rm_calibration True \
    --data_suffix rm_apo_v1 \
    --add_sep_token True \
    --remove_unused_columns false \
    --output_dir <path_to_save_your_APO_RM_checkpoint> \
    --num_train_epochs 1 \
    --apo_loss_coeff ${APO_COEFF} \
    --apo_sample_num ${NUM_APO_SAMPLES} \
    --per_device_train_batch_size ${MICRO_BATCH_SIZE} \
    --per_device_eval_batch_size ${MICRO_BATCH_SIZE} \
    --gradient_accumulation_steps ${GRADIENT_ACCUMULATION_STEP} \
    --evaluation_strategy steps \
    --padding_side right \
    --truncation_side left \
    --pooling_type last \
    --max_length 512 \
    --save_strategy steps \
    --save_total_limit 10 \
    --learning_rate ${LEARNING_RATE} \
    --warmup_steps 100 \
    --deepspeed configs/default_offload_opt_param.json \
    --tf32 false --fp16 false
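Our reading of the two APO-specific flags, based on their names and the data pipeline above (see train.py and the paper for the exact objective): the APO RM keeps the standard ranking loss on the human-labeled preference pairs and adds an adversarial ranking term on the golden-vs-LLM-sample pairs produced by the converter, weighted by apo_loss_coeff and computed over apo_sample_num LLM samples per query:

$$\mathcal{L}_{\mathrm{APO\text{-}RM}} \approx \mathcal{L}_{\mathrm{rank}}(\text{human pairs}) + \lambda_{\mathrm{APO}}\,\mathcal{L}_{\mathrm{rank}}(\text{golden vs. LLM samples}), \qquad \lambda_{\mathrm{APO}} = \texttt{apo\_loss\_coeff}$$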

RM Scoring

After RM training is finished, we can use the following command to score new LLM samples:

REPO_DIR=<path_to_this_repo>
DATA_DIR=${REPO_DIR}/data/hh-split/llm_data
DATA_PATH="${DATA_DIR}/hh_split_llm_alpaca_v0.sample.json"

MODEL_PATH=<path_to_your_RM_checkpoint>
MODEL_NAME="base_rm" # or "apo_rm"

NUM_GPUS=8
MICRO_BATCH_SIZE=16

torchrun --nproc_per_node=${NUM_GPUS} --master_port=6000 ${REPO_DIR}/train.py \
    --task_type inference \
    --do_train False \
    --eval_at_start True \
    --model_type reward \
    --model_name_or_path ${MODEL_PATH} \
    --data_type "reject_sample" \
    --eval_data_path ${DATA_PATH} \
    --rm_calibration False \
    --data_suffix ${MODEL_NAME} \
    --add_sep_token True \
    --remove_unused_columns false \
    --output_dir <path_to_save_your_inference_results> \
    --per_device_eval_batch_size ${MICRO_BATCH_SIZE} \
    --evaluation_strategy steps \
    --padding_side right \
    --truncation_side left \
    --pooling_type last \
    --max_length 512 \
    --deepspeed configs/default_offload_opt_param.json \
    --tf32 false --fp16 false


# rejection sampling
SCORE_PATH=${DATA_PATH}_pred_${MODEL_NAME}_results.json
OUTPUT_FILE_NAME=${DATA_PATH}_rjs_${MODEL_NAME}.json

python3 ${REPO_DIR}/tools/rejection_sampling.py \
	--data_path ${DATA_DIR} \
	--score_path ${SCORE_PATH} \
	--output_dir ${DATA_DIR} \
	--rm_scorer  ${MODEL_NAME} \
	--output_file_name ${OUTPUT_FILE_NAME}

# remove tmp inference files
rm ${DATA_DIR}/*rank*.jsonl

After the inference process, we obtain an RM scoring file ${DATA_PATH}_rjs_${MODEL_NAME}.json. Then we can update the Alpaca model with the training pipeline here.
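Putting the pieces together, one APO round alternates the steps above (a schematic outline only; the LLM finetuning itself happens in the external Alpaca training pipeline):

# 1. sample LLM responses on both splits:             bash tools/llm_response_gen.sh
# 2. build APO pairs and finetune the APO RM:         tools/apo_data_converter.py + train.py --task_type apo
# 3. score LLM samples and run rejection sampling:    train.py --task_type inference + tools/rejection_sampling.py
# 4. finetune the LLM on ${DATA_PATH}_rjs_${MODEL_NAME}.json with the Alpaca pipeline, then repeat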

Citation

@inproceedings{cheng2024adversarial,
  title={Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game},
  author={Cheng, Pengyu and Yang, Yifan and Li, Jian and Dai, Yong and Hu, Tianhao and Cao, Peixin and Du, Nan and Li, Xiaolong},
  booktitle={Findings of the Association for Computational Linguistics},
  year={2024}
}