Fantastic Rewards and How to Tame Them: A Case Study on Reward Learning for Task-Oriented Dialogue Systems
Dependency
To install the required packages, first create and activate a `fantastic_reward` environment in conda. Then execute the following command:

```bash
bash install_packages.sh
```
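The repository does not spell out the environment-creation commands, so the following is a minimal sketch; the Python version (3.8) is an assumption and may need adjusting to match the requirements in install_packages.sh:

```bash
# Create and activate the conda environment.
# Python 3.8 is an assumption, not pinned by the repository.
conda create -n fantastic_reward python=3.8 -y
conda activate fantastic_reward

# Install the dependencies listed in install_packages.sh.
bash install_packages.sh
```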
Experiments
Data Setup
Our data setup follows the CASPI paper.
Please download the pre-processed data from here.
Unzip the downloaded file and put the resulting folder `ExpStoreddata` into the folder `damd_multiwoz`.
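A minimal sketch of this step is shown below; the archive name `ExpStoreddata.zip` is an assumption, so substitute the actual name of the downloaded file:

```bash
# The archive name below is an assumption; use the actual name of the
# downloaded file.
unzip ExpStoreddata.zip

# Move the extracted data folder into damd_multiwoz/.
mv ExpStoreddata damd_multiwoz/
```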
Training the Reward and Response Models
For our variant of RewardNet+GS $N = 3,\Phi=(\cdot)^1$ in Table 1 of the paper, please run the following command
bash ./run_multiple_seeds.sh --EXP_IDX ${EXP_IDX} --REWARD_SAMPLES 3 --REWARD_LOSS "listNet" --LISTMLE_TEMP 1 --LISTNET_POW 1 --POLICY_TRAIN_DATA_FRAC 1 --NEG_REW_WEIGHT 0.1 --REW_MODEL_EXP '0'
where ${EXP_IDX}
is the index of the experiment, such as "2023"
.
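For example, a concrete invocation with the illustrative experiment index 2023 would look like this:

```bash
# Example invocation; the experiment index 2023 is illustrative.
EXP_IDX=2023
bash ./run_multiple_seeds.sh --EXP_IDX ${EXP_IDX} --REWARD_SAMPLES 3 --REWARD_LOSS "listNet" --LISTMLE_TEMP 1 --LISTNET_POW 1 --POLICY_TRAIN_DATA_FRAC 1 --NEG_REW_WEIGHT 0.1 --REW_MODEL_EXP '0'
```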
For our variant of RewardMLE+GS $N = 5,\Phi=\exp(\cdot)$ in Table 1 of the paper, please run the following command:

```bash
bash ./run_multiple_seeds.sh --EXP_IDX ${EXP_IDX} --REWARD_SAMPLES 5 --REWARD_LOSS "listMLE" --LISTMLE_TEMP 1 --LISTNET_POW 0 --POLICY_TRAIN_DATA_FRAC 1 --NEG_REW_WEIGHT 1.0 --REW_MODEL_EXP '0'
```

where `${EXP_IDX}` is again the index of the experiment.
Evaluating the Released Checkpoints
To facilitate reproducibility, we release a checkpoint for each of the variants RewardNet+GS $N = 3,\Phi=(\cdot)^1$ and RewardMLE+GS $N = 5,\Phi=\exp(\cdot)$ in Table 1 of the paper.
Both released checkpoints are trained with random seed 999, one of the five tested seeds (111, 333, 555, 777, 999).
To evaluate the checkpoints, please follow the steps below.
Here `Exp1` corresponds to the variant RewardNet+GS $N = 3,\Phi=(\cdot)^1$ and `Exp2` to RewardMLE+GS $N = 5,\Phi=\exp(\cdot)$.
- Download and unzip the checkpoints from here.
- Download and unzip the processed data from here. Put the resulting folders into the folder `damd_multiwoz`.
- Try the following command:

```bash
python train.py --model_path "experiments/Exp${EXP_IDX}/all_sd999/" \
    --mode 'test' --context_window 2 --pretrained_checkpoint bart-large-cnn \
    --back_bone bart --cfg seed=999 cuda_device=0 batch_size=8 early_stop_count=7 \
    --caspi_returns_file="fn_Gs_10_0.0_resp_soft.json" --caspi_wt=5. \
    --caspi_data_file=data_for_damd.json --caspi_val_fraction=.5 --caspi \
    --data_folder "Exp${EXP_IDX}data/s999_K10_GAMMA0.0" \
    --exp_idx ${EXP_IDX}
```

where `${EXP_IDX}` should be replaced by 1 or 2.
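If helpful, both released checkpoints can be evaluated in sequence with a small wrapper loop; this is only a convenience sketch that reuses the command above unchanged:

```bash
# Convenience sketch: evaluate both released checkpoints in sequence
# (Exp1 = RewardNet+GS, Exp2 = RewardMLE+GS), reusing the command above.
for EXP_IDX in 1 2; do
    python train.py --model_path "experiments/Exp${EXP_IDX}/all_sd999/" \
        --mode 'test' --context_window 2 --pretrained_checkpoint bart-large-cnn \
        --back_bone bart --cfg seed=999 cuda_device=0 batch_size=8 early_stop_count=7 \
        --caspi_returns_file="fn_Gs_10_0.0_resp_soft.json" --caspi_wt=5. \
        --caspi_data_file=data_for_damd.json --caspi_val_fraction=.5 --caspi \
        --data_folder "Exp${EXP_IDX}data/s999_K10_GAMMA0.0" \
        --exp_idx ${EXP_IDX}
done
```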
Standardized Evaluation Results
The following table shows the standardized evaluation results of our "RewardNet+GS" model.
Detailed numbers are provided in `Example_generation/result_standard_eval.json`.
| BLEU | Inform | Success | Combined Score | Av. len. | CBE | #uniq. words | #uniq. 3-grams |
|---|---|---|---|---|---|---|---|
| 17.6 | 87.6 | 81.5 | 102.2 | 13.22 | 1.99 | 423 | 3942 |
Examples of generated dialogues on the test split of MultiWOZ 2.0 can be found at `Example_generation/gen_test_formatted.json`.
Acknowledgement
This codebase builds on the following codebases and datasets: