# RGB
An implementation for Benchmarking Large Language Models in Retrieval-Augmented Generation.
## News
- [2024/03] We refined the retrieved documents and some answers of `en.json` and `zh.json`, and named the new data files `en_refine.json` and `zh_refine.json`.
## Environment
```bash
conda create -n rgb python=3.10.0
conda activate rgb
bash env.sh
```
## Retrieval-Augmented Generation Benchmark
The data is located in `data/`:
```
data/
├── en.json
├── en_refine.json
├── en_int.json
├── en_fact.json
├── zh.json
├── zh_refine.json
├── zh_int.json
└── zh_fact.json
```
To evaluate Information Integration, use `zh_int` for Chinese questions or `en_int` for English questions.

To evaluate Counterfactual Robustness, use `zh_fact` for Chinese questions or `en_fact` for English questions.
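For a quick look at the data, something like the sketch below can be used. It is only a sketch: the file may be a JSON array or JSON Lines, and any field names you rely on (e.g. a query or answer key) should first be checked against the keys actually present in the files.

```python
import json

def load_instances(path):
    """Load a benchmark file, whether it is a JSON array or JSON Lines."""
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if text.startswith("["):
        return json.loads(text)  # a single JSON array
    # otherwise assume one JSON object per line
    return [json.loads(line) for line in text.splitlines() if line.strip()]

instances = load_instances("data/en.json")
print(len(instances), "instances")
print(sorted(instances[0].keys()))  # inspect the real field names before using them
```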
### The refined data
We refined the retrieved documents and some answers of `en.json` and `zh.json`, and named the new data files `en_refine.json` and `zh_refine.json`:

- Removing incorrect positive and negative documents.
- Adding some positive documents.
- Correcting some inaccurate answers.
## Evaluation
To evaluate ChatGPT, run:
```bash
python evalue.py \
    --dataset en \
    --modelname chatgpt \
    --temp 0.2 \
    --noise_rate 0.6 \
    --api_key YourAPIKEY \
    --passage_num 5
```
To evaluate other models, run:
```bash
python evalue.py \
    --dataset en \
    --modelname chatglm2-6b \
    --temp 0.2 \
    --noise_rate 0.6 \
    --plm THUDM/chatglm-6b \
    --passage_num 5
```
You should change `modelname` and `plm` for different models:

- `plm` is the path of the model.
- `temp` is the temperature of the model.
- `noise_rate` is the rate of noisy documents in the inputs.
- `passage_num` is the number of documents provided to the LLM (default is 5).
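For example, a sweep over several noise rates could look like the sketch below; the model flags are only illustrative and should match the model you are evaluating.

```bash
for rate in 0.2 0.4 0.6 0.8; do
    python evalue.py \
        --dataset en \
        --modelname chatglm2-6b \
        --temp 0.2 \
        --noise_rate $rate \
        --plm THUDM/chatglm-6b \
        --passage_num 5
done
```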
The outputs are:

- `all_rate`: the accuracy (when `noise_rate < 1`) or the rejection rate (when `noise_rate = 1`).
- `fact_check_rate`: the error detection rate (ED).
To evaluate rejection using ChatGPT, you should first run `evalue.py` with `noise_rate=1` to obtain the generation results, and then run:
```bash
python reject_evalue.py \
    --dataset en \
    --modelname chatglm2-6b \
    --api_key YourAPIKEY
```
The "reject_rate" in the outputs are the reject rate (Rej*).
To evaluate counterfactual robustness using ChatGPT, you should first run `evalue.py` with `dataset=en_fact` or `dataset=zh_fact` to obtain the generation results, and then run:
```bash
python fact_evalue.py \
    --dataset en_fact \
    --modelname chatglm2-6b \
    --api_key YourAPIKEY
```
The "reject_rate" in the outputs are the error detection rates (ED*). The correct_rate
in the outputs are the error correction rate (CR)
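Likewise, the first step here (generation on the counterfactual data) could look as follows; the noise and model flags are only illustrative:

```bash
python evalue.py \
    --dataset en_fact \
    --modelname chatglm2-6b \
    --temp 0.2 \
    --noise_rate 0.6 \
    --plm THUDM/chatglm-6b \
    --passage_num 5
```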
## License
The code and data are released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License for Noncommercial use only. Any commercial use should get formal permission first.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.