# Tree of Clarifications (ToC)
This is the official repository for our EMNLP 2023 paper: **Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models**.
<div align="center"> <img alt="ToC Overview" src="https://github.com/gankim/tree-of-clarifications/blob/main/assets/overview.png" width="400px"> </div>

## Summary
We propose a novel framework, <b>Tree of Clarifications (ToC)</b>, designed for generating long-form answers to ambiguous questions.
- It guides LLMs to explore diverse interpretations of an ambiguous question in <u>a tree structure and to prune</u> unhelpful ones
- We combine <u>retrieval-augmented generation (RAG) with LLMs</u> and achieve state-of-the-art performance on ASQA
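The tree exploration described above can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: `generate` and `is_helpful` are hypothetical stand-ins for the LLM disambiguation call and the pruning check.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One disambiguated interpretation of the ambiguous question."""
    question: str
    children: list = field(default_factory=list)

def expand(question, generate, is_helpful, depth=2, branching=3):
    """Recursively ask for disambiguations of `question`, pruning unhelpful ones.

    `generate(q, n)` proposes up to n clarified questions for q;
    `is_helpful(q)` decides whether a candidate is worth keeping.
    """
    node = Node(question)
    if depth == 0:
        return node
    for child_q in generate(question, branching):
        if is_helpful(child_q):  # prune interpretations judged unhelpful
            node.children.append(
                expand(child_q, generate, is_helpful, depth - 1, branching))
    return node

def leaves(node):
    """Collect the leaf interpretations, which seed the long-form answer."""
    if not node.children:
        return [node.question]
    return [q for child in node.children for q in leaves(child)]
```

In the real framework, each node is additionally grounded with retrieved passages before the long-form answer is composed from the surviving leaves.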
## Environments
To facilitate a smooth setup, we suggest creating a Conda environment from the provided configuration:

```shell
conda env create -f environment.yml
```
Activate the newly created environment with:
```shell
conda activate toc
```
## Preparing the Dataset
Access and download the ASQA dataset here, or use the pre-packaged version in our repository at `./asqa/ASQA.json`.
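To get a feel for the data, the ambiguous questions of one split can be loaded as below. This is a hypothetical helper, assuming the public ASQA release layout (top-level split names mapping example IDs to records with an `ambiguous_question` field); verify the field names against your copy.

```python
import json

def load_ambiguous_questions(path, split="dev"):
    """Map example IDs to ambiguous questions for one ASQA split.

    Assumes the released JSON layout; field names may differ in other versions.
    """
    with open(path) as f:
        data = json.load(f)
    return {ex_id: rec["ambiguous_question"] for ex_id, rec in data[split].items()}
```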
## Integrating Bing Search Engine Results
ToC is capable of incorporating search results from external sources, such as the Bing search engine, to enhance answer quality. Follow the script below to fetch search results, or use our pre-compiled dataset at `./bing/results.json`. Omitting this step is possible but may slightly degrade ToC's performance.
Set your Bing API credentials:

```shell
export BING_SUBSCRIPTION_KEY=  # your Bing API key here
export BING_SEARCH_URL=  # your Bing search URL here
```
Please refer to the tutorial for detailed information about setting up your subscription.
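With those variables exported, a single query can be issued as sketched below. The `Ocp-Apim-Subscription-Key` header and the `q`/`count` parameters follow the Bing Web Search API reference; the helper names here are our own, not part of `bing_search.py`.

```python
import json
import os
import urllib.parse
import urllib.request

def build_bing_request(query, subscription_key, count=10):
    """Assemble the header and query parameters for one Bing Web Search call."""
    headers = {"Ocp-Apim-Subscription-Key": subscription_key}
    params = {"q": query, "count": str(count)}
    return headers, params

def bing_search(query, count=10):
    """Fetch raw JSON search results using the environment variables exported above."""
    headers, params = build_bing_request(
        query, os.environ["BING_SUBSCRIPTION_KEY"], count)
    url = os.environ["BING_SEARCH_URL"] + "?" + urllib.parse.urlencode(params)
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)
```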
Set the directory paths for the ASQA dataset and the Bing search results, then run the following script to search Wikipedia documents relevant to the ambiguous questions and save the results in `$BING_DIR`.
```shell
export ASQA_DIR=  # directory path to the ASQA dataset
export BING_DIR=  # directory path to Bing search results

python bing_search.py \
    --data_dir $ASQA_DIR \
    --output_dir $BING_DIR

python get_wiki.py \
    --data_dir $BING_DIR \
    --output_dir "top100" \
    --top_k 100
```
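The second step keeps only Wikipedia pages among the search hits, capped at `--top_k`. Purely as an illustration of that filtering (the real logic lives in `get_wiki.py`), assuming results arrive as `(title, url)` pairs:

```python
from urllib.parse import urlparse

def keep_wikipedia(results, top_k=100):
    """Keep up to `top_k` Wikipedia pages from a list of (title, url) results.

    Illustrative sketch only; see get_wiki.py for the actual implementation.
    """
    wiki = [(title, url) for title, url in results
            if urlparse(url).netloc.endswith("wikipedia.org")]
    return wiki[:top_k]
```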
## Answering ambiguous questions with ToC
Before running ToC, you need to specify the following. Set your OpenAI API key (see the OpenAI homepage for details) and the ColBERT server URL. We used the server hosted by DSPy; please note that the hosting server may change. To set up your own server, refer to the instructions here.
```shell
export OPENAI_KEY=  # your OpenAI API key here
export COLBERT_URL='http://ec2-44-228-128-229.us-west-2.compute.amazonaws.com:8893/api/search'
```
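For reference, retrieving passages from that endpoint can be sketched as below. This assumes a DSPy-style server that takes `query`/`k` query parameters and returns a JSON object with a `"topk"` list; adjust to match your server.

```python
import json
import urllib.parse
import urllib.request

def build_colbert_url(base_url, query, k=5):
    """Build the retrieval URL; `query` and `k` follow the DSPy-hosted server."""
    return base_url + "?" + urllib.parse.urlencode({"query": query, "k": k})

def colbert_search(base_url, query, k=5):
    """Fetch the top-k passages; assumes a JSON response with a "topk" list."""
    with urllib.request.urlopen(build_colbert_url(base_url, query, k),
                                timeout=10) as resp:
        return json.load(resp)["topk"]
```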
To run ToC, use the following script, specifying the necessary paths and options:
```shell
export ASQA_DIR=  # directory path to the ASQA dataset
export OUT_DIR=  # directory path to results

# --bing_path is optional; omit it to run without Bing search results
python run_toc.py \
    --data_dir $ASQA_DIR \
    --bing_path $BING_PATH \
    --openai_key $OPENAI_KEY \
    --colbert_url $COLBERT_URL \
    --verify \
    --output_dir $OUT_DIR \
    ${ARGS}
```
## Evaluating the long-form answers
To evaluate the answers generated by ToC, follow the guidelines provided in the official ASQA repository.
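The official metrics (e.g. Disambig-F1) come from that repository. For a quick local sanity check before running the full evaluation, one rough proxy is how many gold short answers appear verbatim in a generated long answer. This helper is hypothetical and is not the official ASQA evaluation:

```python
import string

def normalize(text):
    """Lowercase and strip punctuation, roughly SQuAD-style normalization."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def answer_recall(long_answer, gold_short_answers):
    """Fraction of gold short answers contained verbatim in the long answer."""
    if not gold_short_answers:
        return 0.0
    hits = sum(normalize(ans) in normalize(long_answer)
               for ans in gold_short_answers)
    return hits / len(gold_short_answers)
```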
## Reference
```bibtex
@article{kim2023tree,
  title={Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models},
  author={Gangwoo Kim and Sungdong Kim and Byeongguk Jeon and Joonsuk Park and Jaewoo Kang},
  journal={EMNLP},
  year={2023}
}
```