Tree of Clarifications (ToC)

This is the official repository for our EMNLP 2023 paper: Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models

<div align="center"> <img alt="ToC Overview" src="https://github.com/gankim/tree-of-clarifications/blob/main/assets/overview.png" width="400px"> </div>

Summary

We propose a novel framework, <b>Tree of Clarifications (ToC)</b>, designed for generating long-form answers to ambiguous questions.

Environments

To simplify setup, we suggest creating a Conda environment from the provided configuration:

conda env create -f environment.yml

Activate the newly created environment with:

conda activate toc

Preparing the Dataset

Download the ASQA dataset here, or use the pre-packaged version in our repository at ./asqa/ASQA.json.
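
Before moving on, it can help to confirm that the dataset file loads and to see how it is organized. The following is a minimal sketch that assumes only that the file is a JSON object; the helper name `summarize_json` is hypothetical and not part of this repository, and no particular ASQA schema is assumed.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return {top-level key: number of entries} for a JSON file.

    Hypothetical helper, not part of this repository. It assumes only that
    the file holds a JSON object and makes no claim about the ASQA schema.
    """
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    # Containers report their length; scalar values are counted as 1.
    return {
        key: len(value) if isinstance(value, (dict, list)) else 1
        for key, value in data.items()
    }


if __name__ == "__main__":
    dataset = Path("./asqa/ASQA.json")
    if dataset.exists():
        print(summarize_json(dataset))
    else:
        print(f"{dataset} not found; download or unpack the dataset first")
```

Run it from the repository root so that the relative path resolves.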

Integrating Bing Search Engine Results

ToC can incorporate search results from external sources, such as the Bing search engine, to improve answer quality. Run the scripts below to fetch search results, or use our pre-compiled dataset at ./bing/results.json. This step is optional, but skipping it may slightly reduce ToC's performance.

Set your Bing API credentials:

export BING_SUBSCRIPTION_KEY= # your Bing API key here
export BING_SEARCH_URL= # your Bing search URL here

Please refer to the tutorial for detailed information about setting up your subscription.
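
If the placeholder lines above are left unedited, the shell exports each variable with an empty value, which is easy to overlook. As a small sanity check, the sketch below reports which variables are unset or empty; `missing_env` is a hypothetical helper, not something shipped with this repository.

```python
import os


def missing_env(names, env=None):
    """Return the names that are unset or empty in the environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    # Variable names are the ones exported in the step above.
    missing = missing_env(["BING_SUBSCRIPTION_KEY", "BING_SEARCH_URL"])
    if missing:
        print("Missing or empty: " + ", ".join(missing))
    else:
        print("Bing credentials are set")
```

The same helper can be reused later for `OPENAI_KEY` and `COLBERT_URL`.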

Set the directory paths for the ASQA dataset and Bing search results. Then run the following scripts to search for Wikipedia documents relevant to the ambiguous questions and save the results in $BING_DIR.

export ASQA_DIR= # directory path to the ASQA dataset
export BING_DIR= # directory path to Bing search results

python bing_search.py \
    --data_dir $ASQA_DIR \
    --output_dir $BING_DIR

python get_wiki.py \
    --data_dir $BING_DIR \
    --output_dir "top100" \
    --top_k 100

Answering ambiguous questions with ToC

Before running ToC, set your OpenAI API key (refer to the OpenAI homepage to obtain one) and specify the ColBERT server URL. We used the server hosted by DSPy; please note that the hosting server may change. To set up your own server, refer to the instructions here.

export OPENAI_KEY= # your OpenAI API key here
export COLBERT_URL='http://ec2-44-228-128-229.us-west-2.compute.amazonaws.com:8893/api/search'
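
Because the hosted server may change, a quick reachability check can save a failed run. The sketch below is an assumption-laden example rather than part of this repository: the query parameters `query` and `k` follow the convention of DSPy's ColBERTv2 client, and a self-hosted server may expect something different. The function names `build_search_url` and `check_server` are hypothetical.

```python
import json
import os
import urllib.parse
import urllib.request


def build_search_url(base_url, query, k=3):
    """Append query-string parameters to the search endpoint.

    ASSUMPTION: the parameter names `query` and `k` follow DSPy's ColBERTv2
    client; a self-hosted server may expect something different.
    """
    params = urllib.parse.urlencode({"query": query, "k": k})
    return f"{base_url}?{params}"


def check_server(base_url, query="test", timeout=10):
    """Return True if the endpoint answers with parseable JSON, else False."""
    try:
        url = build_search_url(base_url, query)
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            json.load(resp)
        return True
    except (OSError, ValueError):
        # Covers connection errors, HTTP errors, timeouts, and invalid JSON.
        return False


if __name__ == "__main__":
    colbert_url = os.environ.get("COLBERT_URL", "")
    if colbert_url:
        print("reachable" if check_server(colbert_url) else "not reachable")
    else:
        print("COLBERT_URL is not set")
```

A result of "not reachable" only says that this particular request failed; it does not distinguish a server that is down from one that uses a different request format.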

To run ToC, use the following script, specifying the necessary paths and options:

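export ASQA_DIR= # directory path to the ASQA dataset
export OUT_DIR= # directory path to results
export BING_PATH= # path to the Bing search results (optional)

# --bing_path is optional; remove that line to run without Bing search results.
python run_toc.py \
    --data_dir $ASQA_DIR \
    --bing_path $BING_PATH \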
    --openai_key $OPENAI_KEY \
    --colbert_url $COLBERT_URL \
    --verify \
    --output_dir $OUT_DIR \
    ${ARGS}
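
For scripted runs, the same invocation can be assembled in Python so that the optional `--bing_path` flag is included only when a path is given. This is a sketch under the assumption that `run_toc.py` accepts exactly the flags shown above; `build_toc_command` is a hypothetical helper, not part of this repository.

```python
import sys


def build_toc_command(data_dir, output_dir, openai_key, colbert_url,
                      bing_path=None, verify=True, extra_args=()):
    """Assemble the argument list for run_toc.py using the flags shown above."""
    cmd = [
        sys.executable, "run_toc.py",
        "--data_dir", data_dir,
        "--openai_key", openai_key,
        "--colbert_url", colbert_url,
        "--output_dir", output_dir,
    ]
    if bing_path:
        # --bing_path is optional, so add it only when a path is provided.
        cmd += ["--bing_path", bing_path]
    if verify:
        cmd.append("--verify")
    # extra_args plays the role of ${ARGS} in the shell command.
    return cmd + list(extra_args)
```

Passing the returned list to `subprocess.run(cmd, check=True)` from the repository root would launch the run; building the list first makes it easy to inspect the command before executing it.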

Evaluating the long-form answers

To evaluate the answers generated by ToC, follow the guidelines provided in the official ASQA repository.

Reference

@article{kim2023tree,
  title={Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models},
  author={Gangwoo Kim and Sungdong Kim and Byeongguk Jeon and Joonsuk Park and Jaewoo Kang},
  journal={EMNLP},
  year={2023}
}