LLatrieval: LLM-Verified Retrieval for Verifiable Generation
This repository contains the code and data for the paper LLatrieval: LLM-Verified Retrieval for Verifiable Generation, including the code needed to reproduce the method proposed in the paper.
News
- [2024/03/13] Our submission to NAACL 2024, LLatrieval: LLM-Verified Retrieval for Verifiable Generation, has been accepted to the main conference.
- [2023/11/14] We have published the preprint version of the paper on arXiv.
- [2023/11/09] We have released the code for reproducing our method.
Requirements
- We recommend creating a dedicated Python virtual environment (conda is used here) before installing the dependencies.
conda create -n lvr python=3.9.7
- Next, activate the environment you just created.
conda activate lvr
- Finally, install the required packages before running the code.
pip install -r requirements.txt
Data
We uploaded the data to Hugging Face🤗.
Start by installing 🤗 Datasets:
pip install datasets
Load a dataset
The following command downloads the raw data to the data/ folder.
python download_data.py
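After the download finishes, a quick sanity check of the data/ folder can catch an interrupted or empty download. The helper below is only a sketch: `summarize_data_dir` is a hypothetical name that is not part of this repository, and it makes no assumption about which files download_data.py writes. It simply lists whatever it finds.

```python
from pathlib import Path
from typing import Dict


def summarize_data_dir(data_dir: str = "data") -> Dict[str, int]:
    """Map every file under data_dir (POSIX-style relative path) to its size in bytes.

    Hypothetical helper for a post-download sanity check; not part of the repo.
    """
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} not found; run download_data.py first")
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Calling `summarize_data_dir()` from the repository root returns a dictionary such as `{"some_file.json": 12345}`; a file with size 0 usually points to a failed download.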
Download corpus
Use the following commands to download the BM25_SPHERE_CORPUS.
wget -P faiss_index https://dl.fbaipublicfiles.com/sphere/sphere_sparse_index.tar.gz
tar -xzvf faiss_index/sphere_sparse_index.tar.gz -C faiss_index
Use the following commands to download the WIKI_TSV_CORPUS.
wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
gzip -d psgs_w100.tsv.gz
For more information about the Sphere and Wikipedia snapshot corpora, please refer to ALCE.
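Once psgs_w100.tsv is decompressed, it can be streamed passage by passage instead of being loaded into memory at once. The sketch below assumes the file follows the DPR release layout, that is, a tab-separated file with a header row of `id`, `text`, and `title`; `iter_passages` is a hypothetical helper and is not taken from this repository.

```python
import csv
from typing import Dict, Iterator


def iter_passages(tsv_path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per passage from a TSV file with an id/text/title header.

    Assumes the DPR-style column names; adjust them if your file differs.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # DictReader consumes the header row and handles quoted text fields.
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            yield {"id": row["id"], "text": row["text"], "title": row["title"]}
```

For a quick look at the first few passages, wrap the generator in `itertools.islice(iter_passages("psgs_w100.tsv"), 3)` so that the whole corpus is never read.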
Code Structure
- commands/: folder that contains all shell files.
- data/: folder that contains all datasets.
- llm_retrieval_prompt_drafts/: folder that contains all prompt files.
- llm_retrieval_related/: folder that contains code for iteratively selecting supporting documents.
- multi_process/: folder that contains code for BM25 retrieval with multi-process support.
- openai_account_files/: folder that contains all OpenAI account files.
- prompts/: folder that contains all instruction and demonstration files.
- eval.py: script for evaluating generations.
- Iterative_retrieval.py: code for reproducing our method.
- llm.py: code for calling the LLM.
- multi_thread_openai_api_call.py: code for calling gpt-3.5-turbo with multiple threads.
- searcher.py: code for retrieval using TfidfVectorizer.
- run.py: script for generating citations.
- utils.py: file that contains auxiliary functions.
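To give a rough idea of what "iteratively selecting supporting documents" can look like, here is a minimal sketch. It is not the code in llm_retrieval_related/: the function names, the window-based loop, and the word-overlap selector (a toy stand-in for an LLM call) are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

# A selector takes (question, candidate pool, k) and returns up to k documents.
Selector = Callable[[str, List[str], int], List[str]]


def iterative_select(
    question: str,
    candidates: Sequence[str],
    select_fn: Selector,
    k: int = 5,
    window: int = 10,
) -> List[str]:
    """Scan candidates window by window, keeping at most k documents each round."""
    selected: List[str] = []
    for start in range(0, len(candidates), window):
        # Current picks compete with the next window of candidates.
        pool = selected + list(candidates[start:start + window])
        selected = select_fn(question, pool, k)[:k]
    return selected


def overlap_selector(question: str, pool: List[str], k: int) -> List[str]:
    """Toy stand-in for an LLM: rank documents by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        pool,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]
```

In a real setup, `select_fn` would wrap a prompted LLM call; keeping it as a parameter makes the loop easy to test without API access.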
Reproduce Our Method
NOTE: Both the raw data and a retrieval corpus must be available before you run the following commands. Once you have them, modify the parameters in the corresponding files in the commands/ directory.
For ASQA, use the following command:
bash commands/asqa_iterative_retrieval.sh
For QAMPARI, use the following command:
bash commands/qampari_iterative_retrieval.sh
For ELI5, use the following command:
bash commands/eli5_iterative_retrieval.sh
The results will be saved in iter_retrieval_50/.
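If you prefer to launch the three runs from Python, for example from a larger experiment script, a thin wrapper around the shell files is enough. This is only a sketch: the script paths come from the commands above, while `SCRIPTS`, `build_command`, and `run_all` are hypothetical names that do not exist in this repository.

```python
import subprocess
from typing import List, Sequence

# Dataset name -> shell script, as listed in this README.
SCRIPTS = {
    "asqa": "commands/asqa_iterative_retrieval.sh",
    "qampari": "commands/qampari_iterative_retrieval.sh",
    "eli5": "commands/eli5_iterative_retrieval.sh",
}


def build_command(dataset: str) -> List[str]:
    """Return the argv list that runs the iterative-retrieval script for a dataset."""
    key = dataset.lower()
    if key not in SCRIPTS:
        raise ValueError(f"unknown dataset {dataset!r}; expected one of {sorted(SCRIPTS)}")
    return ["bash", SCRIPTS[key]]


def run_all(datasets: Sequence[str] = ("asqa", "qampari", "eli5")) -> None:
    """Run the scripts one after another, stopping at the first failure."""
    for name in datasets:
        subprocess.run(build_command(name), check=True)
```

Call `run_all()` from the repository root so that the relative script paths resolve; `check=True` raises `subprocess.CalledProcessError` if any script exits with a non-zero status.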
Citation
@inproceedings{li-etal-2024-llatrieval,
title = "{LL}atrieval: {LLM}-Verified Retrieval for Verifiable Generation",
author = "Li, Xiaonan and
Zhu, Changtai and
Li, Linyang and
Yin, Zhangyue and
Sun, Tianxiang and
Qiu, Xipeng",
editor = "Duh, Kevin and
Gomez, Helena and
Bethard, Steven",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.naacl-long.305",
pages = "5453--5471",
}