GLEN: Generative Retrieval via Lexical Index Learning (EMNLP 2023)

This is the official code for the EMNLP 2023 paper "GLEN: Generative Retrieval via Lexical Index Learning".

Overview

GLEN (Generative retrieval via LExical iNdex learning) is a generative retrieval model that learns to dynamically assign lexical identifiers using a two-phase index learning strategy.

The poster and slides are available at their respective links: poster and slide. We also provide blog posts (in Korean) here. Please refer to the paper for more details: arXiv or ACL Anthology.

Environment

We have confirmed that the results reproduce with python==3.8.12, transformers==4.15.0, and pytorch==1.10.0 with cuda 12.0. Please create a conda environment and install the required packages from requirements.txt.

# Clone this repo
git clone https://github.com/skleee/GLEN.git
cd GLEN

# Set conda environment
conda create -n glen python=3.8
conda activate glen

# Install tevatron as editable
pip install --editable .

# Install dependencies 
pip install -r requirements.txt
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html

Optionally, you can also install GradCache to use the gradient cache feature while training ranking-based ID refinement:

git clone https://github.com/luyug/GradCache
cd GradCache
pip install .

Dataset

Datasets can be downloaded from: NQ320k, MS MARCO Passage Ranking set, BEIR.
After downloading each folder, unzip it into the data folder. The structure of each folder is as follows.

data
├── BEIR_dataset
│   ├── arguana
│   └── nfcorpus
├── nq320k
└── marco_passage
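
Before training, it can help to confirm that the unzipped folders match the layout above. The following is a minimal sketch, not part of the repository: `check_glen_data` is a hypothetical helper name, and it assumes only the folder names shown in the tree.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with GLEN): report which of the expected
# dataset folders exist under a data root. Folder names follow the tree above.
check_glen_data() {
    root="${1:-data}"   # data root, defaults to ./data
    missing=0
    for dir in BEIR_dataset/arguana BEIR_dataset/nfcorpus nq320k marco_passage; do
        if [ -d "$root/$dir" ]; then
            echo "ok       $root/$dir"
        else
            echo "missing  $root/$dir"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every expected folder is present.
    [ "$missing" -eq 0 ]
}
```

Calling `check_glen_data data` from the repository root prints one line per folder and returns a non-zero status if any folder is absent.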

Training

The training process consists of two phases: (1) Keyword-based ID assignment and (2) Ranking-based ID refinement. In the /examples folder, we provide GLEN code for each phase: glen_phase1, glen_phase2. Please refer to src/tevatron for the trainer. Run the scripts to train GLEN from scratch on NQ320k or MS MARCO.

NQ320k

# (1) Keyword-based ID assignment
sh scripts/train_glen_p1_nq.sh
# (2) Ranking-based ID refinement
sh scripts/train_glen_p2_nq.sh

MS MARCO

# (1) Keyword-based ID assignment
sh scripts/train_glen_p1_marco.sh
# (2) Ranking-based ID refinement
sh scripts/train_glen_p2_marco.sh
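
The two phases can also be chained so that phase 2 only starts once phase 1 has finished successfully. The wrapper below is a sketch under that assumption: `train_glen` and the `DRY_RUN` variable are hypothetical names introduced here, while the script paths are the ones listed above.

```shell
#!/bin/sh
# Hypothetical wrapper (not shipped with GLEN): run both training phases for
# one dataset in order, stopping as soon as a phase fails.
# Set DRY_RUN=1 to print the commands instead of executing them.
train_glen() {
    dataset="$1"   # "nq" or "marco"
    case "$dataset" in
        nq|marco) ;;
        *) echo "unknown dataset: $dataset (expected nq or marco)" >&2; return 2 ;;
    esac
    for phase in p1 p2; do
        script="scripts/train_glen_${phase}_${dataset}.sh"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "sh $script"
        else
            sh "$script" || return 1
        fi
    done
}
```

For example, `train_glen nq` runs the two NQ320k scripts back to back, and `DRY_RUN=1` lets you inspect the commands first.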

You can directly download our trained checkpoints for each stage from the following links: NQ320k, MS MARCO.

Evaluation

The evaluation process consists of two stages: (1) Document processing via making document identifiers and (2) Query processing via inference.

Run the scripts to evaluate GLEN on each dataset.

NQ320k

sh scripts/eval_make_docid_glen_nq.sh
sh scripts/eval_inference_query_glen_nq.sh

MS MARCO

sh scripts/eval_make_docid_glen_marco.sh
sh scripts/eval_inference_query_glen_marco.sh

BEIR

# Arguana
sh scripts/eval_make_docid_glen_arguana.sh
sh scripts/eval_inference_query_glen_arguana.sh
# NFCorpus
sh scripts/eval_make_docid_glen_nfcorpus.sh
sh scripts/eval_inference_query_glen_nfcorpus.sh 
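
To evaluate several datasets in one go, the two stages can be looped per dataset, with document identifiers built before query inference. This is a sketch only: `eval_glen` and `DRY_RUN` are hypothetical names, and the script paths follow the naming pattern of the commands above.

```shell
#!/bin/sh
# Hypothetical wrapper (not shipped with GLEN): for each dataset, build the
# document identifiers and then run query inference, stopping on failure.
# Set DRY_RUN=1 to print the commands instead of executing them.
eval_glen() {
    for dataset in "$@"; do
        case "$dataset" in
            nq|marco|arguana|nfcorpus) ;;
            *) echo "unknown dataset: $dataset" >&2; return 2 ;;
        esac
        for stage in make_docid inference_query; do
            script="scripts/eval_${stage}_glen_${dataset}.sh"
            if [ "${DRY_RUN:-0}" = "1" ]; then
                echo "sh $script"
            else
                sh "$script" || return 1
            fi
        done
    done
}
```

For example, `eval_glen arguana nfcorpus` covers both BEIR datasets listed above.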

Acknowledgement

Our code is mainly based on Tevatron. We also learned a lot from NCI, Transformers, and BEIR. We thank all the authors for sharing their code.

Citation

If you find this work useful for your research, please cite our paper:

@inproceedings{lee-etal-2023-glen,
    title = "{GLEN}: Generative Retrieval via Lexical Index Learning",
    author = "Lee, Sunkyung  and
      Choi, Minjin  and
      Lee, Jongwuk",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.477",
    doi = "10.18653/v1/2023.emnlp-main.477",
    pages = "7693--7704",
}

Contacts

For any questions, please contact the following authors via email or feel free to open an issue 😊