RA-IT-NER

This is the GitHub repository for the paper Retrieval Augmented Instruction Tuning for Open NER with Large Language Models.

Introduction

<img style="width:35%;" align="right" src="assets/method.png"/>

Installation

We run this repository based on the following dependencies:

python==3.11.5
pytorch==2.3.0
transformers==4.41.2
peft==0.11.1
openai==1.21.2
flash_attn==2.5.9
vllm==0.4.3

You will also need these dependencies:

numpy tqdm rich datasets Jinja jieba pandas pyarrow
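
As a quick reference, the environment could be set up with pip roughly as follows. This is a minimal sketch based on the versions listed above; adjust the PyTorch/CUDA build to your hardware, and note that flash_attn usually needs PyTorch installed first.

# Optional: create an isolated environment
conda create -n ra-it-ner python=3.11.5 -y
conda activate ra-it-ner

# Core dependencies (versions as listed above)
pip install torch==2.3.0 transformers==4.41.2 peft==0.11.1 openai==1.21.2 vllm==0.4.3
pip install flash_attn==2.5.9 --no-build-isolation

# Additional utilities (jinja2 is the PyPI name of Jinja)
pip install numpy tqdm rich datasets jinja2 jieba pandas pyarrow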

Data

Chinese OpenNER Data Construction

We release Sky-NER, an instruction-tuning dataset constructed for Chinese OpenNER based on the SkyPile corpus, following the recipe of UniversalNER.

We also release the code of our data construction pipeline in data_process.

Training and Evaluation Datasets

We provide the processed training and evaluation data used in our paper on Google Drive, including the sampled 5K and 10K training datasets. You can download the data package, unzip it, and put the content in the data folder.
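
For example, assuming the downloaded archive is named ra_it_data.zip (a placeholder; use the actual file name from Google Drive), the data folder could be prepared like this:

# Unzip the downloaded package and place its content in the data folder
mkdir -p data
unzip ra_it_data.zip -d data/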

The code for generating RA-IT data and preprocessing the benchmarks can all be found in data_process.

Models

We release our models fine-tuned with the proposed RA-IT approach, RA-IT-NER-8B and RA-IT-NER-zh-7B, which are trained on the English NER dataset Pile-NER and the Chinese NER dataset Sky-NER respectively.

| Model | Language | Backbone | Training data |
| --- | --- | --- | --- |
| RA-IT-NER-8B | English | Llama-3-8B | Pile-NER |
| RA-IT-NER-zh-7B | Chinese | Qwen-1.5-7B | Sky-NER |

Demo

The inference code is based on vLLM.

Please download our fine-tuned models RA-IT-NER-8B and RA-IT-NER-zh-7B and put them in your model directory before running the demos.
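
For instance, the models could be fetched with the huggingface_hub CLI. The repository IDs below are placeholders; substitute the actual model repositories linked above.

models=${your_model_dir}
# Placeholders: replace with the actual RA-IT-NER model repository IDs
huggingface-cli download ${ra_it_ner_8b_repo} --local-dir ${models}/RA-IT-NER-8B
huggingface-cli download ${ra_it_ner_zh_7b_repo} --local-dir ${models}/RA-IT-NER-zh-7B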

The following commands for running the demos can also be found in the bash scripts in serve.

Our model RA-IT-NER supports inference with and without RAG.

Gradio Web UI

Use the following command to launch a Gradio demo locally:

models=${your_model_dir}
python src/serve/gradio_server.py \
    --model_path ${models}/RA-IT-NER-8B \
    --tensor_parallel_size 1 \
    --max_input_length 2048 \
    --language en

CLI Inference

Use the following command to run inference with vLLM:

models=${your_model_dir}
python src/serve/cli.py \
    --model_path ${models}/RA-IT-NER-8B \
    --tensor_parallel_size 1 \
    --max_input_length 2048 \
    --language en

Use the following command to run inference with Hugging Face Transformers:

models=${your_model_dir}
python src/serve/hf.py \
    --model_path ${models}/RA-IT-NER-8B \
    --max_new_tokens 256 \
    --language en

Finetuning

All code and bash scripts for finetuning and evaluation can be found in the llm_tuning folder.

We use Llama-Factory to fine-tune our models.

Please prepare the data and the backbone model before finetuning: generate the RA-IT datasets using the code here, or download the processed RA-IT training data from Google Drive. Then download the base model from Hugging Face.
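
For example, the backbone models listed in the table above could be downloaded with the huggingface_hub CLI (the repository IDs are the public Hugging Face ones; Llama-3 requires accepting the license on its model page first):

models=${your_model_dir}
# Backbone for the English model
huggingface-cli download meta-llama/Meta-Llama-3-8B --local-dir ${models}/Meta-Llama-3-8B
# Backbone for the Chinese model
huggingface-cli download Qwen/Qwen1.5-7B --local-dir ${models}/Qwen1.5-7B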

Run the bash scripts for finetuning:

# Training with RA-IT
sh src/llm_tuning/bash_scripts/train_skyner_RA_IT.sh
# Training with Vanilla IT
sh src/llm_tuning/bash_scripts/train_skyner_vanilla_IT.sh

We also provide scripts for training with various retrieval strategies in the llm_tuning folder.

Evaluation

The evaluation code and bash scripts are likewise located in the llm_tuning folder.

Please prepare the benchmark data before evaluation: download the processed benchmark data from Google Drive, or process new benchmarks with the code in data_process.

Our evaluation code is adapted from UniversalNER.

Run the bash scripts for evaluating:

# Evaluation of RA-IT model
sh src/llm_tuning/bash_scripts/eval_skyner_RA_IT.sh
# Evaluation of Vanilla IT model
sh src/llm_tuning/bash_scripts/eval_skyner_vanilla_IT.sh

For inference with various retrieval strategies, see the additional commands in the script eval_skyner_RA_IT.sh: uncomment the commands for the retrieval strategies you'd like to evaluate, then run the script.

Acknowledgement

This repository is built upon the excellent work of UniversalNER and Llama-Factory. The corpus data preprocessing partially references MINI_LLM. We thank them for their open-source contributions.

Citation

@misc{xie2024retrievalaugmentedinstructiontuning,
      title={Retrieval Augmented Instruction Tuning for Open NER with Large Language Models}, 
      author={Tingyu Xie and Jian Zhang and Yan Zhang and Yuanyuan Liang and Qi Li and Hongwei Wang},
      year={2024},
      eprint={2406.17305},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.17305}, 
}