<p align="center"> <br> <img src="assets/logo_tax.png" style="height: 80px;"> <h2 align="center">Beyond Boundaries: Learning a Universal Entity Taxonomy <br> across Datasets and Languages for Open Named Entity Recognition <br> (B<sup>2</sup>NER) </h2> </p> <p align="center"> <a href="https://github.com/UmeanNever/B2NER/blob/main/LICENSE"><img alt="GitHub license" src="https://img.shields.io/github/license/UmeanNever/B2NER"></a> <a href="http://arxiv.org/abs/2406.11192"><img alt="Paper" src="https://img.shields.io/badge/📖-Paper-red"></a> <a href="https://huggingface.co/datasets/Umean/B2NERD"><img alt="Data" src="https://img.shields.io/badge/📀-Data-blue"></a> <a href="https://huggingface.co/Umean/B2NER-Internlm2-20B-LoRA"><img alt="Model" src="https://img.shields.io/badge/💾-Model-yellow"></a> </p>

We present B2NERD, a cohesive and efficient dataset refined from 54 existing English and Chinese datasets that improves LLMs' generalization on the challenging Open NER task. Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods on 3 out-of-domain benchmarks spanning 15 datasets and 6 languages.

Feature Highlights:

| Model | Avg. F1 on OOD English datasets | Avg. F1 on OOD Chinese datasets | Avg. F1 on OOD multilingual dataset |
|---|---|---|---|
| Previous SoTA | 69.1 | 42.7 | 36.6 |
| GPT-4 | 60.1 | 54.7 | 31.8 |
| B2NER | 72.1 | 61.3 | 43.3 |

## Release 📆

### Data (B2NERD)

One of the paper's core contributions is the construction of the B2NERD dataset, a cohesive and efficient collection refined from 54 English and Chinese datasets and designed for Open NER model training. The preprocessed test datasets (7 for Chinese NER and 7 for English NER) used for Open NER OOD evaluation in our paper are also included in the release to facilitate convenient evaluation in future research.

We provide 3 versions of our dataset.

You can download the data from HuggingFace or Google Drive.
Please ensure that you have the proper licenses to access the raw datasets in our collection.
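
For example, a minimal way to fetch the dataset from HuggingFace using `huggingface_hub` (the local directory below is just an example path):

```python
# Minimal sketch for downloading B2NERD from HuggingFace.
# Assumes `pip install huggingface_hub`; local_dir is an arbitrary example path.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="Umean/B2NERD",
    repo_type="dataset",
    local_dir="./data/B2NERD",
)
print(f"Dataset files downloaded to {local_path}")
```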

Below are the dataset statistics and source datasets for the B2NERD dataset.

| Split | Lang. | Datasets | Types | Selected Num. in B2NERD | Raw Num. in B2NERD_all |
|---|---|---|---|---|---|
| Train | En | 19 | 119 | 25,403 | 838,648 |
| | Zh | 21 | 222 | 26,504 | 580,513 |
| | Total | 40 | 341 | 51,907 | 1,419,161 |
| Test | En | 7 | 85 | - | 6,466 |
| | Zh | 7 | 60 | - | 14,257 |
| | Total | 14 | 145 | - | 20,723 |
<img src="assets/collected_datasets.png" alt="Collected Datasets" width="400" height="auto">

More dataset information can be found in the Appendix of the paper.

## Quick Demo With B2NER Models

You can directly download our trained LoRA adapters (less than 50MB) and run a demo by following the instructions in the Sample Usage - Quick Demo subsection below.

### Model Checkpoints (LoRA Adapters)

Here we provide the trained LoRA adapters, which can be applied to InternLM2-20B and InternLM2.5-7B, respectively.

We have observed that the official weights and model files of InternLM2 were recently updated. Our LoRA adapters, however, were trained on the initial release of InternLM2 from January 2024. To ensure future compatibility and ease of use, we also provide retrained LoRA adapters based on the current version of InternLM2/2.5 (as of July 2024). Please check the version of your backbone model's weights before applying the adapters.
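
If you want to make sure the backbone weights match a given adapter, one option is to pin the backbone to a specific revision when loading it. A sketch (the revision string below is a placeholder; look up the actual commit hash or tag on the model's HuggingFace page):

```python
# Sketch: pin the backbone to a fixed revision so its weights match the
# adapter's training-time weights. "main" is a placeholder; replace it with
# the specific commit hash or tag you need.
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-20b",
    revision="main",
    trust_remote_code=True,
)
```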

### Sample Usage - Quick Demo

Here we show how to use the provided LoRA adapter to run a quick demo with customized input. You can also refer to `src/demo.ipynb` to see our examples and reuse them for your own demo.

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer, use your own path/name
base_model_path = "/path/to/backbone_model"
base_model = AutoModelForCausalLM.from_pretrained(base_model_path,
                                                  trust_remote_code=True, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)

# Load and apply the PEFT model, point weight path to your own directory where an adapter_config.json is located
lora_weight_path = "/path/to/adapter"
config = PeftConfig.from_pretrained(lora_weight_path)
model = PeftModel.from_pretrained(base_model, lora_weight_path, torch_dtype=torch.bfloat16)

## English Example ##
# Input your own text and target entity labels. The model will extract entities inside the provided label set from the text.
text = "what is a good 1990 s romance movie starring kelsy grammer"
labels = ["movie genre", "year or time period", "movie title", "movie actor", "movie age rating"]

instruction_template_en = "Given the label set of entities, please recognize all the entities in the text. The answer format should be \"entity label: entity; entity label: entity\". \nLabel Set: {labels_str} \n\nText: {text} \nAnswer:"
labels_str = ", ".join(labels)
final_instruction = instruction_template_en.format(labels_str=labels_str, text=text)
inputs = tokenizer([final_instruction], return_tensors="pt")
output = model.generate(**inputs, max_length=500)
generated_text = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(generated_text.split("Answer:")[-1])
# year or time period: 1990 s; movie genre: romance; movie actor: kelsy grammer


## Chinese Example ##
# Input your own text and target entity labels. The model will extract entities inside the provided label set from the text.
text = "暴雪中国时隔多年之后再次举办了官方比赛,而Moon在星际争霸2中发挥不是很理想,对此Infi感觉Moon是哪里出了问题呢?"
labels = ["人名", "作品名->文字作品", "作品名->游戏作品", "作品名->影像作品", "组织机构名->政府机构", "组织机构名->公司", "组织机构名->其它", "地名"]

instruction_template_zh = "给定实体的标签范围,请识别文本中属于这些标签的所有实体。答案格式为 \"实体标签: 实体; 实体标签: 实体\"。\n标签范围: {labels_str}\n\n文本: {text} \n答案:"
labels_str = ", ".join(labels)
final_instruction = instruction_template_zh.format(labels_str=labels_str, text=text)
inputs = tokenizer([final_instruction], return_tensors="pt")
output = model.generate(**inputs, max_length=500)
generated_text = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(generated_text.split("答案:")[-1])
# 组织机构名->公司: 暴雪中国; 人名: Moon; 作品名->游戏作品: 星际争霸2; 人名: Infi
```
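
The model answers in the flat "entity label: entity; entity label: entity" format defined by the instruction template above. If you need structured output, here is a minimal parsing sketch (the splitting rules are our assumption based on that format, not an official utility):

```python
# Sketch: parse the "label: entity; label: entity" answer format into
# (label, entity) pairs. Assumes ";" separates pairs and the first ":"
# in each pair separates the label from the entity.
def parse_answer(answer: str) -> list[tuple[str, str]]:
    pairs = []
    for chunk in answer.strip().split(";"):
        if ":" in chunk:
            label, entity = chunk.split(":", 1)
            pairs.append((label.strip(), entity.strip()))
    return pairs

print(parse_answer("year or time period: 1990 s; movie genre: romance"))
# [('year or time period', '1990 s'), ('movie genre', 'romance')]
```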

## Code Usage for Training and Inference

We build our code by generally following and updating InstructUIE's repo.

### Requirements

Our main experiments are conducted on a single node with eight NVIDIA A100 40G GPUs. We also use a single eight-card H20 node for some supplementary experiments. The environment is built as follows.

Install dependencies via

```bash
pip install -r requirements.txt
```

If you encounter issues when generating inference results on H20 nodes, try updating torch, e.g.:

```bash
pip3 install --pre torch==2.4.0dev20240610 torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu124
```

In our basic testing, our environment is compatible with recent backbone LLMs such as Llama2, InternLM2/2.5, and Qwen2.

### Sample Usage - Batch Inference

Here is an example of using the provided LoRA adapter to run inference on the test datasets of B2NERD:

```bash
cd B2NER
bash ./scripts/eval_lora_internlm2.sh
```

The decoded inference results will be saved to `predict_eval_predictions.jsonl` in your output directory.
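
Each line of the JSONL file is one prediction record; a quick way to inspect a few (the field names vary with the script's output schema, so check the actual file):

```python
# Sketch: peek at the first few prediction records in the output JSONL.
# The exact fields (instruction, prediction, gold labels, ...) depend on
# the evaluation script; print a record to see the real schema.
import json

with open("predict_eval_predictions.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(json.loads(line))
        if i >= 2:
            break
```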

### Evaluation & Sample Predictions

Results/metrics are automatically computed by our script and can be found in the `report` folder inside the output directory.

You can also manually calculate the metrics for arbitrary predictions using

```bash
cd src/
python calculate_f1.py --root /path/to/predict_eval_predictions.jsonl
```

We provide sample prediction results for our 7B and 20B models in `/sample_predictions`.
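
For reference, the entity-level micro-F1 used in NER evaluation can be illustrated with the following simplified sketch (it shows the standard metric over (label, entity) pairs, not the exact logic of `calculate_f1.py`):

```python
# Sketch: entity-level micro-F1 over sets of (label, entity) pairs.
# calculate_f1.py may differ in details such as normalization or
# duplicate handling; this only illustrates the metric itself.
def micro_f1(gold: list[set], pred: list[set]) -> float:
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

gold = [{("movie genre", "romance"), ("movie actor", "kelsy grammer")}]
pred = [{("movie genre", "romance")}]
print(micro_f1(gold, pred))  # 0.666...: precision 1.0, recall 0.5
```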

### Sample Usage - Training

Training follows steps similar to inference.

```bash
cd B2NER
bash ./scripts/train_lora_internlm2_bilingual_full.sh
```

This script runs training and evaluation sequentially for multiple runs (with different random seeds). We generate predictions for each training epoch; you can find them in the output directory under paths like `eval_x/predict_eval_predictions.jsonl`. F1 scores are calculated automatically by the script.

For each run (random seed), results on each test dataset, from the predictions at each epoch, can be found in `agg.csv` in the output directory. You can also manually run the calculation for a specific output directory using

```bash
cd src/
python calculate_f1.py --root /path/to/output_dir
```

Final average results can be computed by averaging the metrics at a certain epoch (say, the last epoch) across multiple runs.
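
For example, with pandas (assuming each run directory contains an `agg.csv` whose last row corresponds to the final epoch; the directory names and layout are illustrative):

```python
# Sketch: average final-epoch metrics across runs with different seeds.
# Directory names and the agg.csv layout are assumptions; adapt them to
# your actual output structure.
import pandas as pd

run_dirs = ["output/run_seed1", "output/run_seed2", "output/run_seed3"]
last_rows = [pd.read_csv(f"{d}/agg.csv").iloc[-1] for d in run_dirs]
print(pd.DataFrame(last_rows).mean(numeric_only=True))
```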

Customized training can be done by changing `TASK_CONFIG_DIR` in the training script, which specifies the train/test datasets. For instance, you can train a different model for cross-lingual experiments on the Multiconer22 dataset with the script `/scripts/train_lora_internlm2_crosslingual.sh`.

Note that our experiments use the InternLM2 weights initially released in January 2024. The official InternLM2 weights have since been updated, and we have not fully experimented with them; you may need to adjust some default hyperparameters to achieve the best performance.

## Extension to Other IE Tasks

Since we follow the instruction and dataset format of InstructUIE, RE and EE datasets can also be combined with B2NERD to train a unified model. Although this is not the primary focus of our work, our code supports such UIE model training.

To do this, simply replace `TASK_CONFIG_DIR` with a new task config that includes RE and EE tasks, and prepare the IE datasets in the required format (i.e., reuse IE_Instructions from InstructUIE). You may refer to the preprocessing code for RE and EE in `b2ner_dataset.py`. This way, you can leverage the benefits of our B2NERD data in an LLM for broader IE tasks.
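
As a rough sketch of what such a config might look like, here is a hypothetical task config in InstructUIE's JSON style, written from Python (the task keys, dataset names, and output path are placeholders to adapt to your prepared data):

```python
# Sketch: write a hypothetical task config that adds RE and EE tasks
# alongside NER. The dataset names and path below are placeholders;
# match them to the datasets you prepared in IE_Instructions format.
import json

task_config = {
    "NER": [{"sampling strategy": "random", "dataset name": "your_ner_dataset"}],
    "RE": [{"sampling strategy": "random", "dataset name": "your_re_dataset"}],
    "EE": [{"sampling strategy": "random", "dataset name": "your_ee_dataset"}],
}
with open("configs/my_uie_task_config/train_tasks.json", "w") as f:
    json.dump(task_config, f, indent=2)
```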

## Cite

```bibtex
@article{yang2024beyond,
  title={Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition},
  author={Yang, Yuming and Zhao, Wantong and Huang, Caishuang and Ye, Junjie and Wang, Xiao and Zheng, Huiyuan and Nan, Yang and Wang, Yuran and Xu, Xueying and Huang, Kaixin and others},
  journal={arXiv preprint arXiv:2406.11192},
  year={2024}
}
```