<h1 align="center">ParsNER 🦁</h1>

<br/><br/>

## Introduction

This repo contains pretrained models fine-tuned for the Named Entity Recognition (NER) task. The models were trained on a mixed NER dataset collected from ARMAN, PEYMA, and WikiANN, covering ten types of entities: Date (DAT), Event (EVE), Facility (FAC), Location (LOC), Money (MON), Organization (ORG), Percent (PCT), Person (PER), Product (PRO), and Time (TIM).

## Dataset Information

| Split | Records | B-DAT | B-EVE | B-FAC | B-LOC | B-MON | B-ORG | B-PCT | B-PER | B-PRO | B-TIM | I-DAT | I-EVE | I-FAC | I-LOC | I-MON | I-ORG | I-PCT | I-PER | I-PRO | I-TIM |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Train | 29133 | 1423 | 1487 | 1400 | 13919 | 417 | 15926 | 355 | 12347 | 1855 | 150 | 1947 | 5018 | 2421 | 4118 | 1059 | 19579 | 573 | 7699 | 1914 | 332 |
| Valid | 5142 | 267 | 253 | 250 | 2362 | 100 | 2651 | 64 | 2173 | 317 | 19 | 373 | 799 | 387 | 717 | 270 | 3260 | 101 | 1382 | 303 | 35 |
| Test | 6049 | 407 | 256 | 248 | 2886 | 98 | 3216 | 94 | 2646 | 318 | 43 | 568 | 888 | 408 | 858 | 263 | 3967 | 141 | 1707 | 296 | 78 |

**Download:** You can download the dataset from here.

## Evaluation

The following tables summarize the scores obtained by the pretrained models, overall and per class.

| Model | accuracy | precision | recall | f1 |
|:---:|:---:|:---:|:---:|:---:|
| Bert | 0.995086 | 0.953454 | 0.961113 | 0.957268 |
| Roberta | 0.994849 | 0.949816 | 0.960235 | 0.954997 |
| Distilbert | 0.994534 | 0.946326 | 0.955040 | 0.950663 |
| Albert | 0.993405 | 0.938907 | 0.943966 | 0.941429 |

### Bert

| entity | number | precision | recall | f1 |
|:---:|:---:|:---:|:---:|:---:|
| DAT | 407 | 0.860636 | 0.864865 | 0.862745 |
| EVE | 256 | 0.969582 | 0.996094 | 0.982659 |
| FAC | 248 | 0.976190 | 0.991935 | 0.984000 |
| LOC | 2884 | 0.970232 | 0.971914 | 0.971072 |
| MON | 98 | 0.905263 | 0.877551 | 0.891192 |
| ORG | 3216 | 0.939125 | 0.954602 | 0.946800 |
| PCT | 94 | 1.000000 | 0.968085 | 0.983784 |
| PER | 2645 | 0.965244 | 0.965974 | 0.965608 |
| PRO | 318 | 0.981481 | 1.000000 | 0.990654 |
| TIM | 43 | 0.692308 | 0.837209 | 0.757895 |

### Roberta

| entity | number | precision | recall | f1 |
|:---:|:---:|:---:|:---:|:---:|
| DAT | 407 | 0.844869 | 0.869779 | 0.857143 |
| EVE | 256 | 0.948148 | 1.000000 | 0.973384 |
| FAC | 248 | 0.957529 | 1.000000 | 0.978304 |
| LOC | 2884 | 0.965422 | 0.968100 | 0.966759 |
| MON | 98 | 0.937500 | 0.918367 | 0.927835 |
| ORG | 3216 | 0.943662 | 0.958333 | 0.950941 |
| PCT | 94 | 1.000000 | 0.968085 | 0.983784 |
| PER | 2646 | 0.957030 | 0.959562 | 0.958294 |
| PRO | 318 | 0.963636 | 1.000000 | 0.981481 |
| TIM | 43 | 0.739130 | 0.790698 | 0.764045 |

### Distilbert

| entity | number | precision | recall | f1 |
|:---:|:---:|:---:|:---:|:---:|
| DAT | 407 | 0.812048 | 0.828010 | 0.819951 |
| EVE | 256 | 0.955056 | 0.996094 | 0.975143 |
| FAC | 248 | 0.972549 | 1.000000 | 0.986083 |
| LOC | 2884 | 0.968403 | 0.967060 | 0.967731 |
| MON | 98 | 0.925532 | 0.887755 | 0.906250 |
| ORG | 3216 | 0.932095 | 0.951803 | 0.941846 |
| PCT | 94 | 0.936842 | 0.946809 | 0.941799 |
| PER | 2645 | 0.959818 | 0.957278 | 0.958546 |
| PRO | 318 | 0.963526 | 0.996855 | 0.979907 |
| TIM | 43 | 0.760870 | 0.813953 | 0.786517 |

### Albert

| entity | number | precision | recall | f1 |
|:---:|:---:|:---:|:---:|:---:|
| DAT | 407 | 0.820639 | 0.820639 | 0.820639 |
| EVE | 256 | 0.936803 | 0.984375 | 0.960000 |
| FAC | 248 | 0.925373 | 1.000000 | 0.961240 |
| LOC | 2884 | 0.960818 | 0.960818 | 0.960818 |
| MON | 98 | 0.913978 | 0.867347 | 0.890052 |
| ORG | 3216 | 0.920892 | 0.937500 | 0.929122 |
| PCT | 94 | 0.946809 | 0.946809 | 0.946809 |
| PER | 2644 | 0.960000 | 0.944024 | 0.951945 |
| PRO | 318 | 0.942943 | 0.987421 | 0.964670 |
| TIM | 43 | 0.780488 | 0.744186 | 0.761905 |
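
The per-class tables above have the shape of a seqeval classification report; the repo does not name its metric implementation, so using seqeval here is an assumption. A minimal sketch of producing such numbers:

```python
# A minimal sketch, assuming per-class metrics like the tables above are
# computed with the seqeval library (the repo does not name its metric code).
from seqeval.metrics import classification_report, f1_score

# Toy gold and predicted tag sequences in IOB2 format.
y_true = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG", "O"]]

print(classification_report(y_true, y_pred))  # per-entity precision/recall/f1/support
print(f1_score(y_true, y_pred))               # overall micro-averaged f1
```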

## How To Use

You can use these models with the Transformers pipeline for NER.

### Installing requirements

```bash
pip install sentencepiece
pip install transformers
```

### How to predict using the pipeline

```python
from transformers import AutoTokenizer
from transformers import AutoModelForTokenClassification  # for PyTorch
from transformers import TFAutoModelForTokenClassification  # for TensorFlow
from transformers import pipeline

# model_name_or_path = "HooshvareLab/bert-fa-zwnj-base-ner"  # Bert
# model_name_or_path = "HooshvareLab/roberta-fa-zwnj-base-ner"  # Roberta
model_name_or_path = "HooshvareLab/distilbert-fa-zwnj-base-ner"  # Distilbert
# model_name_or_path = "HooshvareLab/albert-fa-zwnj-base-v2-ner"  # Albert

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

model = AutoModelForTokenClassification.from_pretrained(model_name_or_path)  # PyTorch
# model = TFAutoModelForTokenClassification.from_pretrained(model_name_or_path)  # TensorFlow

nlp = pipeline("ner", model=model, tokenizer=tokenizer)

# "He died in 2013, and The Undertaker and Kane held a memorial service for him."
example = "در سال ۲۰۱۳ درگذشت و آندرتیکر و کین برای او مراسم یادبود گرفتند."

ner_results = nlp(example)
print(ner_results)
```
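
The pipeline returns one record per sub-token with its B-/I- tag. If you want whole entity spans instead, the Transformers pipeline's standard `aggregation_strategy` argument can merge them; this is part of the library API, not something specific to this repo. A minimal sketch:

```python
# Optional: merge B-/I- sub-token predictions into whole entity spans.
# aggregation_strategy is a standard argument of the Transformers NER pipeline.
nlp_grouped = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
)
print(nlp_grouped(example))  # entries with entity_group, score, word, start, end
```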

## Models

All four fine-tuned models are available on the Hugging Face Model Hub:

- [HooshvareLab/bert-fa-zwnj-base-ner](https://huggingface.co/HooshvareLab/bert-fa-zwnj-base-ner)
- [HooshvareLab/roberta-fa-zwnj-base-ner](https://huggingface.co/HooshvareLab/roberta-fa-zwnj-base-ner)
- [HooshvareLab/distilbert-fa-zwnj-base-ner](https://huggingface.co/HooshvareLab/distilbert-fa-zwnj-base-ner)
- [HooshvareLab/albert-fa-zwnj-base-v2-ner](https://huggingface.co/HooshvareLab/albert-fa-zwnj-base-v2-ner)

## Training

All models were trained on a single NVIDIA P100 GPU with the following parameters.

### Arguments

```python
"task_name": "ner"
"model_name_or_path": model_name_or_path
"train_file": "/content/ner/train.csv"
"validation_file": "/content/ner/valid.csv"
"test_file": "/content/ner/test.csv"
"output_dir": output_dir
"cache_dir": "/content/cache"
"per_device_train_batch_size": 16
"per_device_eval_batch_size": 16
"use_fast_tokenizer": True
"num_train_epochs": 5.0
"do_train": True
"do_eval": True
"do_predict": True
"learning_rate": 2e-5
"evaluation_strategy": "steps"
"logging_steps": 1000
"save_steps": 1000
"save_total_limit": 2
"overwrite_output_dir": True
"fp16": True
"preprocessing_num_workers": 4
```

## Cite

Please cite this repository in publications as follows:

```bibtex
@misc{ParsNER,
  author = {Hooshvare Team},
  title = {Pre-Trained NER models for Persian},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/hooshvare/parsner}},
}
```

## Questions?

Post a GitHub issue on the [ParsNER Issues](https://github.com/hooshvare/parsner/issues) page.