Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition

Modern approaches to Named Entity Recognition (NER) use neural networks (NN) to automatically extract features from text and seamlessly integrate them with sequence taggers in an end-to-end fashion. Word embeddings, which are a side product of pretrained neural language models (LMs), are key ingredients for boosting the performance of NER systems. More recently, contextual word embeddings, which adapt according to the context in which the word appears, have proved to be an invaluable resource for improving NER systems. In this work, we assess how different combinations of (shallow) word embeddings and contextual embeddings impact NER for the Portuguese language. We present a comparative study of 16 different combinations of shallow and contextual embeddings and explore how the textual diversity and size of the training corpora used to build the LMs impact our NER results. We evaluate NER performance on the HAREM corpus. Our best NER system outperforms the state of the art in Portuguese NER by 5.99 absolute percentage points in F1. State-of-the-art results were evaluated with the CoNLL-2002 evaluation script.

Results for the Total Scenario (HAREM)

Approach | Precision | Recall | F1
BiLSTM-CRF+FlairBBP | 74.91% | 74.37% | 74.64%
BiLSTM-CRF (Castro et al.) | 72.28% | 68.03% | 70.33%
CharWNN (dos Santos et al.) | 67.16% | 63.74% | 65.41%

Results for the Selective Scenario (HAREM)

Approach | Precision | Recall | F1
BiLSTM-CRF+FlairBBP | 83.38% | 81.17% | 82.26%
BiLSTM-CRF (Castro et al.) | 78.26% | 74.39% | 76.27%
CharWNN (dos Santos et al.) | 73.98% | 68.68% | 71.23%
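
In both tables, F1 is the harmonic mean of precision and recall, as computed by the CoNLL-2002 evaluation script. A quick sanity check in Python, using values from the selective scenario table above:

# F1 is the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(83.38, 81.17), 2))  # 82.26 -> BiLSTM-CRF+FlairBBP, selective scenario
print(round(f1(73.98, 68.68), 2))  # 71.23 -> CharWNN, selective scenario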

Reproduce our tests for NER

Before you begin, you should download the Flair library. Flair is a powerful NLP library with state-of-the-art results, developed by Zalando Research. You can see all the details at this GitHub link.

STEP 1: Download our language model FlairBBP (backward and forward);

STEP 2: Clone this repository;

STEP 3: Install Flair. See how to install here;

STEP 4: Download NILC's word embeddings. You must download the Word2Vec Skip-Gram model with 300 dimensions and put the file inside the cloned folder;

STEP 5: Run our script: python3.6 ner_flair.py (a rough sketch of what the script sets up is shown below).
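
The sketch below shows, with the Flair 0.4.x-style API, how the FlairBBP forward/backward embeddings can be stacked with the NILC Word2Vec vectors and fed into a BiLSTM-CRF tagger. It is a minimal sketch of the setup, not the exact ner_flair.py script: corpus paths, file names and hyperparameters are placeholders.

# Minimal training sketch (placeholder paths and hyperparameters).
from flair.data import Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# CoNLL-style column corpus: token in column 0, NER tag in column 1.
columns = {0: 'text', 1: 'ner'}
corpus: Corpus = ColumnCorpus('./harem_data', columns,
                              train_file='train.txt',
                              dev_file='dev.txt',
                              test_file='test.txt')

# Shallow + contextual embeddings: NILC Word2Vec (300d) stacked with the
# FlairBBP forward and backward language models.
embeddings = StackedEmbeddings([
    WordEmbeddings('./word2vec_skpg_300d.gensim'),  # NILC vectors converted to a gensim file
    FlairEmbeddings('./flairBBP_forward.pt'),       # placeholder file name
    FlairEmbeddings('./flairBBP_backward.pt'),      # placeholder file name
])

tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

# BiLSTM-CRF sequence tagger on top of the stacked embeddings.
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type='ner',
                        use_crf=True)

trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/ner-portuguese', max_epochs=150)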

Tagging your Portuguese text with our NER model

Tag your text using our best NER model, which combines FlairBBP + NILC-Word2Vec-Skpg-300d embeddings. It recognizes the following categories: PERSON, LOCATION, ORGANIZATION, TIME and VALUE. You need to install the latest version of Flair.

STEP 1: Download our NER model (Download Here!);

STEP 2: Use pToolNER to label your text, as in the snippet below.

# The import path below is an assumption and may differ depending on how
# pToolNER is installed.
from pToolNER import PortugueseToolNER

pToolNER = PortugueseToolNER()

# Load the downloaded NER model (FlairBBP + NILC-Word2Vec-Skpg-300d).
pToolNER.loadNamedEntityModel('best-model.pt')

# Tag every .txt file under ./PredictablesFiles and write the tagged
# output to ./TaggedTexts.
pToolNER.sequenceTaggingOnText(
               rootFolderPath='./PredictablesFiles',
               fileExtension='.txt',
               useTokenizer=True,
               maskNamedEntity=False,
               createOutputFile=True,
               outputFilePath='./TaggedTexts',
               outputFormat='plain',
               createOutputListSpans=True
               )

Alternative use (we strongly recommend using pToolNER instead!):

STEP 1: Download our NER model (Download Here!);

STEP 2: Clone this repository;

STEP 3: Run our script python3.6 tagging_ner.py [input_file_name.txt] [output_file_name.txt] [mode]. Modes:
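
For reference, the sketch below shows how tagging works directly with the Flair API, which is roughly what a wrapper script like tagging_ner.py builds on; the model path and the example sentence are placeholders, not values taken from the repository.

# Minimal tagging sketch with the plain Flair API (placeholder paths).
from flair.data import Sentence
from flair.models import SequenceTagger

# Load the downloaded BiLSTM-CRF+FlairBBP model.
tagger = SequenceTagger.load('best-model.pt')

# Tag a raw Portuguese sentence.
sentence = Sentence('O Museu do Ipiranga fica em São Paulo.')
tagger.predict(sentence)

# Print the tokens with their predicted NER tags.
print(sentence.to_tagged_string())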

Language Models

Flair Embeddings - FlairBBP

You can download our Flair Embeddings models (FlairBBP) from the following links:

Word Embeddings

You can download our word embedding models from the following links; note that all models were trained with 300 dimensions:

Algorithm | Architecture | Downloads
Word2Vec | Skip-Gram | Word2Vec_skpg_300d
Word2Vec | CBOW | Word2Vec_cbow_300d
FastText | Skip-Gram | Fasttext_skpg_300d
FastText | CBOW | Fasttext_cbow_300d
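
If you want to inspect these 300-dimensional vectors outside Flair, loading them with gensim is usually enough. The sketch below assumes the downloaded file is in the standard word2vec text format; the file name and the query word are only examples.

# Sketch: load 300-d vectors with gensim (word2vec text format assumed).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format('Word2Vec_skpg_300d.txt', binary=False)

print(vectors.vector_size)                      # expected: 300
print(vectors.most_similar('brasil', topn=5))   # nearest neighbours of an example word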

NILC Word Embeddings

You can download the word embeddings provided by NILC from the following link: http://nilc.icmc.usp.br/embeddings

Language Models Corpora

BlogSet-BR

BlogSet-BR is a large corpus built from millions of sentences taken from Brazilian Portuguese web blogs.

brWaC

brWaC is another large Portuguese corpus, built from Brazilian Portuguese web pages.

ptwiki-20190301

ptwiki-20190301 is a corpus formed by texts from the Portuguese Wikipedia (the dump of March 1, 2019).

Language Model Corpora Size Details (after pre-processing):

Corpus | Sentences | Tokens
brWaC | 127,272,109 | 2,930,573,938
BlogSet-BR | 58,494,090 | 1,807,669,068
ptwiki-20190301 | 7,053,954 | 162,109,057
All Corpora | 192,820,153 | 4,900,352,063

Citing our Paper

@inproceedings{santos2019assessing,
  author    = {Joaquim Santos and
               Bernardo Consoli and
               Cicero dos Santos and
               Juliano Terra and
               Sandra Collonini and
               Renata Vieira},
  title     = {Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition},
  booktitle = {Proceedings of the 8th Brazilian Conference on Intelligent Systems},
  pages     = {437--442},
  year      = {2019}
}