Awesome
Portuguese-NLP
A list of resources and tools developed with a focus on Portuguese.
Datasets
- #PraCegoVer - multi-modal dataset with Portuguese captions based on posts from Instagram.
- 18th-century Portuguese medical texts
- AG_news pt - automatic translation of the AG's corpus of news articles.
- Alpaca data pt-br - Stanford Alpaca dataset translated into Brazilian Portuguese using the Helsinki-NLP/opus-mt-tc-big-en-pt model.
- AspectBR - Aspect-based annotated dataset of web consumer reviews.
- ASSIN - a dataset with semantic similarity score and entailment annotations. (HuggingFace)
- ASSIN 2 - second edition of ASSIN. (HuggingFace)
- Automated Essay Score (AES) ENEM Dataset - Benchmark for automatic essay scoring in Portuguese (HuggingFace)
- Aya Dataset PT - CohereForAI Aya Dataset filtered for Portuguese (PT).
- BlogSet-BR - a collection of posts gathered from the Blogspot platform written by Brazilian users.
- BLUEX - A benchmark based on Brazilian Leading Universities Entrance eXams.
- BoolQ - automatic translation of BoolQ.
- br-quad-2.0 - Stanford Question Answering Dataset (SQuAD) 2.0 translated to Brazilian Portuguese (PT-BR) language.
- Brands.Br - a Portuguese Reviews Corpus
- Brazilian Court Decisions - collection of 4043 Ementa (summary) court decisions and their metadata from the Tribunal de Justiça de Alagoas (TJAL), the State Supreme Court of Alagoas (Brazil).
- Brazilian E-Commerce - Brazilian E-Commerce Public Dataset by Olist store.
- Brazilian Headlines Sentiments - Dataset containing sentiment analysis of Brazilian news agencies headlines.
- Brazilian Portuguese Literature Corpus - 3.7 million word corpus of Brazilian literature published between 1840 and 1908.
- Brazilian Portuguese Narrative Essays Dataset - Dataset for Automatic Essay Scoring of Brazilian Portuguese Narrative Essays.
- Brazilian Portuguese Sentiment Analysis Datasets.
- Brazilian TCU's judgments - Judgments of Federal Court of Accounts - Brazil (TCU).
- BrWaC - Brazilian Portuguese Web as Corpus.
- BrWac2Wiki - a dataset for multi-document summarization in Portuguese.
- B2W-Reviews01 - product reviews.
- Canarim - A Large-Scale Dataset of Web Pages in the Portuguese Language (huggingface)
- Carolina - General Corpus of Contemporary Brazilian Portuguese (huggingface).
- Capes - parallel corpus of theses and dissertations abstracts in English and Portuguese.
- CC100-Portuguese - Created by Conneau & Wenzek et al. in 2020. This dataset is one of the 100 monolingual corpora processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository.
- CETENFolha - news from the newspaper Folha de S. Paulo.
- CHAVE - collection for Information Retrieval and Question Answering.
- CINTIL Corpus - a linguistically interpreted corpus of Portuguese.
- ClinicalNER - Clinical Named Entity Recognition in Portuguese.
- Textual Complexity for School Stages of the Brazilian Educational System.
- CORAA - dataset for Automatic Speech Recognition.
- CORAA SER - Emotion Recognition from Brazilian Portuguese Informal Spontaneous Speech.
- CrawlPT_dedup - CrawlPT (deduplicated) is composed of three corpora: brWaC, C100-PT, OSCAR-2301.
- CSTNews - a corpus with 50 clusters of news texts with their multi-document summaries, as well as several discourse and semantic annotations.
- C-ORAL-BRASIL - This project is dedicated to the study of Brazilian Portuguese spontaneous speech and, more broadly, to the compilation of spoken corpora.
- DANTEStocks - Corpus of stock market tweets written in Brazilian Portuguese and annotated with named entities according to HAREM's taxonomy.
- DEEPAGÉ - Answering Questions in Portuguese about the Brazilian Environment.
- DNLT-BP - Datasets of Neuropsychological Language Tests in Brazilian Portuguese.
- ENEM Challenge - Consists of the writing of an essay and an objective part containing 180 multiple choice questions.
- ENEM-2022 and ENEM-2023 - These projects encompass all multiple-choice questions from the last two editions of the Exame Nacional do Ensino Médio (ENEM), the main standardized entrance examination adopted by Brazilian universities.
- Essay-BR - a corpus of essays for the Brazilian Portuguese language.
- Extended Essay-BR - Extended version of the Essay-BR corpus.
- FACTCK.BR - A dataset to study Fake News in Portuguese.
- FactNews - dataset to predict sentence-level factuality of news reporting.
- fake voices - deepfakes in Brazilian Portuguese created with the XTTS model.
- Fake.Br - aligned true and fake news written in Brazilian Portuguese (HuggingFace).
- Central_de_fatos - (Huggingface).
- FakeNewsSet - (HuggingFace).
- Fakepedia-Corpus - fake news dataset.
- FakeRecogna - dataset comprised of real and fake news (Huggingface).
- FakeWhatsApp.Br - An annotated Corpus of WhatsApp messages in PT-BR for automatic detection of textual misinformation.
- FKTC - FaKe news Text Collections.
- Floresta Sintá(c)tica - treebank for Portuguese.
- HAREM first - evaluation contest for named entity recognizers in Portuguese.
- HAREM second - evaluation contest for named entity recognizers in Portuguese.
- HateBR - large-scale expert annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media.
- Historical Portuguese Corpora - tools and resources for manipulation of historical corpora and management of historical dictionaries.
- IMDB pt - Automatic translation of IMDB.
- InferBR - Natural Language Inference dataset.
- Iudicium Textum Dataset - contains legal documents created by Brazilian Federal Supreme Court in its integral composition (paper).
- LeNER-Br - a Dataset for Named Entity Recognition in Brazilian Legal Text.
- LegalPT_dedup - LegalPT (deduplicated) aggregates the maximum amount of publicly available legal data in Portuguese.
- Lex2Kids - lexicon of the Portuguese words most heard by children.
- Mac-Morpho - Brazilian Portuguese texts annotated with part-of-speech tags.
- MilkQA - a dataset of dense questions for the task of answer selection.
- Minutes of Central Bank of Brazil - Minutes of the Monetary Policy Committee of the Central Bank of Brazil.
- NER in Brazilian Portuguese tweets - Twitter messages in pt-br annotated for the entities PER, LOC and ORG.
- NERDE - Documents from CADE's jurisprudence annotated for the entities ORG, PER, TEMPO, LOC, LEG (legislation), DOCS (documents), VALOR.
- News-Crawl-PT - Monolingual News Crawl used for WMT.
- News of the site Folha de São Paulo - news of the Brazilian Newspaper Folha de São Paulo.
- News published in Brazil - news compilation of the Globo group.
- OAB exams - Brazilian version of the BAR exam (USA) (HuggingFace).
- Parallel Corpora from Revista Pesquisa FAPESP - Portuguese-English and Portuguese-Spanish bilingual collections of the online issues of the scientific news Brazilian magazine Revista Pesquisa FAPESP.
- NURC-SP
- Pirá - A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean.
- PL-corpus - part of the UlyssesNER-Br, a corpus of Brazilian Legislative Documents for NER with quality baselines.
- PLUE - Portuguese translation of the GLUE benchmark and Scitail dataset.
- POeTiSA - POrtuguese processing - Towards Syntactic Analysis and parsing.
- politiquices - Datasets related to the politiquices.pt project.
- PorSimplesSent - a corpus of aligned sentence pairs for investigating sentence readability assessment.
- PortiLexicon-UD - a lexicon for Brazilian Portuguese according to Universal Dependencies.
- Portuguese-Hate-Speech-Dataset - Portuguese dataset for hate speech detection composed of 5,668 tweets with binary annotations (i.e. 'hate' vs. 'no-hate') (HuggingFace)
- Portuguese Legal Sentences - Collection of Legal Sentences from the Portuguese Supreme Court of Justice.
- Portuguese Presidential Elections - This dataset contains tweets and users mostly from the Portuguese Twittersphere.
- PraCegoVer - multi-modal dataset containing images associated with Portuguese captions based on posts from Instagram.
- Priberam Fine-Grained Opinion Corpus - a Portuguese fine-grained dependency opinion mining corpus.
- Propbank - Contains instances annotated with semantic role labels (SRL).
- Projeto ACDC - Internet Access to Corpora.
- Puntuguese - A Corpus of Puns in Portuguese with Micro-editions (HuggingFace)
- QA-Portuguese - Adaptation from MQA dataset Portuguese split (QA entailment pairs).
- Quati - This dataset aims to support the development of Brazilian Portuguese (pt-br) Information Retrieval (IR) systems, providing document passages originally created in pt-br, as well as queries (topics) created by native speakers.
- REBEL-Portuguese - Relation datasets built from Wikipedia.
- ReLi - REsenha de LIvros (book reviews).
- RePro - A Benchmark Dataset for Opinion Mining for Brazilian Portuguese. (HuggingFace)
- Rhetalho - corpus annotated with Daniel Marcu's RSTTool.
- SemClinBr - multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks.
- SESAME - corpus for NER in Portuguese.
- SIGARRA News Corpus - SIGARRA information system at the University of Porto.
- SIMPLEX-PB - A Lexical Simplification Database and Benchmark for Portuguese.
- SIMPLEX-PB-2.0 - improved version of SIMPLEX-PB.
- SIMPLEX-PB-3.0 - new version of SIMPLEX-PB.
- Spotify Subset - classifying language variations in Brazilian Portuguese
- SQUAD-PT v1.1 - Portuguese translation of the SQuAD dataset.
- SQUAD-PT v1.1-pt-br - Brazilian Portuguese translation of the SQuAD dataset, translated by Deep Learning Brasil.
- SQUAD-PT v2.0 - Portuguese translation of SQuAD 2.0 dataset.
- SST-2 pt - Automatic translation of the Stanford Sentiment Treebank.
- TeMário - news texts and the corresponding human summaries for summarization purposes.
- Textual Complexity Corpus - Textual Complexity Corpus for School Stages in the Brazilian Educational System.
- ToLD-Br - Toxic Language Detection in Social Media for Brazilian Portuguese (github).
- TTS-Portuguese Corpus - Text To Speech Portuguese.
- TweetSentBR - Tweets in Brazilian Portuguese.
- Tweets for Sentiment Analysis.
- UD_Portuguese-Bosque - Universal Dependencies (UD) Portuguese treebank.
- UD_Portuguese-CINTIL - Universal Dependencies (UD) Portuguese treebank.
- UD_Portuguese-GSD - Universal Dependencies (UD) Portuguese treebank.
- UD_Portuguese-PetroGold - Universal Dependencies (UD) Portuguese treebank.
- UD_Portuguese-PUD - Universal Dependencies (UD) Portuguese treebank.
- UlyssesNER-Br - Corpus of Brazilian Legislative Documents for Named Entity Recognition
- UTLCorpus - a corpus of online reviews in Brazilian Portuguese annotated with helpfulness classification.
- Winograd Schema Challenge - Solver for the Portuguese-based Winograd Schema Challenge.
- WizardVicuna-PTBR-Instruct-Clean - Wizard Vicuna PT-Br Instruct Clean dataset.
Multilingual datasets
- A Multilingual Dataset for Investigating Stereotypes and Negative Attitudes Towards Migrant Groups in Large Language Models
- askD - ELI5 dataset adapted on Medical Questions (AskDocs) subreddit.
- English-Portuguese Sentences - English-Portuguese Sentences from the Tatoeba Project.
- EUR-Lex - multilingual corpus in all the official languages of the European Union.
- Europarl - European Parliament Proceedings Parallel Corpus 1996-2011.
- Europarl-ST - Multilingual Speech Translation Corpus, that contains paired audio-text samples for Speech Translation, constructed using the debates carried out in the European Parliament in the period between 2008 and 2012.
- mc4 - multilingual, colossal, cleaned version of Common Crawl's web crawl corpus.
- mfaq - multilingual corpus of Frequently Asked Questions parsed from the Common Crawl.
- MKQA - Multilingual Knowledge Questions & Answers (github).
- MQA - multilingual corpus of Questions and Answers (MQA) parsed from the Common Crawl.
- MMARCO - Multilingual version of the MS MARCO passage ranking dataset.
- mRobust - Multilingual version of the TREC 2004 Robust passage ranking dataset
- MultiCoNER - a large multilingual dataset for Named Entity Recognition.
- MuST-C - multilingual speech translation corpus.
- OpenSubtitles - collection of translated movie subtitles.
- OSCAR - Open Super-large Crawled Aggregated coRpus.
- Tatoeba - a large database of sentences and translations.
- TED2020 - contains a crawl of nearly 4000 TED and TED-X transcripts from July 2020.
- TSAR-2022-Shared-Task - TSAR2022 Shared Task on Lexical Simplification.
- WikiANN - multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format.
- WikiLingua - Multilingual abstractive summarization dataset extracted from WikiHow.
- WikiMatrix - Parallel Sentences in 1620 Language Pairs from Wikipedia.
- Wikiner - Learning multilingual named entity recognition from Wikipedia.
- WikiNEuRal - Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).
- Wikipedia - Wikipedia dataset containing cleaned articles of all languages.
- XFORMAL - A Benchmark for Multilingual Formality Style Transfer.
- XLSUM - 1.35 million professionally annotated article-summary pairs from BBC.
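
Several of the NER datasets listed above (WikiANN, for example) are described as using IOB2 tags over labels such as PER, LOC and ORG. As a rough illustration of what that annotation scheme encodes, the sketch below groups IOB2-tagged tokens into entity spans. The function name `iob2_to_spans` and the sample sentence are hypothetical, written for this example only; real datasets may store tags as integer ids that need mapping to strings first.

```python
from typing import List, Tuple


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB2-tagged tokens into (entity_text, label) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens = [token]
            current_label = tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with the same label continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not match the open entity,
            # closes the open entity; the token itself is not kept.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hypothetical example sentence, not taken from any listed dataset.
tokens = ["Fernando", "Pessoa", "nasceu", "em", "Lisboa", "."]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
print(iob2_to_spans(tokens, tags))
```

Dropping a stray I- tag is one possible design choice; some evaluation scripts instead treat it as the start of a new entity.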
Lexicon
- BATS-PT - manual translation of the lexicographic portion of the Bigger Analogy Test Set (BATS) to Portuguese
- br.ispell - Ispell dictionary for Brazilian Portuguese (github).
- Conceptnet - an open, multilingual knowledge graph.
- DicSin - Dictionary of synonyms and antonyms.
- lexiconPT - R package that provides lexicons for Portuguese Text Analysis.
- lexicons - Dictionaries of names, surnames, acronyms and their extensions, stop-words, etc.
- LIWC - Linguistic Inquiry and Word Count (dictionary)
- Onto.PT - Lexical Ontology for Portuguese.
- OpenWordnet-PT - an open access wordnet for Portuguese (site).
- OpLexicon - a sentiment lexicon for the Portuguese language.
- palavras - Word list of Brazilian Portuguese.
- PAPEL.
- pt-br - Wordlist, verbs, conjugations, term frequencies.
- PT-LKB - Large Portuguese Lexical-Semantic Knowledge Base
- PULO - Portuguese Unified Lexical Ontology.
- SentiLex-PT - a sentiment lexicon for Portuguese.
- Stopwords - Portuguese stopwords collection.
- Tep2.
- Unitex-PB - lexical resources.
- VaLexPB - a lexicon of Brazilian Portuguese verb valences.
- VerbNet.Br 1.0 - verbal lexicon of Brazilian Portuguese.
- wikidict-dsl-pt - Wikidata Bilingual DSL Dictionaries.
- Wordnetaffectbr - vocabulary of emotion words.
- Wordnet.Br - Portuguese WordNet.
Models
- Albertina PT-BR - It is an encoder of the BERT family for the Portuguese language - the American variant from Brazil.
- Albertina PT-PT - It is an encoder of the BERT family for the Portuguese language - the European variant from Portugal.
- Alpaca-LoRA-PTBR - Low-Rank LLaMA Instruct-Tuning.
- BART - BART pretrained on Portuguese.
- BERTimbau - BERTimbau Base is a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performances on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment (Github).
- BioBERTpt - fine-tuned BERT models trained on the clinical domain for Portuguese language (Github).
- Cabrita - A Portuguese instruction-finetuned LLaMA (Github).
- DeBERTinha - A DeBERTa V3 XSmall adapted to the Brazilian Portuguese language (Github).
- Electra - Electra model trained on BRWAC.
- Gervasio-PT-BR - It is a decoder of the GPT family for the Portuguese language - the American variant from Brazil.
- Gervasio-PT-PT - It is a decoder of the GPT family for the Portuguese language - the European variant from Portugal.
- GlórIA 1.3B - A Large Language Model focused on European Portuguese (HuggingFace)
- GPT2 small - GPorTuguese-2 (Portuguese GPT-2 small) is a state-of-the-art language model for Portuguese based on the GPT-2 small model.
- GPT-Neo small - a version of GPT-Neo 125M by EleutherAI finetuned for the Portuguese language.
- GPT2-Bio-PT - a biomedical finetuned version of GPorTuguese-2 (Github).
- NERDE-base - BERTimbau finetuned to NER on Judicial Documents.
- roberta-pt-br
- RoBERTaCrawlPT-base - RoBERTaCrawlPT-base is a generic Portuguese Masked Language Model pretrained from scratch from the CrawlPT corpora
- RoBERTaLexPT-base - Portuguese Masked Language Model pretrained from scratch from the LegalPT and CrawlPT corpora
- Sabiá - Sabiá-7B is a Portuguese language model developed by Maritaca AI.
- Sabiá 2 - Language model trained on Portuguese text, especially in the Brazilian domain.
- T5 - T5 model on Brazilian Portuguese data.
- tgf-xlm-roberta-base-pt-br (Github)
- Wav2vec - Fine-tuned facebook/wav2vec2-large-xlsr-53 on Portuguese using the train and validation splits of Common Voice 6.1.
Multilingual Models
- Bloom - BigScience Large Open-science Open-access Multilingual Language Model.
- mBert - Pretrained model on the top 104 languages with the largest Wikipedia using a masked language modeling (MLM) objective.
- mDeBERTa
- mGPT - Multilingual GPT model. An autoregressive GPT-like model.
- mMiniLM - mMiniLM-L6-v2 Reranker finetuned on mMARCO
- mT5 - Multilingual T5. A massively multilingual pre-trained text-to-text transformer.
- XLM-RoBERTa - XLM-RoBERTa model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.
- LaBSE - Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages.
Word Embeddings
- fastText - Multi-lingual word vectors.
- LASER - Language-Agnostic SEntence Representations.
- NILC-Embeddings - Word embeddings trained in Portuguese by USP.
- MUSE - Multilingual Unsupervised and Supervised Embeddings.
- word vectors - Pre-trained word vectors of 30+ languages.
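
Pre-trained word vectors such as the ones listed above are commonly distributed in a plain-text layout: a header line with the vocabulary size and dimension, followed by one word and its values per line. The sketch below parses that layout and compares two words by cosine similarity. It assumes this text format; the helper names `parse_vec` and `cosine` and the three toy vectors are hypothetical and invented for the example, not taken from any listed resource.

```python
import math
from typing import Dict, Iterable, List


def parse_vec(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse text-format vectors: header 'count dim', then 'word v1 ... vd'."""
    it = iter(lines)
    _count, dim = map(int, next(it).split())
    vectors: Dict[str, List[float]] = {}
    for line in it:
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            # Skip malformed rows instead of failing on the whole file.
            continue
        vectors[word] = values
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy two-dimensional vectors, made up for illustration only.
toy_file = ["3 2", "rei 1.0 0.0", "rainha 0.8 0.6", "mesa 0.0 1.0"]
vectors = parse_vec(toy_file)
print(cosine(vectors["rei"], vectors["rainha"]))
print(cosine(vectors["rei"], vectors["mesa"]))
```

For real files, passing an open file handle to `parse_vec` streams the rows one at a time; a dedicated library is likely a better fit for large vocabularies.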
Metrics
- Coh-Metrix-Port - an adaptation of the Coh-Metrix text analysis tool to the Brazilian Portuguese language.
- NILC-Metrix - it gathers the metrics developed over more than a decade in NILC Lab.
Leaderboards
- Open PT LLM Leaderboard - aims to provide a benchmark for the evaluation of Large Language Models (LLMs) in the Portuguese language across a variety of tasks and datasets.
Frameworks
Institutions
- Brasileiras em PLN.
- HAILab-PUCPR - A pioneering research group aiming to develop solutions for health care using Natural Language Processing and Machine Learning.
- Linguateca.
- NILC.
- NLPortuguês - Devoted to creating NLP courses in Brazilian Portuguese.
- NLX-Group.
- PLN PUCRS.
Tools
- Apertium-por - Apertium linguistic data for Portuguese.
- Autocorrect - Spelling corrector in Python.
- BrGram - Computational grammar fragment of Brazilian Portuguese in the LFG formalism implemented in XLE.
- Dicio API - Portuguese dictionary API.
- dict-pt-br - dictionary for Brazilian Portuguese.
- Languagetool - Style and Grammar Checker for 25+ Languages.
- LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language.
- LexML Parser - parser for legal documents.
- LX parser - statistical constituency parser for Portuguese.
- metaphone-ptbr - Metaphone algorithm for the Portuguese language.
- mlconjug3 - a Python library to conjugate verbs in Portuguese and other languages.
- MorphoBr - Resources for morphological analysis of Portuguese.
- OpCluster - Automatic extraction and clustering of fine-grained opinions.
- Phonemizer - Simple text to phones converter for multiple languages.
- PorGram - Open source computational grammar for Portuguese in the HPSG formalism.
- pymetaphone-br - Metaphone algorithm package for the Portuguese language.
- pysentimiento - Multilingual toolkit for Sentiment Analysis and Social NLP tasks.
- pyspellchecker - Multilingual Spell Checking.
- RBAMR - A Rule-Based AMR Parser for Portuguese.
- Verbecc - Complete Conjugation of any Verb using Machine Learning for French, Spanish, Portuguese, Italian and Romanian.
Other lists
- Annotated Semantic Relationships Datasets
- Linguistic datasets - Linguistic Datasets for Portuguese.
- NER-datasets for Portuguese
- NILC
- NILC 2
- NILC 3
- Opinando - Opinion Mining for Portuguese.
- Portuguese dataset List
Other links
- OPUS - OPUS is a growing collection of translated texts from the web.
- Statistical and Neural Machine Translation.