OLID-BR

Python 3.10

Offensive Language Identification Dataset for Brazilian Portuguese (OLID-BR) is a collection of Portuguese text with annotations for several NLP tasks related to toxicity/offensive language.

See the Dataset documentation for more information.

Technical details

This repository contains the source code to prepare, build, and publish the OLID-BR dataset.

The repository is structured as follows:

<details><summary>Architecture</summary> <p>

</p> </details>

Running Notebooks

Define the following environment variables before running the notebooks. Variables marked Optional can be omitted:

<details><summary>Environment Variables</summary> <p>
| Variable | Description | Default | Required |
| --- | --- | --- | --- |
| AWS_ACCESS_KEY_ID | AWS Access Key ID | None | Optional |
| AWS_S3_BUCKET_PREFIX | AWS S3 Bucket Prefix | None | Required |
| AWS_S3_BUCKET | AWS S3 Bucket | None | Required |
| AWS_SECRET_ACCESS_KEY | AWS Secret Access Key | None | Optional |
| FILTER_TOXIC_COMMENTS | Filter Toxic Comments | True | Optional |
| HUGGINGFACE_HUB_TOKEN | HuggingFace Hub Token | None | Required |
| KAGGLE_KEY | Kaggle Key | None | Required |
| KAGGLE_USERNAME | Kaggle Username | None | Required |
| LOG_LEVEL | Log level | INFO | Optional |
| PERSPECTIVE_API_KEY | Perspective API Key | None | Required |
| PERSPECTIVE_THRESHOLD | Perspective Threshold | 0.5 | Optional |
| TWITTER_ACCESS_TOKEN | Twitter Access Token | None | Required |
| TWITTER_ACCESS_TOKEN_SECRET | Twitter Access Token Secret | None | Required |
| TWITTER_CONSUMER_KEY | Twitter Consumer Key | None | Required |
| TWITTER_CONSUMER_SECRET | Twitter Consumer Secret | None | Required |
| TWITTER_MAX_TWEETS | Twitter Max Tweets or replies | None | Required |
| YOUTUBE_API_KEY | YouTube API Key | None | Required |
| YOUTUBE_MAX_COMMENTS_PER_VIDEO | YouTube Max Comments per video | None | Optional |

The Jupyter notebooks read these environment variables from a .env file.

</p> </details>
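The check below is a minimal sketch of how the variables in the table could be validated before running a notebook. It is not taken from the repository: `read_settings`, `REQUIRED`, and `OPTIONAL_DEFAULTS` are hypothetical names, and the sketch assumes the .env file has already been loaded into the process environment by whatever mechanism the notebooks use.

```python
import os
from typing import Dict, Mapping, Optional

# Variables the table marks as Required.
REQUIRED = (
    "AWS_S3_BUCKET_PREFIX", "AWS_S3_BUCKET", "HUGGINGFACE_HUB_TOKEN",
    "KAGGLE_KEY", "KAGGLE_USERNAME", "PERSPECTIVE_API_KEY",
    "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_TOKEN_SECRET",
    "TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
    "TWITTER_MAX_TWEETS", "YOUTUBE_API_KEY",
)

# Variables the table marks as Optional, with the defaults it lists.
# None means the table gives no default.
OPTIONAL_DEFAULTS = {
    "AWS_ACCESS_KEY_ID": None,
    "AWS_SECRET_ACCESS_KEY": None,
    "FILTER_TOXIC_COMMENTS": "True",
    "LOG_LEVEL": "INFO",
    "PERSPECTIVE_THRESHOLD": "0.5",
    "YOUTUBE_MAX_COMMENTS_PER_VIDEO": None,
}


def read_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Hypothetical helper: collect the settings and fail early if any are missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("Missing required environment variables: " + ", ".join(missing))
    settings: Dict[str, Optional[str]] = {name: env[name] for name in REQUIRED}
    for name, default in OPTIONAL_DEFAULTS.items():
        # Values stay strings here; converting them is left to the caller.
        settings[name] = env.get(name, default)
    return settings
```

Calling `read_settings()` at the top of a notebook would surface every missing required variable in a single error instead of failing partway through a run.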

If you are running the notebooks on Google Colab, you need to run the following commands:

!git clone https://github.com/DougTrajano/olid-br.git
!mv olid-br/* .
!rm -rf olid-br
!pip install -r requirements.txt

Google Colab uses Python 3.7, which is not compatible with the numpy, pandas, and scikit-learn versions pinned in requirements.txt. Update requirements.txt to the following versions before running pip install:

numpy~=1.21.6
pandas~=1.3.5
scikit-learn~=1.0.2

Install dependencies

You can install the dependencies by running the following command:

pip install -r requirements.txt

Changelog

See the GitHub Releases page for a history of notable changes to this project.

License

The source code is licensed under the Apache 2.0 License.

The dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).