SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts

This is the implementation of our paper "SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts". You can find the paper here.

Abstract

In this paper, we propose an annotated sentiment analysis dataset made of informally written Bangla texts. This dataset comprises public comments on news and videos collected from social media covering 13 different domains, including politics, education, and agriculture. Each comment is labeled with one of three polarity labels: positive, negative, or neutral. One significant characteristic of the dataset is that each of the comments is noisy in terms of the mix of dialects and grammatical incorrectness. Our experiments to develop a benchmark classification system show that hand-crafted lexical features outperform neural network and pretrained language models.

Authors

<sup>1</sup> Shahjalal University of Science and Technology, Bangladesh <br>
<sup>2</sup> University of Alberta, Canada <br>
<sup>3</sup> Amazon Alexa AI, USA <br>
<sup>4</sup> Fordham University, USA

SentNoB Dataset is available here

List of files

Files Format

| Column Title | Description |
| --- | --- |
| Data | Social media comment |
| Label | 0, 1 or 2: '0' for neutral, '1' for positive and '2' for negative |
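
As a minimal sketch of reading a file in this format, the snippet below uses Python's standard `csv` module and maps the numeric labels to their names. It assumes the files are comma-separated with a header row naming the columns `Data` and `Label`. The helper `load_split` and the file name in the usage note are hypothetical, so adjust them to match the files in the downloaded folder.

```python
import csv

# Label encoding as documented in the file format description above.
LABEL_NAMES = {0: "neutral", 1: "positive", 2: "negative"}


def load_split(path):
    """Read one dataset file and return a list of (comment, label_name) pairs.

    Assumes a comma-separated file with a header row containing the
    columns "Data" and "Label" (an assumption about the on-disk layout).
    """
    examples = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            label = int(row["Label"])
            examples.append((row["Data"], LABEL_NAMES[label]))
    return examples
```

For example, `load_split("SentNoB Dataset/Train.csv")` would return the comments of that file paired with `"neutral"`, `"positive"`, or `"negative"`, provided a file with that (assumed) name exists.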

Installation

Requires the following packages:

We recommend using a virtual environment tool such as virtualenv. Follow the steps below to set up the project:

Usage

  1. Download the SentNoB dataset from here
  2. Unzip the folder
  3. Ensure the folder name is "SentNoB Dataset"
  4. Go to the `data_processing` folder and run `python preprocess.py` to obtain the preprocessed data.
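
The folder-name requirement in step 3 can be checked before running the preprocessing script. The sketch below is illustrative only: the helper `find_dataset_dir` is hypothetical, and it assumes the unzipped folder sits in whatever directory you pass in, which the steps above do not specify.

```python
from pathlib import Path

# Folder name required by step 3 above.
DATASET_DIR_NAME = "SentNoB Dataset"


def find_dataset_dir(parent="."):
    """Return the dataset folder inside `parent`, or raise if it is missing.

    Where the folder must live relative to the repository is an
    assumption here; pass the directory you unzipped into.
    """
    candidate = Path(parent) / DATASET_DIR_NAME
    if not candidate.is_dir():
        raise FileNotFoundError(
            f'Expected a folder named "{DATASET_DIR_NAME}" in {Path(parent).resolve()}'
        )
    return candidate
```

Calling `find_dataset_dir()` first gives a clearer error message if the archive was unzipped under a different name.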

Feature-Based Experiments

Neural Network Experiments

Random Initialization
FastText

mBERT

BibTeX

@inproceedings{islam2021sentnob,
  title={SentNoB: A Dataset for Analysing Sentiment on Noisy Bangla Texts},
  author={Islam, Khondoker Ittehadul and Kar, Sudipta and Islam, Md Saiful and Amin, Mohammad Ruhul},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2021},
  pages={3265--3271},
  year={2021}
}