Awesome

hindi-discourse

A repository for the hindi discourse analysis dataset and its related experiments.

The Hindi Discourse Analysis dataset is a corpus for analyzing discourse modes present in its sentences. It contains sentences from stories written by 11 famous authors from the 20th Century. 4-5 stories by each author have been selected which were available in the public domain resulting in a collection of 53 stories. Most of these short stories were originally written in Hindi but some of them were written in other Indian languages and later translated to Hindi.

The corpus contains a total of 10472 sentences belonging to the following categories:

Argumentative
Descriptive
Dialogic
Informative
Narrative

Dataset Overview

The labeled data could be downloaded from here.

The data is shared in a json format. Each entry is a sentence along with it's annotated majority label:

{'Story_no': 15, 'Sentence': ' गाँठ से साढ़े तीन रुपये लग गये, जो अब पेट में जाकर खनकते भी नहीं! जो तेरी करनी मालिक! ” “इसमें मालिक की क्या करनी है? ”', 'Discourse Mode': 'Dialogue'}

Distribution of sentences in different discourse modes

Inter-annotator agreements each discourse mode and for the entire dataset as measured using Fleiss‘s Kappa

Sample sentences from Hindi Discourse Analysis Dataset for each discourse mode

Most common stop-words for each discourse mode

Part-of-speech analysis for each discourse mode

Each entry in the table is the average number of words tagged as the column label. ADJ: Adjective, ADP: Adposition, ADV: adverb, AUX: auxiliary, CCONJ: Coordinating Conjunction, DET: Determiner, INTJ: Interjection, NOUN: Noun, NUM: Numeral, PART: Particle, PRON: Pronoun, PROPN: Proper Noun, PUNCT: Punctuation, SCONJ: Subordinating Conjunction, SYM: Symbol, VERB: verb

Terms of Use

This corpus can be used freely for research purposes.
The paper listed below provide details of the creation and use of the corpus. If you use the corpus, then please cite the paper.
If interested in commercial use of the corpus, send email to midas@iiitd.ac.in.
If you use the corpus in a product or application, then please credit the authors and Multimodal Digital Media Analysis Lab - Indraprastha Institute of Information Technology, New Delhi appropriately. Also, if you send us an email, we will be thrilled to know about how you have used the corpus.
Multimodal Digital Media Analysis Lab - Indraprastha Institute of Information Technology, New Delhi, India disclaims any responsibility for the use of the corpus and does not provide technical support. However, the contact listed above will be happy to respond to queries and clarifications.
Rather than redistributing the corpus, please direct interested parties to this page

Please feel free to send us an email:

with feedback regarding the corpus.
with information on how you have used the corpus.
if interested in a collaborative research project.

References

Paper accepted at LREC 2020. Please check back soon.