Kangri Monolingual and Hindi-Kangri Parallel Corpora

Text Corpora

This dataset contains the monolingual and parallel data that was processed from July 2019 to January 2021.

Train dataset:

Kr_1 represents the Kangri monolingual dataset, which contains text from books, including short and long stories and novels. Apart from books, we also compiled monolingual data from conversations in various WhatsApp and Facebook groups.

Kr_2 represents Hindi-Kangri dictionary words.

Kr_3 represents the Kavitaiyein, Lok-Geet, and Kangri Gazals written by various Kangri authors.

Kr_4 represents the parallel Hindi-Kangri dataset, which was created by distributing different everyday topics to Kangri writers. Some of the categories are: Hospital, Defense, Media, School, Music, Sports, Dance, Food, Parties, Law, Market, Marriage, Culture, History, Education, Technology, Religion, Stories.

Kr_4 is categorized into Kr_4_Hindi and Kr_4_kangri.
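A minimal sketch of how the two halves of Kr_4 might be read as sentence pairs. It assumes, without confirmation from this README, that Kr_4_Hindi and Kr_4_kangri are UTF-8 plain-text files with one sentence per line and that line i in one file is the translation of line i in the other. The function names `pair_lines` and `read_parallel` are hypothetical helpers, not part of the corpus.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def pair_lines(hindi_lines: Iterable[str],
               kangri_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair Hindi and Kangri sentences by line position (assumed alignment)."""
    hindi = [line.strip() for line in hindi_lines]
    kangri = [line.strip() for line in kangri_lines]
    if len(hindi) != len(kangri):
        raise ValueError(
            f"line count mismatch: {len(hindi)} Hindi vs {len(kangri)} Kangri"
        )
    # Skip pairs where either side is empty after stripping whitespace.
    return [(h, k) for h, k in zip(hindi, kangri) if h and k]


def read_parallel(hindi_path: str, kangri_path: str) -> List[Tuple[str, str]]:
    """Read two aligned files (assumed UTF-8, one sentence per line)."""
    hindi_text = Path(hindi_path).read_text(encoding="utf-8")
    kangri_text = Path(kangri_path).read_text(encoding="utf-8")
    return pair_lines(hindi_text.splitlines(), kangri_text.splitlines())


# Example call; the paths are placeholders for wherever the files are stored:
# pairs = read_parallel("Kr_4_Hindi", "Kr_4_kangri")
```

Raising an error on a line-count mismatch is a deliberate choice: silently truncating to the shorter file would hide misaligned translations.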

| Dataset | Sentences | Tokens |
| --- | --- | --- |
| Monolingual Kangri | 1.81M | 2377100 |
| Parallel Hindi | 26,862 | 281076 |
| Parallel Kangri | 26,862 | 271752 |
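Counts like those in the table above can be approximated with a short script. The sketch below assumes one sentence per line and whitespace tokenization; the tokenizer actually used to produce the reported figures is not stated here, so results may differ. `corpus_stats` and `average_length` are hypothetical helper names.

```python
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (sentence_count, token_count), skipping blank lines."""
    sentences = 0
    tokens = 0
    for line in lines:
        words = line.split()  # whitespace tokenization (an assumption)
        if not words:
            continue
        sentences += 1
        tokens += len(words)
    return sentences, tokens


def average_length(sentences: int, tokens: int) -> float:
    """Average number of tokens per sentence; 0.0 for an empty corpus."""
    return tokens / sentences if sentences else 0.0


# Example with a placeholder path:
# with open("Kr_4_kangri", encoding="utf-8") as f:
#     n_sentences, n_tokens = corpus_stats(f)
```

Applied to the reported parallel figures, `average_length(26862, 281076)` gives about 10.5 tokens per Hindi sentence and `average_length(26862, 271752)` gives about 10.1 tokens per Kangri sentence.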

Citing

If you use any of these resources, please cite the following article:

@article{chauhan2021monolingual,
  title={Monolingual and parallel corpora for kangri low resource language},
  author={Chauhan, Shweta and Saxena, Shefali and Daniel, Philemon},
  journal={arXiv preprint arXiv:2103.11596},
  year={2021}
}

License

Kangri Corpus is licensed under a Creative Commons v0.1 License.

Acknowledgement

We thank Dr. Karam Singh, Director of Himachal Academy of Arts Culture and Languages, Shimla, Himachal Pradesh, India, for their efforts in arranging workshops to collect datasets. We thank all the Kangri book authors. We also thank all the language translators and writers who manually compiled the dataset.

Contributors

Shweta Chauhan
Philemon Daniel

Contacts

Shweta Chauhan (shweta@nith.ac.in)
Shefali Saxena (shefali@nith.ac.in)
Philemon Daniel (phildani7@nith.ac.in)