Bhojpuri Language Technological Resources (BHLTR)

Introduction

The Bhojpuri LT Resources (BHLTR) project was initiated by me (Atul) at Jawaharlal Nehru University (JNU), New Delhi, during my doctoral research. The BHLTR data contains monolingual, parallel (English-Bhojpuri), and POS-annotated monolingual corpora. The POS annotation follows the Bureau of Indian Standards (BIS) part-of-speech (POS) tagset.

Structure of the BHLTR data folder

bho-resources/
├─ mono-bho-corpus/
│  ├─ monolingual.bho
│  ├─ README.md
│  ├─ pos-annotated/
│  │  └─ pos-tagged.bho
│  └─ treebank/
│     └─ README.md
├─ parallel-corpora/
│  ├─ README.md
│  └─ eng-bho/
│     ├─ eng-bho.en
│     └─ eng-bho.bho
├─ additional-resources.md
├─ license.md
├─ README.md
└─ README.txt
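
The parallel files listed above can be read side by side. The following is a minimal sketch, not part of the BHLTR release. It assumes that `eng-bho.en` and `eng-bho.bho` are UTF-8 text files aligned line by line (line N of one file translates line N of the other), which is the usual convention for such file pairs but is not stated in this README. The function name `read_parallel` and the relative path are hypothetical.

```python
from itertools import zip_longest
from pathlib import Path
from typing import Iterator, Tuple


def read_parallel(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumption: both files are UTF-8 and line N in one file corresponds
    to line N in the other. A length mismatch raises ValueError.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            yield s.rstrip("\n"), t.rstrip("\n")


if __name__ == "__main__":
    # Hypothetical location: adjust to wherever the data folder is checked out.
    base = Path("bho-resources/parallel-corpora/eng-bho")
    en_file, bho_file = base / "eng-bho.en", base / "eng-bho.bho"
    if en_file.exists() and bho_file.exists():
        for i, (en, bho) in enumerate(read_parallel(str(en_file), str(bho_file))):
            if i == 3:  # show only the first three pairs
                break
            print(en, "|||", bho)
```

Raising an error on a length mismatch, rather than silently truncating to the shorter file, makes a broken alignment visible before the pairs are used for training or evaluation.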
   

Acknowledgments

I would like to thank my doctoral supervisor, Prof. Girish Nath Jha, and the Sanskrit Computational Lab, JNU, New Delhi.

References

If you use this data, please cite:

<pre>
@article{ojha2019english,
  title={English-Bhojpuri SMT System: Insights from the Karaka Model},
  author={Ojha, Atul Kr},
  journal={arXiv preprint arXiv:1905.02239},
  year={2019}
}
</pre>

Other papers and references about BHLTR:

<pre>
@inproceedings{karakanta2019proceedings,
  title={Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages},
  author={Karakanta, Alina and Ojha, Atul Kr and Liu, Chao-Hong and Washington, Jonathan and Oco, Nathaniel and Lakew, Surafel Melaku and Malykh, Valentin and Zhao, Xiaobing},
  booktitle={Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages},
  year={2019}
}
</pre>

<pre>
@article{kumar2018automatic,
  title={Automatic identification of closely-related Indian languages: Resources and experiments},
  author={Kumar, Ritesh and Lahiri, Bornini and Alok, Deepak and Ojha, Atul Kr and Jain, Mayank and Basit, Abdul and Dawer, Yogesh},
  journal={arXiv preprint arXiv:1803.09405},
  year={2018}
}
</pre>

<pre>
@inproceedings{ojha2015training,
  title={Training \& evaluation of POS taggers in Indo-Aryan languages: a case of Hindi, Odia and Bhojpuri},
  author={Ojha, Atul Kr. and Behera, Pitambar and Singh, Srishti and Jha, Girish N},
  booktitle={the proceedings of 7th Language \& Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics},
  pages={524--529},
  year={2015}
}
</pre>

<pre>
=== Machine-readable metadata (DO NOT REMOVE!) ================================
Data available since: BHLTR v1.0
License: CC BY-NC-SA 4.0
Includes text: yes
Contributors: Ojha, Atul Kr.
Copyright (©) holder: Ojha, Atul Kr.
Contact: shashwatup9k@gmail.com
===============================================================================
</pre>