The IndoWordnet Parallel Corpus
IndoWordnet is a linked structure of wordnets of major Indian languages from the Indo-Aryan, Dravidian and Sino-Tibetan families. Synsets are linked across many languages, and every synset in every language contains a gloss and an example usage sentence or phrase. In a large number of cases, the gloss and example sentences are translations of each other across languages. Hence, IndoWordnet is a source of parallel corpora across multiple Indian languages.
The corpus contains about 6.3 million parallel segments across 18 Indian languages from 3 language families.
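To illustrate how linked synsets can give rise to parallel segments, here is a minimal sketch. It is not the extraction code used to build this corpus: the `synsets` dictionary, its layout, the language codes and the placeholder strings are all hypothetical, and the sketch assumes every linked gloss and example is a translation, which the description above says holds only in a large number of cases.

```python
from itertools import combinations

# Hypothetical toy data (NOT the real IndoWordnet format):
# synset ID -> language code -> (gloss, example).
synsets = {
    101: {
        "hi": ("gloss-hi-101", "example-hi-101"),
        "mr": ("gloss-mr-101", "example-mr-101"),
        "ta": ("gloss-ta-101", "example-ta-101"),
    },
    102: {
        "hi": ("gloss-hi-102", "example-hi-102"),
        "mr": ("gloss-mr-102", "example-mr-102"),
    },
}


def parallel_segments(synsets):
    """Yield (src_lang, tgt_lang, src_text, tgt_text) for each language
    pair that shares a synset, pairing gloss with gloss and example
    with example."""
    for entries in synsets.values():
        # Sort language codes so each unordered pair appears exactly once.
        for src, tgt in combinations(sorted(entries), 2):
            for src_text, tgt_text in zip(entries[src], entries[tgt]):
                yield src, tgt, src_text, tgt_text


pairs = list(parallel_segments(synsets))
for src, tgt, src_text, tgt_text in pairs:
    print(f"{src}-{tgt}\t{src_text}\t{tgt_text}")
```

A real pipeline would additionally need to filter out synsets whose glosses or examples are not actual translations of each other.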
NEWS! WMT 2020 is using this corpus for the shared task on similar language translation.
Documentation
You can read more about the corpus in this document: pdf
Download the corpus
You can download the corpus HERE
Version History
- v0.2 (14 May 2020): Bug fixes to address problems with extraction in v0.1.
- v0.1 (25 March 2020): Initial release (BUGGY: do not use this version; use v0.2).
License
This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International license.
Citing this dataset
If you use this dataset, please include the following citation:
@misc{kunchukuttan2020iwnparallel,
author = "Anoop Kunchukuttan",
title = "IndoWordnet Parallel Corpus",
year = "2020",
howpublished={\url{https://github.com/anoopkunchukuttan/indowordnet_parallel}}
}
We would like to hear from you if:
- You are using our resources. Please let us know how you are putting these resources to use.
- You have any feedback on these resources.