The IndoWordnet Parallel Corpus

IndoWordnet is a linked structure of wordnets for major Indian languages from the Indo-Aryan, Dravidian and Sino-Tibetan families. Synsets are linked across many languages, and every synset in every language contains a gloss and an example usage sentence or phrase. In a large number of cases, the gloss and example sentences are translations of each other across languages. Hence, IndoWordnet is a source of parallel corpora across multiple Indian languages.
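To illustrate how linked synsets can yield parallel segments, here is a minimal sketch in Python. It does not reflect the actual file format or extraction pipeline of the corpus. The `synsets` dictionary, its IDs, the language codes, and the placeholder strings are all hypothetical, and the function `parallel_segments` is a name invented for this example. The sketch also assumes that every gloss and example is a translation, whereas in the real data this holds only in a large number of cases.

```python
from itertools import combinations

# Hypothetical toy data: synset ID -> {language code -> (gloss, example)}.
# IDs, language codes, and strings are placeholders, not corpus content.
synsets = {
    101: {
        "hi": ("gloss-hi-101", "example-hi-101"),
        "mr": ("gloss-mr-101", "example-mr-101"),
        "ta": ("gloss-ta-101", "example-ta-101"),
    },
    102: {
        "hi": ("gloss-hi-102", "example-hi-102"),
        "ta": ("gloss-ta-102", "example-ta-102"),
    },
}


def parallel_segments(synsets):
    """Yield (lang_a, lang_b, segment_a, segment_b) for each language pair
    that shares a synset, pairing gloss with gloss and example with example."""
    for synset_id in sorted(synsets):
        entries = synsets[synset_id]
        for lang_a, lang_b in combinations(sorted(entries), 2):
            for seg_a, seg_b in zip(entries[lang_a], entries[lang_b]):
                yield lang_a, lang_b, seg_a, seg_b


pairs = list(parallel_segments(synsets))
for lang_a, lang_b, seg_a, seg_b in pairs:
    print(f"{lang_a}-{lang_b}\t{seg_a}\t{seg_b}")
```

Because each shared synset contributes up to two segments per language pair, the number of candidate segments grows with both the number of linked synsets and the number of languages covering them.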

The corpus contains about 6.3 million parallel segments across 18 Indian languages from 3 language families.

NEWS! WMT 2020 is using this corpus for its shared task on similar language translation.

Documentation

You can read more about the corpus in this document: pdf

Download the corpus

You can download the corpus HERE

Version History

License

This dataset is released under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Citing this dataset

If you use this dataset, please include the following citation:

@misc{kunchukuttan2020iwnparallel,
author = "Anoop Kunchukuttan",
title = "IndoWordnet Parallel Corpus",
year = "2020",
howpublished={\url{https://github.com/anoopkunchukuttan/indowordnet_parallel}}
}

We would like to hear from you if: