
eng_guj_parallel_corpus

This repository contains a parallel corpus of about 65k English sentences and their Gujarati translations.

The separator used is '\n'. Users can change the separator to suit the requirements of their own solution.
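As a rough illustration of handling the '\n' separator, the sketch below splits two newline-separated texts into sentences, pairs them by position, and re-joins the pairs with a different separator (a tab). It assumes the English and Gujarati sides are available as two separate, line-aligned strings; that layout, the function names, and the placeholder sentences are assumptions for illustration, not something this repository specifies.

```python
from typing import List, Tuple


def split_sentences(text: str, sep: str = "\n") -> List[str]:
    """Split text on `sep`, dropping only the empty piece left by a trailing separator."""
    parts = text.split(sep)
    if parts and parts[-1] == "":
        parts.pop()
    return [p.strip() for p in parts]


def pair_sentences(english_text: str, gujarati_text: str,
                   sep: str = "\n") -> List[Tuple[str, str]]:
    """Pair sentences by position; assumes both sides are line-aligned (hypothetical layout)."""
    en = split_sentences(english_text, sep)
    gu = split_sentences(gujarati_text, sep)
    if len(en) != len(gu):
        raise ValueError(f"sentence counts differ: {len(en)} vs {len(gu)}")
    return list(zip(en, gu))


def to_tsv(pairs: List[Tuple[str, str]]) -> str:
    """Re-join pairs as one 'english<TAB>gujarati' record per line."""
    return "\n".join(f"{en}\t{gu}" for en, gu in pairs)


if __name__ == "__main__":
    # Placeholder strings stand in for real corpus sentences.
    english = "english sentence 1\nenglish sentence 2\n"
    gujarati = "gujarati sentence 1\ngujarati sentence 2\n"
    print(to_tsv(pair_sentences(english, gujarati)))
```

Raising an error on a count mismatch is a deliberate choice here: silently truncating to the shorter side would hide misaligned pairs.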

About Dataset

The dataset was developed at the Language Processing Laboratory, Uka Tarsadia University, Gujarat, India, as part of ongoing research on Natural Language Processing and Machine Translation. It contains around 65,000 English sentences from the MSCOCO captioning dataset that were translated to Gujarati and converted to parallel format.

Citation

P. Shah and V. Bakrola, "Neural Machine Translation System of Indic Languages - An Attention based Approach," 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP), Gangtok, India, 2019, pp. 1-5, doi: 10.1109/ICACCP.2019.8882969. IEEE Xplore, arXiv