# small_parallel_enja: 50k En/Ja Parallel Corpus for Testing SMT Methods
This directory contains a small parallel corpus for the English-Japanese translation task. The data were extracted from the TANAKA Corpus by filtering for sentences of 4 to 16 words.
English sentences are tokenized with the Stanford Tokenizer and lowercased. Japanese sentences are tokenized with KyTea.
All texts are encoded in UTF-8. The sentence separator is '\n' and the word separator is ' ' (a single space).
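Given this format, the files can be read with plain string splitting. A minimal sketch in Python (file names as in the statistics table below, assumed to be in the current directory):

```python
# Each file stores one sentence per line ('\n') with tokens
# separated by single spaces (' ').
def load_corpus(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split(" ") for line in f]

# Pair up the two sides of the training data.
src = load_corpus("train.en")
trg = load_corpus("train.ja")
assert len(src) == len(trg)  # 50,000 sentence pairs
```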
Attention: some English words have tokenization results that differ from the Stanford Tokenizer's standard output, e.g., "don't" -> "don" "'t", which may come from preprocessing errors. Take care when using this dataset for token-level evaluation.
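If token-level scores matter, such artifacts can be normalized before evaluation. The sketch below uses one hypothetical rejoining rule; it is not part of the dataset or its tooling:

```python
import re

# Hypothetical cleanup: rejoin a stranded clitic like "'t" onto the
# preceding word ("don" "'t" -> "don't"). Adjust the rule to match
# the tokenization your evaluation expects.
def rejoin_clitics(tokens):
    text = " ".join(tokens)
    text = re.sub(r" (?=')", "", text)  # drop the space before an apostrophe
    return text.split(" ")

print(rejoin_clitics(["i", "don", "'t", "know", "."]))
# ['i', "don't", 'know', '.']
```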
## Corpus Statistics
File | #sentences | #words | #vocabulary |
---|---|---|---|
train.en | 50,000 | 391,047 | 6,634 |
- train.en.000 | 10,000 | 78,049 | 3,447 |
- train.en.001 | 10,000 | 78,223 | 3,418 |
- train.en.002 | 10,000 | 78,427 | 3,430 |
- train.en.003 | 10,000 | 78,118 | 3,402 |
- train.en.004 | 10,000 | 78,230 | 3,405 |
train.ja | 50,000 | 565,618 | 8,774 |
- train.ja.000 | 10,000 | 113,209 | 4,181 |
- train.ja.001 | 10,000 | 112,852 | 4,102 |
- train.ja.002 | 10,000 | 113,044 | 4,105 |
- train.ja.003 | 10,000 | 113,346 | 4,183 |
- train.ja.004 | 10,000 | 113,167 | 4,174 |
dev.en | 500 | 3,931 | 816 |
dev.ja | 500 | 5,668 | 894 |
test.en | 500 | 3,998 | 839 |
test.ja | 500 | 5,635 | 884 |
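The figures above can be recomputed with a short script. A sketch, assuming the corpus files sit in the current directory:

```python
from collections import Counter

# Count #sentences, #words (tokens), and #vocabulary (distinct types)
# for one corpus file.
def stats(path):
    n_sents, vocab = 0, Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            n_sents += 1
            vocab.update(line.split())
    return n_sents, sum(vocab.values()), len(vocab)

for name in ["train.en", "train.ja", "dev.en", "dev.ja", "test.en", "test.ja"]:
    print(name, *stats(name))
```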