small_parallel_enja: 50k En/Ja Parallel Corpus for Testing SMT Methods

This directory contains a small parallel corpus for the English-Japanese translation task. The data were extracted from the Tanaka Corpus by filtering to sentences of 4 to 16 words.

English sentences are tokenized with the Stanford Tokenizer and lowercased. Japanese sentences are tokenized with KyTea.

All texts are encoded in UTF-8. The sentence separator is '\n' and the word separator is ' ' (a single space).
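
For reference, here is a minimal Python sketch of this format, assuming the files listed in the statistics below sit in the current directory:

```python
# Minimal loading sketch: one sentence per line, tokens separated by
# single spaces, UTF-8 encoding, as described above.
def load_parallel(en_path, ja_path):
    with open(en_path, encoding="utf-8") as f_en, \
         open(ja_path, encoding="utf-8") as f_ja:
        for en_line, ja_line in zip(f_en, f_ja):
            yield en_line.rstrip("\n").split(" "), ja_line.rstrip("\n").split(" ")

pairs = list(load_parallel("train.en", "train.ja"))
print(len(pairs))  # 50,000 sentence pairs (see statistics below)
print(pairs[0])    # first (English tokens, Japanese tokens) pair
```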

Note: some English words are tokenized differently from standard Stanford Tokenizer output, e.g., "don't" -> "don" "'t", which may come from preprocessing errors. Please take care when using this dataset for token-level evaluation.
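
To gauge how often such tokens occur, a small check can be run over the training data (a sketch, assuming train.en is in the current directory; the apostrophe-prefix heuristic is our own, not part of the dataset):

```python
# Count tokens beginning with an apostrophe (e.g. "'t", "'s") to see
# how common the tokenization quirk noted above is. The heuristic is
# illustrative only.
from collections import Counter

clitics = Counter()
with open("train.en", encoding="utf-8") as f:
    for line in f:
        for token in line.rstrip("\n").split(" "):
            if token.startswith("'") and len(token) > 1:
                clitics[token] += 1

print(clitics.most_common(10))
```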

Corpus Statistics

| File | #sentences | #words | #vocabulary |
|---|---:|---:|---:|
| train.en | 50,000 | 391,047 | 6,634 |
| - train.en.000 | 10,000 | 78,049 | 3,447 |
| - train.en.001 | 10,000 | 78,223 | 3,418 |
| - train.en.002 | 10,000 | 78,427 | 3,430 |
| - train.en.003 | 10,000 | 78,118 | 3,402 |
| - train.en.004 | 10,000 | 78,230 | 3,405 |
| train.ja | 50,000 | 565,618 | 8,774 |
| - train.ja.000 | 10,000 | 113,209 | 4,181 |
| - train.ja.001 | 10,000 | 112,852 | 4,102 |
| - train.ja.002 | 10,000 | 113,044 | 4,105 |
| - train.ja.003 | 10,000 | 113,346 | 4,183 |
| - train.ja.004 | 10,000 | 113,167 | 4,174 |
| dev.en | 500 | 3,931 | 816 |
| dev.ja | 500 | 5,668 | 894 |
| test.en | 500 | 3,998 | 839 |
| test.ja | 500 | 5,635 | 884 |
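
The figures above can be recomputed with a short script; a sketch, assuming the usual definitions (words = total tokens, vocabulary = distinct tokens) and the files in the current directory:

```python
# Recompute per-file statistics: sentence count, total token count,
# and distinct-token vocabulary size.
def corpus_stats(path):
    n_sentences = n_words = 0
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.rstrip("\n").split(" ")
            n_sentences += 1
            n_words += len(tokens)
            vocab.update(tokens)
    return n_sentences, n_words, len(vocab)

for name in ["train.en", "train.ja", "dev.en", "dev.ja", "test.en", "test.ja"]:
    print(name, corpus_stats(name))  # e.g. train.en -> (50000, 391047, 6634)
```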