EVBCorpus - English-Vietnamese Parallel corpus

for Comparative Linguistics, Machine Translation, and Vietnamese NLP tasks

The EVBCorpus contains over 20 million words from 15 bilingual books, 100 parallel English-Vietnamese / Vietnamese-English texts, 250 parallel law and ordinance texts, 5,000 news articles, and 2,000 film subtitles. The composition, annotation, encoding, and availability of the corpus are meant to facilitate the development of language technology and studies in bilingual terminology extraction, primarily for the English-Vietnamese-English language pair.

The process of building EVBCorpus includes four main steps:

  1. Collect data and align bitext at the paragraph level;
  2. Align bitext at the sentence level;
  3. Perform linguistic analysis and tagging;
  4. Annotate and correct the corpus with toolkits.

As a result, the EVBCorpus is aligned at the sentence level, and a part of the corpus containing 5,000 news articles was aligned at the word level using tools and human annotators.

Release EVBNews v.1.0, with 1,000 parallel documents, can be downloaded at https://github.com/qhungngo/EVBCorpus/blob/master/EVBCorpus_EVBNews_v1.0.rar or https://sites.google.com/a/uit.edu.vn/hungnq/evbcorpus/EVBCorpus_EVBNews_v1.0.rar?attredirects=0&d=1

**Release EVBNews v.2.0, with 1,000 word-aligned parallel documents, can be downloaded at** https://github.com/qhungngo/EVBCorpus/blob/master/EVBCorpus_EVBNews_v2.0.rar or https://sites.google.com/a/uit.edu.vn/hungnq/evbcorpus/EVBCorpus_EVBNews_v2.0.rar?attredirects=0&d=1

If you are interested in the corpus, please email hungnq(at)uit.edu.vn for more details.

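Details of the upgraded EVBCorpus v.2.0 (2018):

| Source | Documents | Paragraphs | Sentences | Words |
|---|---|---|---|---|
| Books | 15 | 14,195 | 61,167 | 1,335,180 |
| Fictions | 100 | 192,898 | 489,787 | 6,129,161 |
| Laws | 250 | 86,848 | 98,064 | 1,981,932 |
| ETests | 500 | 20,288 | 21,575 | 411,093 |
| News | 5,000 | 94,933 | 173,903 | 2,965,590 |
| Subtitles | 2,000 | 1,302,839 | 1,447,581 | 8,150,080 |
| Total | 7,865 | 1,712,001 | 2,292,077 | 20,973,036 |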
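As a quick sanity check, the per-source rows of the v.2.0 table can be summed and compared with the Total row, and the ratio of the Word column to the Sentence column can be derived per source. The following is a minimal Python sketch: the variable names are illustrative and not part of the corpus distribution, and the numbers are copied by hand from the table above.

```python
# Counts copied from the EVBCorpus v.2.0 table:
# source -> (documents, paragraphs, sentences, words)
V2_COUNTS = {
    "Books": (15, 14195, 61167, 1335180),
    "Fictions": (100, 192898, 489787, 6129161),
    "Laws": (250, 86848, 98064, 1981932),
    "ETests": (500, 20288, 21575, 411093),
    "News": (5000, 94933, 173903, 2965590),
    "Subtitles": (2000, 1302839, 1447581, 8150080),
}
V2_TOTAL = (7865, 1712001, 2292077, 20973036)

# Column-wise sums over all sources, compared with the Total row.
column_sums = tuple(sum(column) for column in zip(*V2_COUNTS.values()))
totals_match = column_sums == V2_TOTAL

# Ratio of the Word column to the Sentence column for each source.
words_per_sentence = {
    source: words / sentences
    for source, (_docs, _paras, sentences, words) in V2_COUNTS.items()
}

if __name__ == "__main__":
    print("Totals match:", totals_match)
    for source, ratio in sorted(words_per_sentence.items(), key=lambda kv: kv[1]):
        print(f"{source:10s} {ratio:6.2f} words per sentence")
```

With these figures the row sums agree with the Total row. Subtitles have the lowest words-per-sentence ratio (roughly 5.6) and Books the highest (roughly 21.8).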

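Details of the data sources of EVBCorpus v.1.0 (2012):

| Source | Documents | Paragraphs | Sentences | Words |
|---|---|---|---|---|
| Books | 15 | 13,980 | 80,323 | 1,375,492 |
| Fictions | 100 | 192,723 | 491,703 | 6,307,613 |
| Laws | 250 | 86,803 | 98,102 | 1,912,055 |
| News | 1,000 | 24,523 | 45,531 | 740,534 |
| Total | 1,365 | 318,029 | 715,659 | 10,431,592 |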

English-Vietnamese Word Alignment Corpus (EVWACorpus)

The EVWACorpus contains 1,000 news articles with 45,531 sentence pairs and 740,534 English words, aligned manually at the word level between the English and Vietnamese sentences. Details of the EVWACorpus:

| | English | Vietnamese |
|---|---|---|
| Files | 1,000 | 1,000 |
| Sentences | 45,531 | 45,531 |
| Words | 740,534 | 832,441 |
| Sure Alignments | 447,906 | 447,906 |
| Possible Alignments | 560,215 | 560,215 |
| Words in Alignments | 654,060 | 768,031 |
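Manually annotated sure and possible alignments like these are commonly used as a gold standard for scoring automatic word aligners with the alignment error rate (AER) of Och and Ney. The sketch below shows that metric in Python on toy data. It assumes that links are represented as `(english_index, vietnamese_index)` pairs and that every sure link also counts as a possible link; both are assumptions for illustration, not a description of the EVWACorpus file format, and the function name is hypothetical.

```python
def alignment_error_rate(predicted, sure, possible):
    """Return AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    predicted, sure, possible: sets of (source_index, target_index) links.
    Sure links are treated as a subset of the possible links.
    """
    possible = possible | sure  # enforce S as a subset of P
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        return 0.0  # nothing predicted and nothing expected
    matched_sure = len(predicted & sure)
    matched_possible = len(predicted & possible)
    return 1.0 - (matched_sure + matched_possible) / denominator


# Toy example (made-up links, not taken from the corpus).
sure_links = {(0, 0), (1, 2)}
possible_links = {(0, 0), (1, 2), (2, 1)}
predicted_links = {(0, 0), (2, 1), (3, 3)}

aer = alignment_error_rate(predicted_links, sure_links, possible_links)

if __name__ == "__main__":
    print(f"AER = {aer:.2f}")
```

In the toy example one predicted link matches a sure link and two match possible links, so the AER is 1 - (1 + 2) / (3 + 2) = 0.4. Predicting exactly the sure links would give an AER of 0.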

English-Vietnamese Chunker Corpus (EVChkCorpus)

The EVChkCorpus contains 1,000 news articles with 45,531 sentence pairs. Both the English and Vietnamese texts are tagged with five basic chunk tags. Details of the EVChkCorpus:

| Tag | Name | English | Vietnamese |
|---|---|---|---|
| NP | Noun Phrase | 212,500 | 209,824 |
| VP | Verb Phrase | 90,784 | 123,600 |
| PP | Preposition Phrase | 79,853 | 70,457 |
| ADVP | Adverb Phrase | 18,318 | |
| ADJP | Adjective Phrase | 8,367 | 15,104 |

English-Vietnamese Named Entities Corpus (EVNECorpus)

The EVNECorpus contains 1,000 news articles with 45,531 sentence pairs. Named entities are tagged in both the English and Vietnamese texts. Details of the EVNECorpus:

| Label | Name | English | Vietnamese |
|---|---|---|---|
| LOC | Location | 10,115 | 10,006 |
| PER | Person | 6,869 | 6,741 |
| ORG | Organization | 7,837 | 7,549 |
| PCT | Percentage | 1,107 | 921 |
| MON | Money | 898 | 823 |
| TIM | Time | 4,244 | 4,100 |
| - | Total | 35,879 | 34,732 |

The canonical publications for the EVBNews or EVBCorpus are:

Quoc Hung Ngo, Werner Winiwarter, and Bartholomaus Wloka (2013). "EVBCorpus - A Multi-Layer English-Vietnamese Bilingual Corpus for Studying Tasks in Comparative Linguistics", In Proceedings of the 11th Workshop on Asian Language Resources (11th ALR within the IJCNLP2013), pp. 1-9. Asian Federation of Natural Language Processing, 2013.

Quoc-Hung Ngo and Werner Winiwarter (2012). "Building an English-Vietnamese Bilingual Corpus for Machine Translation", International Conference on Asian Language Processing 2012 (IALP 2012), pp. 157-160. IEEE Computer Society, 2012.

The canonical publication for the EVNECorpus is:

Quoc Hung Ngo, Dinh Dien, and Werner Winiwarter, (2014). "Building English-Vietnamese Named Entity Corpus with Aligned Bilingual News Articles", The 5th Workshop on South and Southeast Asian Natural Languages Processing (5th SSANLP within the COLING2014). Association for Computational Linguistics, 2014.

The canonical publication for the Annotation Tool is:

Quoc-Hung Ngo and Werner Winiwarter (2012). "A Visualizing Annotation Tool for Semi-Automatically Building a Bilingual Corpus", In Proceedings of the 5th Workshop on Building and Using Comparable Corpora, LREC2012 Workshop, pp. 67-74. Association for Computational Linguistics, 2012.

The canonical publication for the GetWebContent tool is:

Quoc-Hung Ngo, Dinh Dien, and Werner Winiwarter (2012). "Automatic Searching for English-Vietnamese Documents on the Internet", The 3rd Workshop on South and Southeast Asian Natural Languages Processing (3rd SSANLP within the COLING2012), pp. 211-220. Association for Computational Linguistics, 2012.

Use for academic purposes:

If you would like to use the corpus for academic purposes, please email hungnq(at)uit.edu.vn for more details.