
EVBCorpus - English-Vietnamese Parallel corpus

for Comparative Linguistics, Machine Translation, and Vietnamese NLP tasks

The EVBCorpus contains over 20 million words from 15 bilingual books, 100 parallel English-Vietnamese / Vietnamese-English texts, 250 parallel law and ordinance texts, 5,000 news articles, and 2,000 film subtitles. The composition, annotation, encoding, and availability of the corpus are meant to facilitate the development of language technology and studies in bilingual terminology extraction, primarily for the English-Vietnamese language pair in both directions.

Building the EVBCorpus involves four main steps:

Collect data and align the bitext at the paragraph level;
Align the bitext at the sentence level;
Perform linguistic analysis and tagging;
Annotate and correct the corpus with toolkits.

As a result, the EVBCorpus was aligned at the sentence level, and a part of this corpus containing 5,000 news articles was aligned at the word level by a tool and by annotators.

Release EVBNews v.1.0 with 1,000 parallel documents, download at:
https://github.com/qhungngo/EVBCorpus/blob/master/EVBCorpus_EVBNews_v1.0.rar
https://sites.google.com/a/uit.edu.vn/hungnq/evbcorpus/EVBCorpus_EVBNews_v1.0.rar?attredirects=0&d=1

**Release EVBNews v.2.0 with 1,000 word-aligned parallel documents, download at:**
https://github.com/qhungngo/EVBCorpus/blob/master/EVBCorpus_EVBNews_v2.0.rar
https://sites.google.com/a/uit.edu.vn/hungnq/evbcorpus/EVBCorpus_EVBNews_v2.0.rar?attredirects=0&d=1
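The GitHub links above open the repository's file viewer page rather than the archive itself. The following is a minimal sketch of turning such a link into a direct file URL. It assumes GitHub's usual `raw.githubusercontent.com` layout and a branch name without slashes; the helper name `github_raw_url` is illustrative and not part of any EVBCorpus tooling.

```python
from urllib.parse import urlsplit


def github_raw_url(blob_url: str) -> str:
    """Convert a github.com '/blob/' viewer URL into a raw file URL.

    Assumption: the path has the form user/repo/blob/branch/path/to/file
    and the branch name contains no slash.
    """
    parts = urlsplit(blob_url)
    if parts.netloc != "github.com":
        raise ValueError(f"not a github.com URL: {blob_url}")
    segments = parts.path.strip("/").split("/")
    if len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a /blob/ file URL: {blob_url}")
    user, repo, _blob, branch, *file_path = segments
    return "https://raw.githubusercontent.com/" + "/".join(
        [user, repo, branch, *file_path]
    )


viewer_url = "https://github.com/qhungngo/EVBCorpus/blob/master/EVBCorpus_EVBNews_v2.0.rar"
raw_url = github_raw_url(viewer_url)
print(raw_url)
```

The resulting URL can be passed to any downloader. Unpacking the `.rar` archive requires a separate extraction tool, which is not shown here.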

If you are interested in the corpus, please email hungnq(at)uit.edu.vn for more details.

Details of the upgraded EVBCorpus v.2.0 (2018):

Source	Document	Paragraph	Sentence	Word
Books	15	14,195	61,167	1,335,180
Fictions	100	192,898	489,787	6,129,161
Laws	250	86,848	98,064	1,981,932
ETests	500	20,288	21,575	411,093
News	5,000	94,933	173,903	2,965,590
Subtitles	2,000	1,302,839	1,447,581	8,150,080
Total	7,865	1,712,001	2,292,077	20,973,036
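The Total row of the v.2.0 table can be checked against the per-source rows with a short script. This is a minimal sketch that assumes the table is copied as tab-separated text exactly as printed above; the helper names `parse_count` and `column_sums` are illustrative.

```python
# Rows copied from the EVBCorpus v.2.0 table (tab-separated, commas kept).
TABLE = """\
Source\tDocument\tParagraph\tSentence\tWord
Books\t15\t14,195\t61,167\t1,335,180
Fictions\t100\t192,898\t489,787\t6,129,161
Laws\t250\t86,848\t98,064\t1,981,932
ETests\t500\t20,288\t21,575\t411,093
News\t5,000\t94,933\t173,903\t2,965,590
Subtitles\t2,000\t1,302,839\t1,447,581\t8,150,080
Total\t7,865\t1,712,001\t2,292,077\t20,973,036
"""


def parse_count(cell):
    """Turn a count such as '1,335,180' into an integer."""
    return int(cell.replace(",", ""))


def column_sums(table):
    """Return (sums over the source rows, values listed in the Total row)."""
    lines = table.strip().splitlines()
    header = lines[0].split("\t")[1:]
    sums = dict.fromkeys(header, 0)
    listed = {}
    for line in lines[1:]:
        name, *cells = line.split("\t")
        counts = [parse_count(c) for c in cells]
        if name == "Total":
            listed = dict(zip(header, counts))
        else:
            for column, count in zip(header, counts):
                sums[column] += count
    return sums, listed


sums, listed = column_sums(TABLE)
for column in sums:
    status = "ok" if sums[column] == listed[column] else "mismatch"
    print(f"{column}: computed {sums[column]:,}, listed {listed[column]:,} ({status})")
```

For the v.2.0 figures, all four columns add up to the listed totals.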

Details of the data sources of EVBCorpus v.1.0 (2012):

Source	Document	Paragraph	Sentence	Word
Books	15	13,980	80,323	1,375,492
Fictions	100	192,723	491,703	6,307,613
Laws	250	86,803	98,102	1,912,055
News	1,000	24,523	45,531	740,534
Total	1,365	318,029	715,659	10,431,592

English-Vietnamese Word Alignment Corpus (EVWACorpus)

The EVWACorpus contains 1,000 news articles with 45,531 sentence pairs and 740,534 words, which are aligned manually at the word level between the English and Vietnamese sentences. Details of the EVWACorpus:

--	English	Vietnamese
Files	1,000	1,000
Sentences	45,531	45,531
Words	740,534	832,441
Sure Alignments	447,906	447,906
Possible Alignments	560,215	560,215
Words in Alignments	654,060	768,031
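As a worked example of reading this table, the share of words that take part in at least one alignment, and the average sentence length, follow directly from the figures above. The sketch below only does that arithmetic; the variable names and the `coverage` helper are illustrative.

```python
# Figures copied from the EVWACorpus table above.
words = {"English": 740_534, "Vietnamese": 832_441}
words_in_alignments = {"English": 654_060, "Vietnamese": 768_031}
sentence_pairs = 45_531


def coverage(aligned, total):
    """Fraction of words that appear in at least one alignment link."""
    return aligned / total


for language in words:
    share = coverage(words_in_alignments[language], words[language])
    average_length = words[language] / sentence_pairs
    print(f"{language}: {share:.1%} of words aligned, "
          f"{average_length:.1f} words per sentence on average")
```

With these figures, roughly 88% of the English words and 92% of the Vietnamese words are covered by alignments.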

English-Vietnamese Chunker Corpus (EVChkCorpus)

The EVChkCorpus contains 1,000 news articles with 45,531 sentence pairs. It is tagged with 5 chunk tags in both the English and Vietnamese text. Details of the EVChkCorpus:

Tag	Name	English	Vietnamese
NP	Noun Phrase	212,500	209,824
VP	Verb Phrase	90,784	123,600
PP	Preposition Phrase	79,853	70,457
ADVP	Adverb Phrase	18,318
ADJP	Adjective Phrase	8,367	15,104

English-Vietnamese Named Entities Corpus (EVNECorpus)

The EVNECorpus contains 1,000 news articles with 45,531 sentence pairs. It is tagged with named entities in both the English and Vietnamese text. Details of the EVNECorpus:

Label	Name	English	Vietnamese
LOC	Location	10,115	10,006
PER	Person	6,869	6,741
ORG	Organization	7,837	7,549
PCT	Percentage	1,107	921
MON	Money	898	823
TIM	Time	4,244	4,100
-	Total	35,879	34,732

The canonical publications for the EVBNews or EVBCorpus are:

Quoc Hung Ngo, Werner Winiwarter, and Bartholomaus Wloka, (2013). "EVBCorpus - A Multi-Layer English-Vietnamese Bilingual Corpus for Studying Tasks in Comparative Linguistics", In Proceedings of the 11th Workshop on Asian Language Resources (11th ALR within the IJCNLP2013), pp. 1-9. Asian Federation of Natural Language Processing, 2013.

Quoc-Hung Ngo, Werner Winiwarter, (2012). "Building an English-Vietnamese Bilingual Corpus for Machine Translation", International Conference on Asian Language Processing 2012 (IALP 2012), pp. 157-160. IEEE Computer Society, 2012.

The canonical publication for the EVNECorpus is:

Quoc Hung Ngo, Dinh Dien, and Werner Winiwarter, (2014). "Building English-Vietnamese Named Entity Corpus with Aligned Bilingual News Articles", The 5th Workshop on South and Southeast Asian Natural Languages Processing (5th SSANLP within the COLING2014). Association for Computational Linguistics, 2014.

The canonical publication for the Annotation Tool is:

Quoc-Hung Ngo, Werner Winiwarter (2012). "A Visualizing Annotation Tool for Semi-Automatically Building a Bilingual Corpus", In Proceedings of the 5th Workshop on Building and Using Comparable Corpora, LREC2012 Workshop, pages 67-74. Association for Computational Linguistics, 2012.

The canonical publication for the GetWebContent tool is:

Quoc-Hung Ngo, Dinh Dien, Werner Winiwarter, (2012). "Automatic Searching for English-Vietnamese Documents on the Internet", The 3rd Workshop on South and Southeast Asian Natural Languages Processing (3rd SSANLP within the COLING2012), pp. 211-220. Association for Computational Linguistics, 2012.

Used for academic purposes in:

Trieu, Hai Long, Vu Tran, and Nguyen Le Minh. "Investigating phrase-based and neural-based machine translation on low-resource settings." Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation. 2017.
Trieu, Long Hai. "A Study On Machine Translation For Low-Resource Languages". Thesis of Doctor of Philosophy, JAIST, 2017.
Phuoc, Nguyen Quang, Yingxiu Quan, and Cheol-Young Ock. "Building a bidirectional English-Vietnamese statistical machine translation system by using MOSES." International Journal of Computer and Electrical Engineering 8.2 (2016): 161.
Song Cong Nguyen Duc; Q.Hung Ngo; JIAMTHAPTHAKSIN, Rachsuda. State-of-the-art Vietnamese word segmentation. In: Science in Information Technology (ICSITech), 2016 2nd International Conference on. IEEE, 2016. p. 119-124.
Nguyen, L. H., Dinh, D., & Tran, P. (2016). An Approach to Construct a Named Entity Annotated English-Vietnamese Bilingual Corpus. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 16(2), 9.
Dawborn, Timothy James. "DOCREP: Document Representation for Natural Language Processing." Thesis of Doctor of Philosophy, The University of Sydney, 2015.
Lam, Khang Nhut. "Automatically creating multilingual lexical resources." Proceedings of the Nineteenth AAAI/SIGAI Doctoral Consortium. 2014.
Huy, Dang Ngoc, and Pusadee Seresangtakul. "Vietnamese-Thai Lexicon for Machine Translation." The Tenth Symposium on Natural Language Processing (SNLP2013), Phuket, Thailand. 2013.
GIANG, Lam Tung; HUNG, Vo Trung; PHAP, Huynh Cong. Experiments with query translation and re-ranking methods in Vietnamese-English bilingual information retrieval. In: Proceedings of the Fourth Symposium on Information and Communication Technology. ACM, 2013. p. 118-122.
