Home

Awesome

TED-Parallel-Corpus

TED parallel Corpora is growing collection of Bilingual parallel corpora, Multilingual parallel corpora and Monolingual corpora extracted from TED talks www.ted.com for 109 world languages. It includes Monolingual corpus, 12 languages for Bilingual parallel corpus over 120 million aligned sentences and 13 languages for Multilingual Parallel corpus with more than 600k sentences. The goal of the extraction and processing was to generate sentence aligned text for statistical machine translation systems. All pre-processing is done automatically. No manual corrections have been carried out.

Author

Mr. Ajinkya kulkarni, Contact: ajinkyakulkarni14@gmail.com


Multilingual Parallel Corpus :


12 languages aligned Parallel corpus data : It contains Parallel aligned sentences for 12 languages which encovers ar Arabic, zh-cn Chinese, Simplified, zh-tw Chinese, Traditional, nl Dutch, fr French, de German, he Hebrew, it Italian, ja Japanese, ko Korean, ru Russian, es Spanish.

Sentences : 349049


4 languages aligned parallel corpus data: It contains Parallel aligned sentences for 4 South Asian languages which encovers zh-cn Chinese, Simplified, zh-tw Chinese, Traditional, ja Japanese, ko Korean.

Sentences : 389764


Bilingual Parallel Corpus :

Language 1Language 2SentencesLanguage 1Language 2Sentences
RussianSpanish523485KoreanFrench462616
ArabicHebrew512358KoreanHebrew485919
DutchRussian442167SpanishHebrew486466
ArabicRussian555618DutchChinese, Traditional406528
HebrewSpanish486466HebrewGerman449485
SpanishChinese, Simplified479771HebrewFrench464923
SpanishRussian523485HebrewItalian480730
RussianChinese, Simplified533541RussianItalian523015
GermanChinese, Traditional438420DutchSpanish415347
ItalianSpanish477021Chinese, SimplifiedChinese, Traditional464982
SpanishFrench463476Chinese, SimplifiedGerman442415
ArabicDutch411929KoreanSpanish486162
Chinese, TraditionalArabic473423HebrewDutch415768
FrenchItalian458939GermanHebrew449485
RussianDutch442167Chinese, TraditionalItalian455363
DutchItalian407669ArabicItalian486628
RussianArabic555618ArabicChinese, Traditional473423
Chinese, TraditionalSpanish465481Chinese, TraditionalRussian506240
GermanChinese, Simplified442415SpanishDutch415347
FrenchGerman442292DutchHebrew415768
Chinese, SimplifiedFrench458083SpanishGerman452661
ArabicSpanish491987RussianChinese, Traditional506240
Chinese, SimplifiedDutch406971HebrewChinese, Traditional473169
GermanArabic445899ArabicGerman445899
GermanDutch411134Chinese, SimplifiedItalian473247
ItalianChinese, Simplified473247ArabicFrench469558
Chinese, TraditionalDutch406528HebrewRussian541540
FrenchHebrew464923ItalianHebrew480730
HebrewArabic512358FrenchArabic469558
Chinese, SimplifiedHebrew496348RussianHebrew541540
HebrewChinese, Simplified496348GermanRussian479543
Chinese, SimplifiedArabic502194SpanishItalian477021
FrenchChinese, Traditional448751DutchArabic411929
ItalianGerman444088Chinese, TraditionalGerman438420
DutchChinese, Simplified406971SpanishArabic491987
Chinese, TraditionalHebrew473169RussianGerman479543
GermanFrench442292Chinese, TraditionalFrench448751
SpanishChinese, Traditional465481SpanishKorean486162
DutchGerman411134FrenchDutch409715
ItalianChinese, Traditional455363ItalianDutch407669
FrenchRussian500195FrenchSpanish463476
GermanSpanish452661RussianFrench500195
Chinese, TraditionalChinese, Simplified464982ItalianRussian523015
ArabicChinese, Simplified502194GermanItalian444088
FrenchChinese, Simplified458083ItalianFrench458939
Chinese, SimplifiedSpanish479771Chinese, SimplifiedRussian533541
HebrewKorean485919DutchFrench409715
FrenchKorean462616ItalianArabic486628

Monolingual Corpus :

LanguageSentencesLanguageSentencesLanguageSentences
Azerbaijan20852Swahili7204Assamese57
Chinese, Yue20940Czech272464Khmer614
Latgalian9Silesian91Norwegian Nynorsk3012
Chinese, Simplified507085Basque12303Occitan54
Algerian Arabic1716Macedonian69086Hupa3
Belarusian10965Montenegrin4181Danish128916
Macedo3068Finnish61604Igbo68
Croatian326967Hungarian398138Asturian232
Malayalam7218Punjabi48Serbian359791
Turkish433023Russian609744Irish256
Bulgarian475860Bislama49Kazakh6993
Tagalog2397Afrikaans2903Filipino7513
Nepali4350French493026Icelandic4957
Vietnamese349731German471902Mongolian19737
Albanian148541Esperanto18966French (Canada)68316
Slovak175052Georgian37013Telugu4104
Maltese343Latin46Serbo-Croatian5239
Swedish Chef375Cebuano203Tamil20805
Somali4545Uyghur1410Bosnian20522
Hindi48513Galician22368Slovenian63981
Tibetan2085Romanian454412Indonesian236543
Catalan89358Lao854Tatar277
Ingush377Ukrainian282163Kyrgyz1480
Tajik1147Kannada3716Hausa51
Arabic553483Gujarati5636Klingon131
Amharic1596Italian501685Dutch433318
Latvian60171Marathi22345Swedish121479
Estonian33236Lithuanian116956Sinhala1602
Creole, Haitian417Malagasy729Persian362411
Uzbek6201Bengali17107Hebrew535665
Pashto491Armenian69923
Spanish521162Luxembourgish217
Thai237086Portuguese, Brazilian476576
Burmese41266Urdu19861
Portuguese250967Chinese, Traditional483199
Norwegian Bokmal47441Malay23502

Author

Mr. Ajinkya kulkarni, Contact: ajinkyakulkarni14@gmail.com


Conditions of use

The TED-Multilingual-Parallel-Corpus contain text from publicly accessible source www.ted.com . All data have been processed automatically so that it is not possible to reconstruct the original source texts. They are made available on the condition that they may be used for scientific purposes only and not passed on to third parties. Any use of the data must be duly documented and referenced.


Disclaimer

The TED-Multilingual-Parallel-Corpus have been processed automatically from www.ted.com . accessible sources based on the outlined methodology without considering in detail the content of the contained text. No responsibility is taken for the content of the data. In particular, the views and opinions expressed in specific parts of the data remain exclusively with the authors. For each word, the list of words that significantly co-occur with that word are computed on the basis of the available text and neither express a general fact of language nor the particular view of author for Natural Language Processing. Please let us know if you find problems with the data or if you want the data for other language pairs.