Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

Introduction

This repository contains information about the Glot500 model, data, and code.

Glot500-m

You can use this model directly with a pipeline for masked language modeling:

>>> # Requires: pip install transformers sentencepiece
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='cis-lmu/glot500-base')
>>> unmasker("Hello I'm a <mask> model.")
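
For a single input with one mask, the fill-mask pipeline returns a list of candidate fills, each a dictionary with `score`, `token`, `token_str`, and `sequence` keys. The sketch below shows one way to pull out the top candidates. `top_fills` is a hypothetical helper (not part of transformers), and the `sample` list is made-up data standing in for real pipeline output.

```python
def top_fills(results, k=3):
    """Return (token_str, score) pairs for the k highest-scoring fills.

    `results` is expected to look like fill-mask pipeline output for one
    input with one mask: a list of dicts with at least the keys
    "token_str" and "score".
    """
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], r["score"]) for r in ranked[:k]]


# Hypothetical scores for illustration only; real values come from
# unmasker("Hello I'm a <mask> model.").
sample = [
    {"score": 0.10, "token": 11, "token_str": "new", "sequence": "Hello I'm a new model."},
    {"score": 0.25, "token": 12, "token_str": "good", "sequence": "Hello I'm a good model."},
    {"score": 0.05, "token": 13, "token_str": "big", "sequence": "Hello I'm a big model."},
]

for token, score in top_fills(sample, k=2):
    print(f"{token}\t{score:.2f}")
```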

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('cis-lmu/glot500-base')
model = AutoModelForMaskedLM.from_pretrained("cis-lmu/glot500-base")

# prepare input
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')

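# forward pass; output.hidden_states is a tuple with one tensor per layer
# (embedding output first), each of shape (batch, sequence_length, hidden_size)
output = model(**encoded_input, output_hidden_states=True)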
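
To turn the per-token hidden states into a single vector per sentence, one common option is to average the last layer over non-padding positions. The sketch below assumes that choice; the repository does not prescribe a pooling strategy here. `mean_pool` is a hypothetical helper, and the toy tensors stand in for `output.hidden_states[-1]` and `encoded_input["attention_mask"]`.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over real (non-padding) positions.

    last_hidden_state: float tensor of shape (batch, seq_len, hidden)
    attention_mask:    int tensor of shape (batch, seq_len), 1 = real token
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


# Toy tensors: one sentence, three positions, hidden size 2.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])  # third position is padding and is ignored
sentence_vector = mean_pool(hidden, mask)
print(sentence_vector.shape)
```

With the real model, the call would be `mean_pool(output.hidden_states[-1], encoded_input["attention_mask"])`.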

Glot500-m Evaluation

We provide an in-depth evaluation of Glot500-m and the baselines in our paper. Each number in the table below is an average over head languages, tail languages, or all languages. See the paper for detailed results per task and language. Glot500-m outperforms XLM-R-B (base) on all tasks for both head languages (except POS) and tail languages, and outperforms XLM-R-L (large) on all tasks for tail languages. The best result per task and language set is in bold.

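| Task | Tail: XLM-R-B | Tail: XLM-R-L | Tail: Glot500-m | Head: XLM-R-B | Head: XLM-R-L | Head: Glot500-m | All: XLM-R-B | All: XLM-R-L | All: Glot500-m |
|---|---|---|---|---|---|---|---|---|---|
| Pseudoperplexity | 304.2 | 168.6 | **12.2** | 12.5 | **8.4** | 11.8 | 247.8 | 136.4 | **11.64** |
| Sentence Retrieval Tatoeba (Top 10 Acc.) | 32.6 | 33.6 | **59.8** | 66.2 | 71.1 | **75.0** | 56.6 | 60.4 | **70.7** |
| Sentence Retrieval Bible (Top 10 Acc.) | 7.4 | 7.1 | **43.2** | 54.2 | 58.3 | **59.0** | 19.3 | 20.1 | **47.3** |
| Text Classification (F1) | 13.7 | 13.9 | **46.6** | 51.3 | **60.5** | 54.7 | 23.3 | 25.8 | **48.7** |
| NER (F1) | 47.5 | 51.8 | **60.7** | 61.8 | **66.0** | 63.9 | 55.3 | 59.5 | **62.4** |
| POS (F1) | 41.7 | 43.5 | **62.3** | 76.4 | **78.4** | 76.0 | 65.8 | 67.7 | **71.8** |
| Roundtrip Alignment (Acc.) | 2.57 | 3.13 | **4.45** | 3.42 | 4.06 | **5.46** | 2.77 | 3.34 | **4.68** |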
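
Pseudoperplexity is the only row where lower is better. As a rough guide to how such a number is typically obtained for a masked language model, the sketch below assumes the common definition: mask each token in turn, record the log-probability the model assigns to the true token, and exponentiate the negative mean over all tokens. The exact protocol behind the table is described in the paper; the function name and the toy log-probabilities are illustrative only.

```python
import math


def pseudo_perplexity(sentence_token_logprobs):
    """exp of the negative mean token log-probability over a corpus.

    sentence_token_logprobs: list of sentences, each a list of natural-log
    probabilities log P(token_t | all other tokens), one per token.
    """
    total_logprob = sum(sum(sent) for sent in sentence_token_logprobs)
    num_tokens = sum(len(sent) for sent in sentence_token_logprobs)
    if num_tokens == 0:
        raise ValueError("need at least one token")
    return math.exp(-total_logprob / num_tokens)


# Toy corpus: every masked token is predicted with probability 0.5,
# so the pseudoperplexity is 1 / 0.5.
toy = [[math.log(0.5)] * 3, [math.log(0.5)] * 2]
print(pseudo_perplexity(toy))
```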

Glot500-c

This is an overview of the corpora included in Glot500-c, as presented in our paper. Glot500-c will be sent via email once you fill in the data request form. The part that we can redistribute is available at huggingface-dataset. For more information, see the table below.

Disclaimer: None of the data sources used in this study explicitly prohibits the reuse of data for research purposes. Some sources carry copyright statements indicating that such use is permissible, while others do not. Additionally, certain sources prohibit the redistribution of data; data from these sources is omitted from the published version of Glot500-c.
As regards the ND (NoDerivs) constraint for some datasets, we only change the format of the container while preserving the original contents. The first column of the table indicates the availability of each corpus in the downloadable Glot500-c (yes/no/partially).

We ask all users of Glot500-c to cite the original creators of the datasets and to comply with each dataset's license. A BibTeX file is available.

If you are a dataset owner and wish to update any part of this overview, or do not want your dataset to be included in Glot500-c, please send us an email at glot500@cis.lmu.de.

Glot500-c overview table:

<details> <summary> <b> Click to Expand Table </b> </summary>
Available | Dataset | Related Papers | Languages | Domain / Notes | Data collection / Verification method | License
Partially1000Langs-1500 languagesReligiousWeb-crawledApache License 2.0
YesAddLinkarz, afb, ajp, apcDialects, Arabic commentariesAnnotatedFreely available for research purposes
YesAfriBERTaLinkamh, hau, ibo, orm, pcm, som, swa, tir, yormostly BBC, some Common CrawlApache License 2.0
YesAfroMAFTLink ; Link<details> afr, amh, ara, eng, fra, hau, ibo, mlg, nya, orm, pcm, kin, sna, som, sot, swa, xho, yor, zul <summary> expand </summary> </details>Language Adaptation Corpushttps://www.nationalarchives.gov.uk/doc/non-commercial-government-licence/version/2/
YesAI4BharatLink<details> pan, hin, ben, ori, asm, guj, mar, kan, tel, mal, tam <summary> expand </summary> </details>News, magazine, blog postsAutomatically curatedCC BY-NC-SA 4.0
YesAIFORTHAI-LotusCorpus-thaLarge vOcabulary Thai continUous Speech recognition (LOTUS) corpusCC BY-NC-SA 3.0 TH, 2005 Copyright by National Electronics and Computer Technology Center (NECTEC). For more information, visit http://www.nectec.or.th/rdi/lotus
YesAkuapem-akaParallel sentencesVerified by native speakersCC-BY 4.0
YesAnuvaad-<details> hin, ben, tam, mal, tel, kan, mar, pan, guj, asm, urd, ori <summary> expand </summary> </details>Various domains (General, Legal, Education, Healthcare, Automobile, News)CC-BY 4.0
YesAraBenchLinkarz, apc, afb, aryTranslations of 'travelling phrases', blogs, tv transcripts, BibleAvailable Dialectal Arabic-English resources and with curated evaluation setsApache License 2.0
YesAUTSHUMATO-tsn, tsoSouth African government domainCreative Commons Attribution 2.5 South Africa License
YesBianetLinkkur, eng, turParallel news corpusAutomatically curatedCC-BY-SA 4.0 open license
YesBLOOMLink<details> aaa, abc, ada, adq, aeu, agq, ags, ahk, aia, ajz, aka, ame, amp, amu, ann, aph, awa, awb, azn, azo, bag, bam, baw, bax, bbk, bcc, bce, bec, bef, bfd, bfm, bfn, bgf, bho, bhs, bis, bjn, bjr, bkc, bkh, bkm, bkx, bob, bod, boz, bqm, bra, brb, bri, brv, bss, bud, buo, bwt, bwx, bxa, bya, bze, bzi, cak, cbr, cgc, chd, chp, cim, clo, cmo, csw, cuh, cuv, dag, ddg, ded, dig, dje, dmg, dnw, dtp, dtr, dty, dug, eee, ekm, enb, enc, ewo, fli, fon, fub, fuh, gal, gbj, gou, gsw, guc, guz, gwc, hao, hbb, hig, hil, hla, hna, hre, hro, idt, ilo, ino, isu, jgo, jmx, jra, kak, kam, kau, kbq, kbx, kby, kek, ken, khb, kik, kin, kjb, kmg, kmr, kms, kmu, kqr, krr, ksw, kvt, kwd, kwu, kwx, kxp, kyq, laj, lan, lbr, lfa, lgg, lgr, lhm, lhu, lkb, llg, lmp, lns, loh, lsi, lts, lug, luy, lwl, mai, mam, mdr, mfh, mfj, mgg, mgm, mgo, mgq, mhx, miy, mkz, mle, mlk, mlw, mmu, mne, mnf, mnw, mot, mqj, mrn, mry, msb, muv, mve, mxu, myk, myx, mzm, nas, nco, new, nge, ngn, nhx, njy, nla, nlv, nod, nsk, nsn, nso, nst, nuj, nwe, nwi, nxa, nxl, nyo, nyu, nza, odk, oji, oki, omw, ozm, pae, pag, pbt, pce, pcg, pdu, pea, pex, pis, pkb, pmf, pnz, psp, pwg, qaa, qub, quc, quf, quz, qve, qvh, qvm, qvo, qxh, rel, rnl, roo, rue, rug, saq, sat, sdk, sea, sgd, shn, sml, snk, snl, sox, sps, ssn, stk, sxb, syw, taj, tbj, tdb, tdg, tdt, teo, tet, the, thk, thl, thy, tio, tkd, tnl, tnn, tnp, tnt, tod, tom, tpi, tpl, tpu, tsb, tsn, tso, tuv, tuz, tvs, udg, unr, ven, vif, war, wbm, wbr, wms, wni, wnk, wtk, xkg, xmd, xmg, xmm, xog, xty, yas, yav, ybb, ybh, ybi, ydd, yea, yet, yin, ymp, zaw, zlm, zuh <summary> expand </summary> </details>WebCrawl from Internet and filteringCC BY 4.0
YesCMU_Haitian_Creole-hat, engMedical domain phrases and sentences in English translated into Haitian Creole by Eriksen Translations, Inc.Curatedhttp://www.speech.cs.cmu.edu/haitian/text/COPYING
YesCC100Link ; Link<details> asm, ful, grn, lim, lin, lug, nso, orm, que, roh, srd, ssw, tsn, wol <summary> expand </summary> </details>WebCrawl from InternetStatistical Machine Translation at the University of Edinburgh makes no claims of intellectual property on the work of preparation of the corpus. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.
YesCCNetLinkMultiple languagesMultiple domainsDatasets from Common CrawlMIT License
YesClarin (subset)-Multiple languagesMultiple domainsMultipleCC-BY 4.0
YesCORP.NCHLT-<details> nde, nso, sot, ssw, tsn, tso, ven, xho, zul <summary> expand </summary> </details>VariousVariousCreative Commons Attribution 2.5 South Africa License
YesDARTLinkarz, afb, acm, apc, aryTweetsAnnotators involved also for quality controlPublicly available
YesEarthlingsLink<details> acu, afr, amh, amu, asm, aze, bel, ben, bod, bus, cak, cbc, cbs, cbv, ceb, chv, coe, crn, csb, cym, des, div, dop, epo, eus, fao, gle, glg, guj, gum, gym, hat, hbs, hye, ido, ilo, ipi, isl, jav, kab, kal, kan, kaz, khm, kir, knv, kpr, kur, kyc, kyq, lao, lez, lus, maa, mal, mar, maz, mkd, mlg, mlp, mon, mop, mpx, mri, mya, myy, nep, opm, ori, pan, pck, pir, poh, ptu, pus, que, sab, sah, scn, sin, sja, sme, snd, som, srd, srm, sua, swa, tat, tbc, tbz, tca, tel, tgk, tgl, tpi, tuk, ubu, udm, uig, urd, uzb, wal, wln, wol, yid, yor <summary> expand </summary> </details>Subset of CommonCrawlCrawl from Internet and filteringGNU-GPL v.3 License
YesFlores200Link<details> ace_Arab, ace_Latn, acm_Arab, acq_Arab, aeb_Arab, afr_Latn, ajp_Arab, aka_Latn, als_Latn, amh_Ethi, apc_Arab, arb_Arab, arb_Latn, ars_Arab, ary_Arab, arz_Arab, asm_Beng, ast_Latn, awa_Deva, ayr_Latn, azb_Arab, azj_Latn, bak_Cyrl, bam_Latn, ban_Latn, bel_Cyrl, bem_Latn, ben_Beng, bho_Deva, bjn_Arab, bjn_Latn, bod_Tibt, bos_Latn, bug_Latn, bul_Cyrl, cat_Latn, ceb_Latn, ces_Latn, cjk_Latn, ckb_Arab, crh_Latn, cym_Latn, dan_Latn, deu_Latn, dik_Latn, dyu_Latn, dzo_Tibt, ell_Grek, eng_Latn, epo_Latn, est_Latn, eus_Latn, ewe_Latn, fao_Latn, fij_Latn, fin_Latn, fon_Latn, fra_Latn, fur_Latn, fuv_Latn, gaz_Latn, gla_Latn, gle_Latn, glg_Latn, grn_Latn, guj_Gujr, hat_Latn, hau_Latn, heb_Hebr, hin_Deva, hne_Deva, hrv_Latn, hun_Latn, hye_Armn, ibo_Latn, ilo_Latn, ind_Latn, isl_Latn, ita_Latn, jav_Latn, jpn_Jpan, kab_Latn, kac_Latn, kam_Latn, kan_Knda, kas_Arab, kas_Deva, kat_Geor, kaz_Cyrl, kbp_Latn, kea_Latn, khk_Cyrl, khm_Khmr, kik_Latn, kin_Latn, kir_Cyrl, kmb_Latn, kmr_Latn, knc_Arab, knc_Latn, kon_Latn, kor_Hang, lao_Laoo, lij_Latn, lim_Latn, lin_Latn, lit_Latn, lmo_Latn, ltg_Latn, ltz_Latn, lua_Latn, lug_Latn, luo_Latn, lus_Latn, lvs_Latn, mag_Deva, mai_Deva, mal_Mlym, mar_Deva, min_Arab, min_Latn, mkd_Cyrl, mlt_Latn, mni_Beng, mos_Latn, mri_Latn, mya_Mymr, nld_Latn, nno_Latn, nob_Latn, npi_Deva, nso_Latn, nus_Latn, nya_Latn, oci_Latn, ory_Orya, pag_Latn, pan_Guru, pap_Latn, pbt_Arab, pes_Arab, plt_Latn, pol_Latn, por_Latn, prs_Arab, quy_Latn, ron_Latn, run_Latn, rus_Cyrl, sag_Latn, san_Deva, sat_Olck, scn_Latn, shn_Mymr, sin_Sinh, slk_Latn, slv_Latn, smo_Latn, sna_Latn, snd_Arab, som_Latn, sot_Latn, spa_Latn, srd_Latn, srp_Cyrl, ssw_Latn, sun_Latn, swe_Latn, swh_Latn, szl_Latn, tam_Taml, taq_Latn, taq_Tfng, tat_Cyrl, tel_Telu, tgk_Cyrl, tgl_Latn, tha_Thai, tir_Ethi, tpi_Latn, tsn_Latn, tso_Latn, tuk_Latn, tum_Latn, tur_Latn, twi_Latn, tzm_Tfng, uig_Arab, ukr_Cyrl, umb_Latn, urd_Arab, uzn_Latn, vec_Latn, vie_Latn, war_Latn, wol_Latn, xho_Latn, 
ydd_Hebr, yor_Latn, yue_Hant, zho_Hans, zho_Hant, zsm_Latn, zul_Latn <summary> expand </summary> </details>MiscHuman annotatedCC-BY-SA 4.0
FrenchEwe-ewe, fraParallel sentencesAnnotatedCC-BY 4.0
YesFFRLinkfon, fraParallel sentencesClean curated corporaMIT License and Licence Creative Commons Attribution - No Commercial Use - Sharing under the Same Conditions 4.0 International.
YesGiossaMediaLink ; Linkspa, grnParallel sentences, news and social mediaAutomatically curatedalso used by NLLB, freely available
YesGlossesLink256 languagesDisambiguated glossesWikipedia, Wiktionary, WordNet, OmegaWiki and Wikidata.CC BY-NC-SA 3.0
YesHabibiLinkarz, afb, acm, ary, apd, apcSong lyricsCollected from the WebFreely available for research purposes
YesHindialectLink<details> anp, awa, ben, bgc, bhb, bhd, bho, bjj, bns, bra, gbm, guj, hin, hne, kfq, kfy, mag, mar, mis, mup, noe, pan, raj, san <summary> expand </summary> </details>script all in DevanagarifolksongsCC BY-NC-SA 4.0
YesHornMT-aar, amh, eng, orm, som, tirmulti-way parallel corpusCC-BY 4.0
YesIITBLinkeng, hinCollected from different sources and corporaAutomatically collectedCC-BY-NC 4.0
YesIndiccorpLinkasm, ben, guj, kan, mal, mar, ory, pan, telWebWeb crawledCC BY-NC-SA 4.0
YesisiZulu-zul, engEnglish sentences, sampled from News Crawl datasets that were translated into isiZuluAnnotatedCC BY 4.0
YesJESCLinkeng, jpnMovie and tv subtitlesWeb-crawledCC-BY-NC 4.0
YesJParaCrawlLinkeng, jpnVarious domainsWeb crawled, automatically alignedCustom License
NoJW-ReligiousWeb crawledPrivate
YesKinyaSMTLinkkin, engBible+otherAutomatically translatedGNU General Public License v3.0
YesLeipzigDataLink<details> aar, ace, ach, aka, als, als-al, als-sqi, anw, arg, arz, asm, ast, aym, aze, azj, azj-az, bak, bam, ban, ban-id, bar, bcl, bem, bew, bih, bik, bjn, bjn-id, bod, bos, bpy, bua, bug, cdo, ceb, che, chv, ckb, cos, csb, diq, div, div-mv, dsb, dyu, ekk, emk, eml, ewe, ext, fao, fao-fo, fon, frr, fuc, ful, gan, glk, glv, gom, grn, gsw, gsw-ch, guj, hat, hat-ht, hbs, hbs-rs, hif, hil, hsb, ibb, ibo, ido, ile, ilo, ina, kab, kal, kal-gl, kas, kbd, kde, kea, khk, kik, kin, kng, knn, knn-in, koi, kom, kon, krc, ksh, ksw, lad, lgg, lim, lim-nl, lin, lmo, ltz, ltz-lu, lug, lup, lus, lus-in, lvs, mad, mad-id, mai, mhr, min, min-id, mkw, mlt, mos, mri, mri-nz, mrj, mwl, myv, mzn, nan, nap-tara, nav, nbl, ndo, nds, nds-nl, new, ngl, nno, nno-no, nob, nob-com, nob-no, nso, nso-za, nya, nyn, oci, oci-fr, orm, oss, pag, pam, pap, pcm, pfl, plt, pms, pnb, pnt, pus, roh, roh-ch, rom, rue, rue-ua, run, sah, san, scn, sco, seh, sgs, sin, skr, sme, sme-no, smi, sna, sna-zw, snd, snk, som, sot, sot-za, srd, ssw, ssw-za, suk, sun, sun-id, sus, swa, swh, szl, tat, tel, tem, tgk, tgk-tj, tgk-uz, tgl, tir, tiv, tsn, tsn-bw, tsn-za, tso, tso-za, tuk, tuk-tm, tum, tyv, udm, uig, uzb, uzn-uz, vec, vec-br, vec-hr, ven, ven-za, vls, vol, vro, war, wln, wol, wuu, xmf, ydd, yid, yor, zea, zha, zsm, zul, zul-za <summary> expand </summary> </details>Wikipedia, News, WebCrawl corpora of different yearsCrawl from InternetCC BY-NC-SA 3.0
YesLindat-Multiple languagesMultipleMultipleCC-BY-NC 4.0
YesLingala_Song_Lyrics-fra, linContent scraped from the website www.ndombolo.co, which has almost 30 songs in Lingala with their French translationsWeb scrapedalso used by NLLB, freely available
Lyrics-<details> aar, abq, adq, ady, agx, aih, ain, aka, akk, ale, ami, ang, arg, arn, arp, asm, ast, aym, bak, bam, bci, bft, bfy, bgc, bhb, bho, bik, bis, bns, bod, bsk, bvd, bya, cab, cbk, cha, che, chg, cho, chr, chv, ckm, cnr, com, cor, cre, crh, csb, ctg, dak, dng, doi, dua, dum, dyu, dzo, enm, evn, ewe, ewo, ext, fao, fij, fon, frm, fro, fur, gag, gbm, gil, gla, glg, glk, gmh, goh, gon, got, gqn, grc, grt, hif, hil, hlb, hne, hop, hsb, ido, ina, inh, ist, izh, jam, jbo, kab, kas, kbd, kca, kdr, kea, kfy, kha, kik, kin, kio, kir, kjh, kmb, kok, kom, kon, krc, krl, kru, ksh, kum, lad, lbj, ldd, lij, lin, lki, lkt, lmo, ltg, lzh, lzz, mag, mah, mai, mbx, mby, min, mjw, mnc, mni, mnk, mns, moh, mos, mrg, mus, mwl, mxi, nan, nap, nav, nds, new, nio, niu, nog, non, nys, oci, odt, ohu, orm, ory, ota, pag, pap, pau, pcd, pcm, pdt, pjt, pli, pnt, pot, que, qya, raj, rar, rhg, roh, rom, rop, rtm, rup, sag, sah, sat, scn, sco, sdc, sel, sgh, sgs, sjn, skr, slr, smn, srn, ssw, sux, syl, szl, tah, tat, tbh, tcy, tet, tir, tlh, tpi, tsn, tuk, twe, twi, tyv, tzo, udm, uig, uki, ulk, unr, vec, ven, vep, vot, wbl, wol, wym, xal, xmf, xno, xxb, yux, zap, zha, zpu, zun, zza <summary> expand </summary> </details>Song lyricsWeb-crawled
YesMaCoCuLinkmltCrawl from Internet and filteringCC0 - No Rights Reserved
YesMakerere MT Corpus-lug, engParallel sentencesAnnotatedCC BY 4.0
YesMasakhane MT Corpus-African languagesMultiple domainsMultipleMIT License
YesMburisano_Covid-afr, eng, nde, sot, ssw, tsn, tso, ven, xho, zulCorpus with limited domainManually translatedCC BY 3.0
YesMC4Link<details> aze, ceb, cos, fil, guj, hat, haw, hmn, ibo, ltz, mlt, mri, nya, smo, sna, sot, sun, tgk, yor, zul <summary> expand </summary> </details>WebCrawl from InternetODC-By
YesMenyo20KLinkyor, engParallel, multidomain<details> News articles (JW), TED talks, movie transcripts, radio transcripts, science and technology texts, and other short articles curated from the web and professional translators <summary> Various sources: </summary> </details>Non-commercial use
YesMinangkabau corporaLinkmin_Latn, indParallel sentencesAnnotatedMIT License
YesMoTLinkkin, lin, nde, orm, bod, tirData collected from Voice of America (VOA) news websitesMIT License
PartiallyMTDataLinkMultiple languagesVarious sourcesMultiple licenses (check spreadsheet)
YesNart/abkhaz-abkmultiple sourcesCreative Commons Universal Public Domain License
YesNdc without informant codesdan, fao, isl, ovd, sweNordic Dialect Corpus comprises recorded speech data from the Nordic countries, in languages that belong to the North Germanic language family.VariousCC BY-NC-SA 4.0
YesNLLB_seedLink<details> ace_Arab, ace_Latn, ary, arz, bam, ban, bho, bja_Arab, bjn_Latn, bug, crh, dik, dzo, fur, fuv, grn, hne, kas_Latn, kas_Deva, knc_Arab, knc_Latn, lij, lim, lmo, ltg, mag, mni, mri, nus, prs, pbt, scn, shn, srd, szl, taq_Tfng, taq_Latn, tzm, vec <summary> expand </summary> </details>Collection of topics in different fields of knowledge and human activityProfessionally-translated sentences in the Wikipedia domainCC-BY-SA 4.0
OfisPublikLink ; LinkbreTexts from the Ofis Publik ar Brezhoneg (Breton Language Board) provided by Francis Tyers
PartiallyOPUSLinkCollection of translated texts from the webAutomatically collectedMultiple licenses (check spreadsheet)
YesOSCARLink<details> als, arg, arz, asm, ast, ava, aze, bak, bho, bod, bos, bpy, bxr, ceb, che, chv, ckb, cor, diq, div, dsb, eml, gom, grn, guj, hbs, hsb, ido, ilo, ina, jbo, kom, krc, lez, lim, lmo, ltz, mai, mhr, min, mlt, mrj, mzn, nah, nds, new, nno, oci, oss, pms, pnb, que, sah, scn, sun, tat, tgk, tuk, vol, war, wln, wuu, xal, xmf, yor <summary> expand </summary> </details>Web crawledCrawl from Internet and filteringCC BY 4.0
YesParaCrawl (subset)Linkeng, ukrVarious domainsWeb-crawledCC0
Upon direct requestParallel Bible CorpusLinkReligiousAutomatically collectedYou can contact Michael Cysouw, Philipps University of Marburg, to request access to the PBC for academic purposes.
YesParallel Corpora for Ethiopian LanguagesLinkamh, orm, tirParallel sentences, religious domainAutomatically curatedCC-BY 4.0
YesPhontron-eng, jpnWikipediaAnnotatedCC-BY-SA 3.0
YesQADILink<details> afb, abv, arq, arz, acm, apc, ary, acx, ajp, apd, aeb <summary> expand </summary> </details>TweetsTweetsApache License 2.0
YesQuechua-IICLinkquemultiple sourcesApache License 2.0
YesShamiLinkapc, ajpSeveral topics from regular conversations such as politics, education, society, health care, house keeping and othersAutomatic and manual approachesApache License 2.0
YesSLI_GalWeb.1.0LinkglgGalician political party, newspaper, government official websiteCrawling data from many Web data sourcesCC BY 4.0
YesStanford NLP: nmtLinkeng, deu, cze
PartiallyStatMT-Multiple languagesVarious sourcesVarious sourcesMultiple licenses (check spreadsheet)
YesTatoeba-<details> abk, acm, ady, afb, afh, afr, aii, ain, ajp, akl, aln, alt, amh, ang, aoz, apc, ara, arg, arq, ary, arz, asm, ast, avk, awa, ayl, aym, aze, bak, bal, bam, ban, bar, bcl, bel, ben, ber, bfz, bho, bis, bjn, bod, bom, bos, bre, brx, bua, bul, bvy, bzt, cat, cay, cbk, ceb, ces, cha, che, chg, chn, cho, chr, chv, cjy, ckb, ckt, cmn, cmo, cor, cos, cpi, crh, crk, crs, csb, cycl, cym, cyo, dan, deu, diq, div, dng, drt, dsb, dtp, dws, egl, ell, emx, eng, enm, epo, est, eus, evn, ewe, ext, fao, fij, fin, fkv, fra, frm, fro, frr, fry, fuc, fur, fuv, gaa, gag, gan, gbm, gcf, gil, gla, gle, glg, glv, gom, gos, got, grc, grn, gsw, guc, guj, hak, hat, hau, haw, hax, hbo, hdn, heb, hif, hil, hin, hnj, hoc, hrv, hrx, hsb, hsn, hun, hye, iba, ibo, ido, igs, iii, ike, ile, ilo, ina, ind, isl, ita, izh, jam, jav, jbo, jdt, jpa, jpn, kaa, kab, kal, kam, kan, kas, kat, kaz, kek, kha, khm, kin, kir, kiu, kjh, klj, kmr, knc, koi, kor, kpv, krc, krl, ksh, kum, kxi, kzj, laa, lad, lao, lat, ldn, lfn, lij, lim, lin, lit, liv, lkt, lld, lmo, lou, ltg, ltz, lug, lut, lvs, lzh, lzz, mad, mah, mai, mal, mar, max, mdf, mfa, mfe, mgm, mhr, mic, mik, min, mkd, mlg, mlt, mnc, mni, mnr, mnw, moh, mon, mri, mrj, mus, mvv, mwl, mww, mya, myv, nah, nan, nau, nav, nch, nds, new, ngt, ngu, niu, nld, nlv, nnb, nno, nob, nog, non, nov, npi, nst, nus, nya, nys, oar, oci, ofs, oji, ood, ori, orv, osp, oss, osx, ota, otk, pag, pal, pam, pan, pap, pau, pcd, pdc, pes, pfl, phn, pli, pms, pnb, pol, por, ppl, prg, pus, quc, que, qxq, qya, rap, rel, rhg, rif, roh, rom, ron, rue, run, rus, ryu, sag, sah, san, sat, scn, sco, sdh, sgs, shi, shs, shy, sin, sjn, skr, slk, slv, sma, sme, smo, sna, snd, som, sot, spa, sqi, srd, srn, srp, ssw, stq, sun, sux, swc, swe, swg, swh, syc, szl, tah, tam, tat, tel, tet, tgk, tgl, tha, thv, tig, tir, tkl, tlh, tly, tmr, tmw, toi, tok, ton, tpi, tpw, tsn, tso, tts, tuk, tur, tvl, tyv, tzl, udm, uig, ukr, umb, urd, urh, uzb, vec, vep, vie, vol, vro, war, wln, 
wol, wuu, xal, xho, xmf, xqa, yid, yor, yua, yue, zea, zgh, zlm, zsm, zul, zza <summary> expand </summary> </details>180922 versionVoluntary contributions of thousands of membersCC-BY 2.0 FR, CC0 1.0 Universal (more info)
YesTeDDiLink<details> abk, aey, amp, ape, apu, arn, arz, ayz, bmi, bsk, bsn, cha, ckt, crk, dgz, dni, fij, gni, gry, gug, gyd, hae, hau, hix, hnj, imn, jac, kal, kan, kew, kgo, khk, kio, kjq, kut, laj, lue, lvk, mig, mph, mya, myh, myp, mzh, naq, ote, pav, plt, pwn, qvi, ram, rap, rma, sag, spp, swh, tiw, tml, tzm, vma, wba, wic, wyb, xsu, yad, yaq, yor, zoc, zul <summary> expand </summary> </details>Collection of different sources (see paper)Language identification and filteringCC BY-NC-SA 4.0
YesTICOLink<details> amh, ara, ben, ckb, din, eng, fas, fra, fuv, hau, hin, ind, khm, knc, kmr, lug, lin, mar, msa, mya, npi, nus, orm, prs, por, pus, rus, kinn, som, spa, swh, tam, tir_et, tir_er, tgl, urd, zho, zul <summary> expand </summary> </details>COVID-19 materials for a variety of the world’s languagesAnnotatedCC0 1.0 Universal
YesTILLink<details> aze, bak, chv, eng, kaz, kir, rus, tuk, tur, tat, uig, uzb <summary> expand </summary> </details>Large-scale parallel corpus combining most of the public datasets for 22 Turkic languagesAutomatically collectedCC BY-NC-SA 4.0
YesTildeLinkVarious domainsAutomatically curatedCC-BY 4.0
YesW2C-122 languagesCorpusAutomatically collected from wikipedia and the webCC BY-SA 3.0
YesWAT 2020https://arxiv.org/abs/2008.04550Asian languagesMultiple domainsCollection of corporaCC-BY-NC 4.0
YesWikipedia-<details> aar, abk, ace, ady, aka, als, ang, arc, arg, arz, asm, ast, atj, ava, aym, aze, bak, bam, bar, bcl, ben, bih, bis, bjn, bod, bos, bpy, bre, bug, bul, bxr, cbk, cdo, ceb, cha, che, cho, chr, chu, chv, chy, ckb, cor, cos, cre, crh, csb, din, diq, div, dsb, dty, dzo, eml, ewe, ext, fao, fij, frp, frr, ful, fur, gag, gan, glg, glk, glv, gom, gor, got, grn, guj, hak, hat, haw, hbs, hif, hmo, hsb, ibo, ido, iii, iku, ile, ilo, ina, inh, ipk, isl, jam, jbo, jpn, kaa, kab, kal, kas, kbd, kbp, kik, kin, koi, kom, kon, krc, ksh, kua, lad, lbe, lez, lfn, lij, lim, lin, lmo, lrc, ltg, ltz, lug, lzh, mah, mai, mdf, mhr, min, mlt, mri, mrj, mus, mwl, myv, mzn, nah, nan, nap, nau, nav, ndo, nds, new, nno, nov, nrm, nso, nya, oci, olo, orm, oss, pag, pam, pan, pap, pcd, pdc, pfl, pih, pli, pms, pnb, pnt, que, rmy, roh, rue, run, rup, rus, sag, sah, sat, scn, sco, sgs, sme, smo, sna, sot, srd, srn, ssw, stq, sun, szl, tah, tat, tcy, tet, tgk, tir, ton, tpi, tsn, tso, tuk, tum, twi, tyv, udm, vec, ven, vep, vls, vol, vro, war, wln, wol, wuu, xal, xmf, yor, yue, zea, zha, zul <summary> expand </summary> </details>20221001WikipediaCC BY-NC-SA 3.0
YesWikiMatrixLink85 languagesWikipediaAutomatically curatedCC-BY-SA
YesWorkshop on NER for South and South East Asian LanguagesLinkben, ori, urdAnnotatedData can be freely used for non-profit research work under the Creative Commons License.
XhosaNavyLinkxho, engSouth African Navy parallel corpus
YesXLSumLinkaze, guj, ibo, orm, run, tir, yorBBCCC BY-NC-SA 4.0
</details> <br/>

Training and Evaluation Code

Prerequisites

We use two settings due to package conflict:

Data preparation

To train both the tokenizer and the model of Glot500-m, we first need to prepare a balanced corpus covering all languages.

Go to 'preprocessing/' and run:

bash merge_files.sh

Set --data_directory to the directory containing the data for each language and --save_directory to the directory where the merged file should be written. For Glot500, we set --scale 1 when training the tokenizer and --scale 30 for continued pretraining of the model.
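
How merge_files.sh balances languages, and what --scale does internally, is defined by the script itself and is not described here. For intuition only, the sketch below shows exponent-smoothed sampling, a common way to upsample low-resource languages in multilingual corpora. The exponent `alpha`, the function name, and the sentence counts are assumptions for illustration, not values taken from this repository.

```python
def smoothed_sampling_probs(sentence_counts, alpha=0.3):
    """Map per-language sentence counts to sampling probabilities.

    Each language gets probability proportional to (count / total) ** alpha.
    alpha = 1 keeps the raw proportions; a smaller alpha moves probability
    mass from large languages to small ones.
    """
    total = sum(sentence_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in sentence_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical counts: one large and one small language.
counts = {"eng": 9_000_000, "fon": 1_000}
for lang, p in smoothed_sampling_probs(counts).items():
    print(lang, round(p, 3))
```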

Vocabulary Extension

Go to 'tokenization/' and run:

bash train.sh

Set --input_fname to the merged data file used to train the tokenizer and --save_directory to the directory where the final tokenizer should be saved.
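
Conceptually, vocabulary extension keeps the existing token ids untouched and appends tokens that the original vocabulary lacks. The sketch below illustrates that idea on plain dictionaries. It is an assumption about the general technique, not a description of what train.sh does; `extend_vocab` and the toy tokens are hypothetical.

```python
def extend_vocab(base_vocab, new_tokens):
    """Return a copy of base_vocab with unseen tokens appended.

    base_vocab: dict mapping token -> id
    new_tokens: iterable of candidate tokens from a newly trained tokenizer
    Existing ids are preserved so pretrained embeddings stay aligned;
    new tokens receive ids after the current maximum.
    """
    extended = dict(base_vocab)
    next_id = max(extended.values(), default=-1) + 1
    for token in new_tokens:
        if token not in extended:
            extended[token] = next_id
            next_id += 1
    return extended


base = {"<s>": 0, "</s>": 1, "▁the": 2}
merged = extend_vocab(base, ["▁the", "▁ɖo", "▁wɛ"])
print(merged)
```

After the vocabulary grows, the model's embedding matrix has to grow with it; in transformers this is done with `model.resize_token_embeddings(len(tokenizer))`.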

Continued Pretraining

Go to 'modeling/' and run:

bash train_bash.sh

Set train_file to the merged data file for continued pretraining, --tokenizer_name to the trained Hugging Face-style tokenizer, --output_dir to the directory for logs and checkpoints saved during training, and --cache_dir to the directory for the Hugging Face cache.

Evaluation

Download Datasets

To download the datasets for NER, POS, and Sentence Retrieval Tatoeba, first go to 'evaluation/download_data' and create a download folder with mkdir -p download. Then manually download panx_dataset (for NER) from here (note that it will download as AmazonPhotos.zip) and place it in the download directory. Finally, run the following command under 'evaluation/download_data' to download and process the datasets:

bash download_data.sh

To obtain the datasets for Sentence Retrieval Bible and Round-Trip Alignment, contact Michael Cysouw, Philipps University of Marburg, to request access to the Parallel Bible Corpus for academic purposes.

Sequence Labeling

For NER evaluation, go to 'evaluation/tagging' and run:

bash evaluate_ner.sh

Set DATA_DIR to the directory of the NER dataset and OUTPUT_DIR to the directory for saving the fine-tuned models and evaluation results.

For POS evaluation, go to 'evaluation/tagging' and run:

bash evaluate_pos.sh

Set DATA_DIR to the directory of the POS dataset and OUTPUT_DIR to the directory for saving the fine-tuned models and evaluation results.
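
The evaluation table reports F1 for NER. Assuming the usual entity-level convention, where a predicted entity counts only if its type and span both match the gold annotation, the score can be sketched as below. Whether the evaluation scripts follow exactly this convention is an assumption; the helper names and the toy tag sequences are hypothetical.

```python
def extract_entities(tags):
    """Collect (type, start, end) spans from a BIO tag sequence.

    I- tags without a matching preceding B- tag are ignored in this
    sketch; other scorers may handle them differently.
    """
    entities, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            entities.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_f1(gold_sentences, pred_sentences):
    """Micro-averaged entity-level F1 over parallel lists of tag sequences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# One sentence: the person is found, the location is missed.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(entity_f1(gold, pred))
```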

Sentence Retrieval

For Sentence Retrieval Tatoeba evaluation, go to 'evaluation/retrieval' and run:

bash evaluate_retrieval_tatoeba.sh

Set DATA_DIR to the directory of the Sentence Retrieval Tatoeba dataset and OUTPUT_DIR to the directory for saving the fine-tuned models and evaluation results.

For Sentence Retrieval Bible evaluation, go to 'evaluation/retrieval' and run:

bash evaluate_retrieval_bible.sh

Set DATA_DIR to the directory of the Sentence Retrieval Bible dataset and OUTPUT_DIR to the directory for saving the fine-tuned models and evaluation results.
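
Both retrieval tasks are reported as top-10 accuracy. A minimal sketch of that metric follows, assuming a precomputed similarity matrix (for example, cosine similarity between sentence embeddings) in which the gold target of source sentence i is target sentence i. The function name and the toy matrix are hypothetical and not taken from the evaluation scripts.

```python
def top_k_accuracy(similarity, k=10):
    """Fraction of queries whose gold target is among the k most similar.

    similarity[i][j] is the similarity between source sentence i and
    target sentence j; the gold target of source i is assumed to be j = i.
    """
    if not similarity:
        raise ValueError("similarity matrix is empty")
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (hypothetical values).
sim = [
    [0.9, 0.1, 0.3],  # gold target ranked 1st
    [0.8, 0.5, 0.2],  # gold target ranked 2nd
    [0.7, 0.6, 0.1],  # gold target ranked 3rd
]
print(top_k_accuracy(sim, k=1), top_k_accuracy(sim, k=2))
```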

Round-Trip Alignment

For Round-Trip Alignment evaluation, go to 'evaluation/round-trip' and run:

python evaluate_roundtrip.py
<br/>

Citation

If you find our model, data or the overview of data useful for your research, please cite:

@inproceedings{imanigooghari-etal-2023-glot500,
	title        = {Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages},
	author       = {ImaniGooghari, Ayyoob  and Lin, Peiqin  and Kargaran, Amir Hossein  and Severini, Silvia  and Jalili Sabet, Masoud  and Kassner, Nora  and Ma, Chunlan  and Schmid, Helmut  and Martins, Andr{\'e}  and Yvon, Fran{\c{c}}ois  and Sch{\"u}tze, Hinrich},
	year         = 2023,
	month        = jul,
	booktitle    = {Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
	publisher    = {Association for Computational Linguistics},
	address      = {Toronto, Canada},
	pages        = {1082--1117},
	url          = {https://aclanthology.org/2023.acl-long.61}
}

Acknowledgements

This repository is built on top of transformers and xtreme.