



The CAMeL Arabic Frequency Lists dataset is derived from the datasets used to pretrain the CAMeLBERT family of models (16.1M unique word types / 17.3B word tokens). The data covers three main varieties of Arabic: Classical Arabic (CA), Dialectal Arabic (DA), and Modern Standard Arabic (MSA).

To download, please click on the link below:

For details about the different genres and sources of the data, please refer to the CAMeLBERT paper (Inoue et al., 2021), cited at the end of this document.

Each frequency list file contains the unique Arabic-only word types along with their frequencies as they appeared in the pretraining data. We excluded digits, punctuation, and non-Arabic script tokens.
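The exact filter used to build the lists is not described here. As a rough illustration only, an Arabic-only check could be approximated as below. The character ranges and the name `is_arabic_word` are assumptions made for this sketch, not the authors' implementation.

```python
import re

# Assumption: an "Arabic-only" token consists solely of Arabic letters
# (U+0621-U+063A, U+0641-U+064A) and optional diacritics (U+064B-U+0652).
# Arabic-Indic digits (U+0660-U+0669), Arabic punctuation such as U+060C,
# and the tatweel (U+0640) fall outside these ranges and are rejected.
ARABIC_WORD = re.compile(r"[\u0621-\u063A\u0641-\u064A\u064B-\u0652]+")


def is_arabic_word(token: str) -> bool:
    """Return True if every character of the token is in the assumed ranges."""
    return ARABIC_WORD.fullmatch(token) is not None
```

Under this approximation, words such as الله pass, while digits, punctuation, and Latin-script tokens are rejected. The real lists may have been built with different or stricter rules.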

All files are tab-separated: the first column is the word in Arabic script and the second column is its frequency. (Because the text mixes right-to-left and left-to-right directions, the two columns may be displayed in reverse order.) See the following example:

في	16664531
من	15695517
بن	13571947
الله	11433931
عن	9140820
المستعان	6285
الورقة	6284
الروياني	6284
الثريا	6283
يسافر	6283
فكعمرة	4
فكعرض	4
فكضامن	4
فكرؤيته	4
فكتفريق	4

من	127245884
في	101567242
الله	72525262
علي	65410197
لا	52420507
قضيته	70256
دره	70235
تعطيك	70226
تهديد	70216
الاوراق	70213
هالمكااان	35
هالشوز	35
هالرغد	35
هالثبات	35
نننس	35

في	255725161
من	205864175
على	122591931
و	68783652
أن	64519408
السائل	128423
ثانوى	128420
الحيوانية	128417
نزيف	128393
عصابة	128386
سهرن	52
ستنسيه	52
ستمتلكه	52
ستكفينا	52
ستضره	52

في	373956934
من	348805576
على	132084198
و	121102569
الله	111745498
وفدا	213505
المنافقين	213483
البيلاروسي	213461
الطيبين	213441
اساسي	213409
كهلون	91
كفعال	91
كعروة	91
كالوفرة	91
كالمستهزىء	91
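
The two-column, tab-separated layout shown above can be read with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and UTF-8 encoding is assumed rather than stated by the dataset.

```python
from typing import Dict, Iterable


def parse_frequency_lines(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'word<TAB>frequency' lines into a word -> frequency mapping."""
    freqs: Dict[str, int] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # The word is the first column and the count is the second,
        # regardless of how a bidirectional display renders the order.
        word, count = line.split("\t")
        freqs[word] = int(count)
    return freqs


def load_frequency_list(path: str) -> Dict[str, int]:
    """Read one frequency list file (assumed to be UTF-8 encoded)."""
    with open(path, encoding="utf-8") as handle:
        return parse_frequency_lines(handle)
```

For example, `load_frequency_list("freq_list.tsv")` (a placeholder file name) would return a dictionary in which the frequency of a word is looked up by its Arabic-script form.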


    title = "The Interplay of Variant, Size, and Task Type in {A}rabic Pre-trained Language Models",
    author = "Inoue, Go  and
      Alhafni, Bashar  and
      Baimukan, Nurpeiis  and
      Bouamor, Houda  and
      Habash, Nizar",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Online)",
    publisher = "Association for Computational Linguistics",
