trigrams
Trigrams for 460+ languages.
What is this?
This package exposes all trigrams for natural languages.
Based on the most translated copyright-free document on this planet: UDHR.
When should I use this?
When you are dealing with natural language detection.
Install
This package is ESM only.
In Node.js (version 14.14+, 16.0+), install with npm:
npm install trigrams
In Deno with esm.sh:
import {top, min} from 'https://esm.sh/trigrams@5'
In browsers with esm.sh:
<script type="module">
import {top, min} from 'https://esm.sh/trigrams@5?bundle'
</script>
Use
import {top, min} from 'trigrams'
console.log((await top()).pam)
console.log((await min()).nld)
Yields:
{ // 300 top trigrams.
'isa': 6,
'upa': 6,
'i k': 6,
// …
'ang': 273,
'ing': 282,
'ng ': 572 // Most common trigram with how often it was found.
}
[ // 300 top trigrams.
' ar',
'eer',
'tij',
// …
'de ',
'an ',
'en ' // Most common trigram.
]
API
This package exports the identifiers top and min.
There is no default export.
top()
Get the top trigrams mapped to their occurrence counts.
Returns
Returns a promise resolving to an object mapping UDHR in Unicode
codes to objects mapping the top 300 trigrams to occurrence counts
(Promise<Record<string, Record<string, number>>>).
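As an illustration of working with that shape, here is a minimal sketch that lists the most frequent trigrams for one language.
The helper name mostCommon is hypothetical and not part of this package; it only assumes an object mapping trigrams to counts, such as one language entry from the resolved value of top().

```javascript
// Hypothetical helper, not an export of this package.
// `counts` is assumed to be one language entry from the object that `top()`
// resolves to: an object mapping trigrams to occurrence counts.
function mostCommon(counts, limit = 10) {
  return Object.entries(counts)
    // Sort by count, highest first.
    .sort((left, right) => right[1] - left[1])
    .slice(0, limit)
}

// A few entries copied from the `pam` example output above (not the full data).
const sample = {isa: 6, ang: 273, ing: 282, 'ng ': 572}

console.log(mostCommon(sample, 2))
// → [['ng ', 572], ['ing', 282]]
```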
min()
Get the top trigrams, without occurrence counts.
Returns
Returns a promise resolving to an object mapping UDHR in Unicode
codes to arrays containing the top 300 trigrams, sorted from least
occurring to most occurring
(Promise<Record<string, Array<string>>>).
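Ranked lists like these can be compared with a rank distance for language detection.
The sketch below uses the "out-of-place" measure, a common technique that this package does not provide itself.
The names ranks, distance, and closest, and the penalty of 300 for missing trigrams, are assumptions made for illustration.

```javascript
// Hypothetical helpers, not exports of this package.

// Map each trigram to its rank, where 0 is the most common one.
// The input is assumed to be sorted from least to most occurring,
// like the arrays resolved by `min()`.
function ranks(trigrams) {
  const result = new Map()
  const last = trigrams.length - 1
  trigrams.forEach((trigram, index) => {
    result.set(trigram, last - index)
  })
  return result
}

// Sum of rank differences between two lists.
// Trigrams missing from the model cost `penalty` (an assumed value).
function distance(input, model, penalty = 300) {
  const modelRanks = ranks(model)
  let total = 0
  for (const [trigram, rank] of ranks(input)) {
    total += modelRanks.has(trigram)
      ? Math.abs(rank - modelRanks.get(trigram))
      : penalty
  }
  return total
}

// Return the code of the model with the smallest distance to the input.
// `models` is assumed to have the shape of the object resolved by `min()`.
function closest(input, models) {
  let best
  let bestDistance = Infinity
  for (const [code, model] of Object.entries(models)) {
    const current = distance(input, model)
    if (current < bestDistance) {
      best = code
      bestDistance = current
    }
  }
  return best
}
```

Here, input would be the ranked trigrams of the text to classify, prepared the same way as the data in this package, and models the value resolved by min().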
Data
The trigrams are based on the Unicode versions of the Universal Declaration
of Human Rights.
The files are created from all paragraphs made available by wooorm/udhr
and do not include headings and such.
Before creating trigrams:
- the Unicode characters from \u0021 to \u0040 (both inclusive) are removed
- one or more white space characters (\s+) are replaced with a single space
- alphabetic characters ([A-Z]) are lower cased
Additionally, the input is padded with two spaces on both sides.
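The steps above can be sketched as follows.
This is a minimal illustration that follows the description literally, not the code used to build the data; the function names prepare and countTrigrams are hypothetical.

```javascript
// Hypothetical helpers, not exports of this package.

// Apply the cleaning steps as described above, then pad with two spaces.
function prepare(value) {
  const cleaned = value
    // Remove the characters from U+0021 to U+0040 (punctuation and digits).
    .replace(/[\u0021-\u0040]/g, '')
    // Replace runs of white space with a single space.
    .replace(/\s+/g, ' ')
    // Lower case the characters matched by [A-Z].
    .replace(/[A-Z]/g, (character) => character.toLowerCase())

  return '  ' + cleaned + '  '
}

// Count every three-character window of the prepared input.
function countTrigrams(value) {
  const padded = prepare(value)
  const counts = {}

  for (let index = 0; index + 3 <= padded.length; index++) {
    const trigram = padded.slice(index, index + 3)
    counts[trigram] = (counts[trigram] || 0) + 1
  }

  return counts
}

console.log(prepare('Hello, World!'))
// → '  hello world  '
console.log(countTrigrams('Banana').ana)
// → 2
```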
<!--support start-->
Code | Name |
--- | --- |
007 | Sãotomense |
008 | Crioulo, Upper Guinea (008) |
009 | Mbundu (009) |
010 | Tetun Dili |
011 | Umbundu (011) |
013 | (Mijisa) |
014 | (Maiunan) |
016 | (Minjiang, spoken) |
017 | (Minjiang, written) |
020 | Drung |
021 | (Muzzi) |
022 | (Klau) |
025 | (Bizisa) |
026 | (Yeonbyeon) |
027 | Gumuz |
028 | Kafa |
029 | Sidamo |
030 | Kituba (2) |
032 | South Azerbaijani |
041 | Latvian (2) |
042 | Spanish (resolution) |
043 | Zarma |
aar | Afar |
abk | Abkhaz |
ace | Aceh |
acu | Achuar-Shiwiar |
acu_1 | Achuar-Shiwiar (1) |
ada | Dangme |
ady | Adyghe |
afr | Afrikaans |
agr | Aguaruna |
aii | Assyrian Neo-Aramaic |
ajg | Aja |
aka_akuapem | Twi (Akuapem) |
aka_asante | Twi (Asante) |
aka_fante | Fante |
als | Albanian, Tosk |
alt | Altai, Southern |
amc | Amahuaca |
ame | Yaneshaʼ |
amh | Amharic |
ami | Amis |
amr | Amarakaeri |
arb | Arabic, Standard |
arl | Arabela |
arn | Mapudungun |
ast | Asturian |
auc | Waorani |
auv | Occitan (Auvergnat) |
ayr | Aymara, Central |
azj_cyrl | Azerbaijani, North (Cyrillic) |
azj_latn | Azerbaijani, North (Latin) |
bam | Bamanankan |
ban | Bali |
bax | Bamun |
bba | Baatonum |
bci | Baoulé |
bcl | Bicolano, Central |
bel | Belarusan |
bem | Bemba |
ben | Bengali |
bfa | Bari |
bho | Bhojpuri |
bin | Edo |
bis | Bislama |
blt | Tai Dam |
blu | Hmong Njua |
boa | Bora |
bod | Tibetan, Central |
bos_cyrl | Bosnian (Cyrillic) |
bos_latn | Bosnian (Latin) |
bre | Breton |
btb | Bulu |
buc | Bushi |
bug | Bugis |
bul | Bulgarian |
cab | Garifuna |
cak | Kaqchikel, Central |
cat | Catalan-Valencian-Balear |
cbi | Chachi |
cbr | Cashibo-Cacataibo |
cbs | Cashinahua |
cbt | Chayahuita |
cbu | Candoshi-Shapra |
ccx | Zhuang, Yongbei |
ceb | Cebuano |
ces | Czech |
cha | Chamorro |
chj | Chinantec, Ojitlán |
chk | Chuukese |
chr_cased | Cherokee (cased) |
chr_uppercase | Cherokee (uppercase) |
chv | Chuvash |
cic | Chickasaw |
cjk | Chokwe |
cjk_AO | Chokwe (Angola) |
cjs | Shor |
ckb | Kurdish, Central |
cnh | Chin, Haka |
cni | Asháninka |
cnr | Montenegrin |
cof | Colorado |
cos | Corsican |
cot | Caquinte |
cpu | Ashéninka, Pichis |
crh | Crimean Tatar |
crs | Seselwa Creole French |
csa | Chinantec, Chiltepec |
csw | Cree, Swampy |
ctd | Chin, Tedim |
cym | Welsh |
dag | Dagbani |
dan | Danish |
ddn | Dendi |
deu_1901 | German, Standard (1901) |
deu_1996 | German, Standard (1996) |
dga | Dagaare, Southern |
dip | Dinka, Northeastern |
div | Maldivian |
dyo | Jola-Fonyi |
dyu | Jula |
dzo | Dzongkha |
ell_monotonic | Greek (monotonic) |
ell_polytonic | Greek (polytonic) |
emk | Maninkakan, Eastern |
eml | Romagnolo |
eng | English |
epo | Esperanto |
ese | Ese Ejja |
est | Estonian |
eus | Basque |
eve | Even |
evn | Evenki |
ewe | Éwé |
fao | Faroese |
fij | Fijian |
fin | Finnish |
fkv | Finnish, Kven |
flm | Chin, Falam |
fon | Fon |
fra | French |
fri | Frisian, Western |
fuf | Pular |
fur | Friulian |
fuv | Fulfulde, Nigerian |
fuv2 | Fulfulde, Nigerian (2) |
fvr | Fur |
gaa | Ga |
gag | Gagauz |
gax | Oromo, Borana-Arsi-Guji |
gjn | Gonja |
gkp | Kpelle, Guinea |
gla | Gaelic, Scottish |
gld | Nanai |
gle | Gaelic, Irish |
glg | Galician |
glv | Manx |
gsw1 | Alemannisch (Elsassisch) |
guc | Wayuu |
gug | Guaraní, Paraguayan |
guj | Gujarati |
guu | Yanomamö |
gyr | Guarayu |
hat_kreyol | Haitian Creole French (Kreyol) |
hat_popular | Haitian Creole French (Popular) |
hau_NE | Hausa (Niger) |
hau_NG | Hausa (Nigeria) |
hau_3 | Hausa |
haw | Hawaiian |
hea | Hmong, Northern Qiandong |
heb | Hebrew |
hil | Hiligaynon |
hin | Hindi |
hlt | Chin, Matu |
hms | Hmong, Southern Qiandong |
hna | Gen |
hni | Hani |
hns | Hindustani, Sarnami |
hrv | Croatian |
hsb | Sorbian, Upper |
hsf | Huastec (Sierra de Otontepec) |
hun | Hungarian |
hus | Huastec (Veracruz) |
huu | Huitoto, Murui |
hva | Huastec (San Luís Potosí) |
hye | Armenian |
ibb | Ibibio |
ibo | Igbo |
ido | Ido |
idu | Idoma |
ijs | Ijo, Southeast |
ike | Inuktitut, Eastern Canadian |
ilo | Ilocano |
ina | Interlingua |
ind | Indonesian |
isl | Icelandic |
ita | Italian |
jav | Javanese (Latin) |
jav_java | Javanese (Javanese) |
jiv | Shuar |
jpn | Japanese |
jpn_osaka | Japanese (Osaka) |
jpn_tokyo | Japanese (Tokyo) |
kaa | Karakalpak |
kal | Inuktitut, Greenlandic |
kan | Kannada |
kat | Georgian |
kaz | Kazakh |
kbd | Kabardian |
kbp | Kabiyé |
kde | Makonde |
kdh | Tem |
kea | Kabuverdianu |
kek | Q'eqchi' |
kha | Khasi |
khk | Mongolian, Halh (Cyrillic) |
khm | Khmer, Central |
kin | Rwanda |
kir | Kirghiz |
kjh | Khakas |
kkh_lana | Khün |
kmb | Mbundu |
kmr | Kurdish, Northern |
knc | Kanuri, Central |
kng | Koongo |
kng_AO | Koongo (Angola) |
koi | Komi-Permyak |
koo | Konjo |
kor | Korean |
kqn | Kaonde |
kqs | Kissi, Northern |
kri | Krio |
krl | Karelian |
ktu | Kituba |
kwi | Awa-Cuaiquer |
lad | Ladino |
lao | Lao |
lat | Latin |
lat_1 | Latin (1) |
lav | Latvian |
lia | Limba, West-Central |
lij | Ligurian |
lin | Lingala |
lin_tones | Lingala (tones) |
lit | Lithuanian |
lld | Ladin |
lnc | Occitan (Languedocien) |
lns | Lamnso' |
lob | Lobi |
lot | Otuho |
loz | Lozi |
ltz | Luxembourgeois |
lua | Luba-Kasai |
lue | Luvale |
lug | Ganda |
lun | Lunda |
lus | Mizo |
mad | Madura |
mag | Magahi |
mah | Marshallese |
mai | Maithili |
mal | Malayalam |
mal_chillus | Malayalam |
mam | Mam, Northern |
mar | Marathi |
maz | Mazahua Central |
mcd | Sharanahua |
mcf | Matsés |
men | Mende |
mfq | Moba |
mic | Micmac |
min | Minangkabau |
miq | Mískito |
mkd | Macedonian |
mlt | Maltese |
mly_arab | Malay (Arabic) |
mly_latn | Malay (Latin) |
mnw | Mon |
mor | Moro |
mos | Mòoré |
mri | Maori |
mto | Mixe, Totontepec |
mxi | Mozarabic |
mxv | Mixtec, Metlatónoc |
mya | Burmese |
mzi | Mazatec, Ixcatlán |
nav | Navajo |
nba | Nyemba |
nbl | Ndebele |
ndo | Ndonga |
nds | Saxon, Low |
nep | Nepali |
nhn | Nahuatl, Central |
nio | Nganasan |
niu | Niue |
niv | Gilyak |
njo | Naga, Ao |
nku | Kulango, Bouna |
nld | Dutch |
nno | Norwegian, Nynorsk |
nob | Norwegian, Bokmål |
not | Nomatsiguenga |
nso | Sotho, Northern |
nya_chechewa | Nyanja (Chechewa) |
nya_chinyanja | Nyanja (Chinyanja) |
nym | Nyamwezi |
nyn | Nyankore |
nzi | Nzema |
oaa | Orok |
oci_1 | Occitan (Francoprovençal, Fribourg) |
oci_2 | Occitan (Francoprovençal, Savoie) |
oci_3 | Occitan (Francoprovençal, Vaud) |
oci_4 | Occitan (Francoprovençal, Valais) |
ojb | Ojibwa, Northwestern |
oki | Okiek |
orh | Oroqen |
oss | Osetin |
ote | Otomi, Mezquital |
pam | Pampangan |
pan | Panjabi, Eastern |
pap | Papiamentu |
pau | Palauan |
pbb | Páez |
pbu | Pashto, Northern |
pcd | Picard |
pcm | Pidgin, Nigerian |
pes_1 | Farsi, Western |
pes_2 | Dari |
pis | Pijin |
piu | Pintupi-Luritja |
plt | Malagasy, Plateau |
pnb | Panjabi, Western |
pol | Polish |
pon | Pohnpeian |
por_BR | Portuguese (Brazil) |
por_PT | Portuguese (Portugal) |
pov | Crioulo, Upper Guinea |
ppl | Pipil |
prv | Occitan |
quc | K'iche', Central |
qud | Quechua (Unified Quichua, old Hispanic orthography) |
qug | Quichua, Chimborazo Highland |
quy | Quechua, Ayacucho |
quz | Quechua, Cusco |
qva | Quechua, Ambo-Pasco |
qvc | Quechua, Cajamarca |
qvh | Quechua, Huamalíes-Dos de Mayo Huánuco |
qvm | Quechua, Margos-Yarowilca-Lauricocha |
qvn | Quechua, North Junín |
qwh | Quechua, Huaylas Ancash |
qxa | Quechua, South Bolivian |
qxn | Quechua, Northern Conchucos Ancash |
qxu | Quechua, Arequipa-La Unión |
rar | Rarotongan |
rmn | Romani, Balkan |
rmn_1 | Romani, Balkan (1) |
rmy | Aromanian |
roh | Romansch |
roh_puter | Romansch (Puter) |
roh_rumgr | Romansch (Grischun) |
roh_surmiran | Romansch (Surmiran) |
roh_sursilv | Romansch (Sursilvan) |
roh_sutsilv | Romansch (Sutsilvan) |
roh_vallader | Romansch (Vallader) |
ron_1953 | Romanian (1953) |
ron_1993 | Romanian (1993) |
ron_2006 | Romanian (2006) |
run | Rundi |
rus | Russian |
sag | Sango |
sah | Yakut |
san | Sanskrit |
sco | Scots |
sey | Secoya |
shk | Shilluk |
shn | Shan |
shp | Shipibo-Conibo |
sin | Sinhala |
skr | Seraiki |
slk | Slovak |
slr | Salar |
slv | Slovenian |
sme | Saami, North |
smo | Samoan |
sna | Shona |
snk | Soninke |
snn | Siona |
som | Somali |
sot | Sotho, Southern |
spa | Spanish |
src | Sardinian, Logudorese |
srp_cyrl | Serbian (Cyrillic) |
srp_latn | Serbian (Latin) |
srr | Serer-Sine |
ssw | Swati |
suk | Sukuma |
sun | Sunda |
sus | Susu |
swb | Comorian, Maore |
swe | Swedish |
swh | Swahili |
tah | Tahitian |
tam | Tamil |
tam_LK | Tamil (Sri Lanka) |
tat | Tatar |
tbz | Ditammari |
tca | Ticuna |
tel | Telugu |
tem | Themne |
tet | Tetun |
tgk | Tajiki |
tgl | Tagalog |
tha | Thai |
tha2 | Thai (2) |
tir | Tigrigna |
tiv | Tiv |
tly | Talysh |
tob | Toba |
toi | Tonga |
toj | Tojolabal |
ton | Tongan |
top | Totonac, Papantla |
tpi | Tok Pisin |
tsn | Tswana |
tso_MZ | Tsonga (Mozambique) |
tso_ZW | Tsonga (Zimbabwe) |
tsz | Purepecha |
tuk_cyrl | Turkmen (Cyrillic) |
tuk_latn | Turkmen (Latin) |
tur | Turkish |
tyv | Tuva |
tzc | Tzotzil (Chamula) |
tzh | Tzeltal, Oxchuc |
tzm | Tamazight, Central Atlas |
udu | Uduk |
uig_arab | Uyghur (Arabic) |
uig_latn | Uyghur (Latin) |
ukr | Ukrainian |
umb | Umbundu |
ura | Urarina |
urd | Urdu |
urd_2 | Urdu (2) |
uzn_cyrl | Uzbek, Northern (Cyrillic) |
uzn_latn | Uzbek, Northern (Latin) |
vai | Vai |
vec | Venetian |
ven | Venda |
ven2 | Venda |
vep | Veps |
vie | Vietnamese |
vmw | Makhuwa |
war | Waray-Waray |
wln | Walloon |
wol | Wolof |
wwa | Waama |
xho | Xhosa |
xsm | Kasem |
yad | Yagua |
yao | Yao |
yap | Yapese |
ydd | Yiddish, Eastern |
ykg | Yukaghir, Northern |
yor | Yoruba |
yrk | Nenets |
yua | Maya, Yucatán |
zam | Zapotec, Miahuatlán |
zdj | Comorian, Ngazidja |
zgh | Tamazight, Standard Moroccan |
zro | Záparo |
ztu | Zapotec, Güilá |
zul | Zulu |
<!--support end-->
Types
This package is fully typed with TypeScript.
It exports no additional types.
Compatibility
This package is at least compatible with all maintained versions of Node.js.
As of now, that is Node.js 14.14+ and 16.0+.
It also works in Deno and modern browsers.
Contribute
Yes please!
See How to Contribute to Open Source.
Security
This package is safe.
License
MIT © Titus Wormer