Efficient Language Detector

Efficient Language Detector (Nito-ELD or ELD) is fast and accurate natural language detection software, written 100% in PHP, with speed comparable to fast compiled C++ detectors and accuracy within the range of the best detectors to date.

It has no dependencies and is easy to install; all it needs is PHP with the mb (mbstring) extension.
Older versions of ELD are also available in JavaScript and Python.

  1. Installation
  2. How to use
  3. Benchmarks
  4. Databases
  5. Testing
  6. Languages

Changes from ELD v2 to v3:

Installation

$ composer require nitotm/efficient-language-detector

Configuration

Using OPcache is recommended, especially for the larger databases, to reduce load times.
opcache.interned_strings_buffer and opcache.memory_consumption must be set high enough for each database.
Recommended values are shown in parentheses. Check Databases for more info.

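| php.ini setting (MB) | Small      | Medium      | Large       | Extralarge  |
|:---------------------|:----------:|:-----------:|:-----------:|:-----------:|
| memory_limit         | >= 128     | >= 340      | >= 1060     | >= 2200     |
| opcache.interned...  | >= 8 (16)  | >= 16 (32)  | >= 60 (70)  | >= 116 (128) |
| opcache.memory       | >= 64 (128) | >= 128 (230) | >= 360 (450) | >= 750 (820) |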
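
The minimums in the table can be checked programmatically before picking a database. The following is a minimal sketch, not part of ELD: the constant `ELD_MINIMUMS_MB`, the function `settingsBelowMinimum()` and the array keys are hypothetical names, and the values are assumed to be megabytes.

```php
<?php
// Minimum values in MB, copied from the table above.
// The array keys are illustrative labels, not php.ini directive names.
const ELD_MINIMUMS_MB = [
    'small'      => ['memory_limit' => 128,  'interned_strings' => 8,   'opcache_memory' => 64],
    'medium'     => ['memory_limit' => 340,  'interned_strings' => 16,  'opcache_memory' => 128],
    'large'      => ['memory_limit' => 1060, 'interned_strings' => 60,  'opcache_memory' => 360],
    'extralarge' => ['memory_limit' => 2200, 'interned_strings' => 116, 'opcache_memory' => 750],
];

/**
 * Lists the settings whose current value (in MB) is below the minimum
 * for the given database. An empty array means every minimum is met.
 */
function settingsBelowMinimum(string $database, array $currentMb): array
{
    $tooLow = [];
    foreach (ELD_MINIMUMS_MB[$database] as $setting => $minimumMb) {
        // A missing setting is treated as 0 MB, so it is always reported.
        if (($currentMb[$setting] ?? 0) < $minimumMb) {
            $tooLow[] = $setting;
        }
    }
    return $tooLow;
}

// A server tuned for the small database, checked against medium.
$current = ['memory_limit' => 128, 'interned_strings' => 16, 'opcache_memory' => 128];
echo implode(', ', settingsBelowMinimum('medium', $current)), PHP_EOL; // memory_limit
```

On a live server the OPcache values could be read with `ini_get('opcache.interned_strings_buffer')` and `ini_get('opcache.memory_consumption')`, both of which are configured in megabytes.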

How to use?

detect() expects a UTF-8 string and returns an object with a language property, containing an ISO 639-1 code (or other selected format), or 'und' for undetermined language.

// require_once 'manual_loader.php'; To load ELD without autoloader. Update path.
use Nitotm\Eld\{LanguageDetector, EldDataFile, EldFormat};

// LanguageDetector(databaseFile: ?string, outputFormat: ?string)
$eld = new LanguageDetector(EldDataFile::SMALL, EldFormat::ISO639_1);
// Database files: 'small', 'medium', 'large', 'extralarge'. Check memory requirements
// Formats: 'ISO639_1', 'ISO639_2T', 'ISO639_1_BCP47', 'ISO639_2T_BCP47' and 'FULL_TEXT'
// Constants are not mandatory, LanguageDetector('small', 'ISO639_1'); will also work

$eld->detect('Hola, cómo te llamas?');
// object( language => string, scores() => array<string, float>, isReliable() => bool )
// ( language => 'es', scores() => ['es' => 0.25, 'nl' => 0.05], isReliable() => true )

$eld->detect('Hola, cómo te llamas?')->language;
// 'es'
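
Since detect() can return 'und' and exposes isReliable(), callers will often want a fallback. A possible sketch follows; `languageOrFallback()` is a hypothetical helper, not part of ELD, and it only relies on the `language` property and `isReliable()` method shown above.

```php
<?php
/**
 * Returns the detected code, or $fallback when the result is undetermined
 * ('und') or flagged as not reliable. $result is any object shaped like the
 * value returned by detect(): a `language` property and an isReliable() method.
 */
function languageOrFallback(object $result, string $fallback = 'en'): string
{
    if ($result->language === 'und' || !$result->isReliable()) {
        return $fallback;
    }
    return $result->language;
}

// With a real detector (assumes ELD is installed and $eld is a LanguageDetector):
// $code = languageOrFallback($eld->detect('Hola, cómo te llamas?'));
```

Whether an unreliable result should be discarded or kept depends on the application; for short inputs such as single words it may be preferable to inspect scores() instead.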

Language subsets

Calling langSubset() once sets the subset. The first call takes longer because it creates a new database; if the database file is saved (the default), it is loaded the next time the same subset is requested.
To use a subset without additional overhead, instantiate the detector with the file saved and returned by langSubset(). Check available Languages below.

// It always accepts ISO 639-1 codes, as well as the selected output format if different.
// langSubset(languages: [], save: true, encode: true); Will return subset file name if saved
$eld->langSubset(['en', 'es', 'fr', 'it', 'nl', 'de']);
// Object ( success => bool, languages => ?array, error => ?string, file => ?string )
// ( success => true, languages => ['en', 'es'...], error => NULL, file => 'small_6_mfss...' )

// to remove the subset
$eld->langSubset();

// The best and fastest way to use a subset, is to load it just like a default database
$eld_subset = new Nitotm\Eld\LanguageDetector('small_6_mfss5z1t');
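
The create-once, reuse-later flow described above could be wrapped as follows. This is a sketch under assumptions: `subsetFileName()` is a hypothetical helper, and it relies only on the `success`, `error` and `file` fields of the object shown in the comments above.

```php
<?php
/**
 * Asks the detector for a subset and returns the saved database file name,
 * or null when the subset could not be created or was not saved.
 * $detector is any object with a langSubset(array) method shaped like ELD's.
 */
function subsetFileName(object $detector, array $languages): ?string
{
    $result = $detector->langSubset($languages);
    if (!$result->success) {
        // error is documented as ?string; log it rather than failing hard.
        error_log('ELD subset failed: ' . ($result->error ?? 'unknown error'));
        return null;
    }
    return $result->file;
}

// One-off setup step (assumes ELD is installed and $eld is a LanguageDetector):
// $file = subsetFileName($eld, ['en', 'es', 'fr']);
// Later requests can then load the subset directly:
// $fast = new Nitotm\Eld\LanguageDetector($file);
```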

Other Functions

// If enableTextCleanup(true) is set, detect() removes URLs, .com domains, emails, alphanumerical...
// Not recommended, as URLs and domains contain language hints that might help accuracy
$eld->enableTextCleanup(true); // Default is false

// If needed, we can get info of the ELD instance: languages, database type, etc.
$eld->info();

Benchmarks

I compared ELD with a variety of detectors written in different languages, as there are not many in PHP.

| URL                                                      | Version      | Language   |
|:---------------------------------------------------------|:------------:|:----------:|
| https://github.com/nitotm/efficient-language-detector/   | 3.0.0        | PHP        |
| https://github.com/pemistahl/lingua-py                   | 2.0.2        | Python     |
| https://github.com/facebookresearch/fastText             | 0.9.2        | C++        |
| https://github.com/CLD2Owners/cld2                       | Aug 21, 2015 | C++        |
| https://github.com/patrickschur/language-detection       | 5.3.0        | PHP        |
| https://github.com/wooorm/franc                          | 7.2.0        | Javascript |

Benchmarks:

<!--- Time table
|                     | Tatoeba-50   | ELD test     | Sentences    | Word pairs   | Single words |
|:--------------------|:------------:|:------------:|:------------:|:------------:|:------------:|
| **Nito-ELD-S**      |     4.7"     |     1.7"     |     1.4"     |     0.45"    |     0.34"    |
| **Nito-ELD-M**      |     5.2"     |     1.8"     |     1.5"     |     0.47"    |     0.36"    |
| **Nito-ELD-L**      |     4.3"     |     1.5"     |     1.2"     |     0.40"    |     0.32"    |
| **Nito-ELD-XL**     |     4.6"     |     1.6"     |     1.3"     |     0.42"    |     0.33"    |
| **Lingua**          |      98"     |      27"     |      24"     |     8.2"     |     5.9"     |
| **fasttext-subset** |      12"     |     2.7"     |     2.3"     |     1.2"     |     1.1"     |
| **fasttext-all**    |      --      |     2.4"     |     2.0"     |     0.91"    |     0.73"    |
| **CLD2**            |     3.5"     |     0.71"    |     0.59"    |     0.35"    |     0.32"    |
| **Lingua-low**      |      37"     |      13"     |      11"     |     3.0"     |     2.3"     |
| **patrickschur**    |     227"     |      74"     |      63"     |      18"     |      11"     |
| **franc**           |      43"     |      10"     |      9"      |     4.1"     |     3.2"     |
-->
<img alt="time table" width="800" src="https://raw.githubusercontent.com/nitotm/efficient-language-detector/main/misc/table_time_v3.svg">

<!-- Accuracy table
|                     | Tatoeba-50 | ELD test     | Sentences    | Word pairs   | Single words |
|:--------------------|:----------:|:------------:|:------------:|:------------:|:------------:|
| **Nito-ELD-S**      |   96.8%    |    99.7%     |    99.2%     |    90.9%     |    75.1%     |
| **Nito-ELD-M**      |   97.9%    |    99.7%     |    99.3%     |    93.0%     |    80.1%     |
| **Nito-ELD-L**      |   98.3%    |    99.8%     |    99.4%     |    94.8%     |    83.5%     |
| **Nito-ELD-XL**     |   98.5%    |    99.8%     |    99.5%     |    95.4%     |    85.1%     |
| **Lingua**          |   96.1%    |    99.2%     |    98.7%     |    93.4%     |    80.7%     |
| **fasttext-subset** |   94.1%    |    98.0%     |    97.9%     |    83.1%     |    67.8%     |
| **fasttext-all**    |     --     |    97.4%     |    97.6%     |    81.5%     |    65.7%     |
| **CLD2** *          |  92.1% *   |    98.1%     |    97.4%     |    85.6%     |    70.7%     |
| **Lingua-low**      |   89.3%    |    97.3%     |    96.3%     |    84.1%     |    68.6%     |
| **patrickschur**    |   84.1%    |    94.8%     |    93.6%     |    71.9%     |    57.1%     |
| **franc**           |   76.9%    |    93.8%     |    92.3%     |    67.0%     |    53.8%     |
-->
<img alt="accuracy table" width="800" src="https://raw.githubusercontent.com/nitotm/efficient-language-detector/main/misc/table_accuracy_v3.svg">

Databases

|                            | Small          | Medium             | Large        | Extralarge    |
|:---------------------------|:--------------:|:------------------:|:------------:|:-------------:|
| Pros                       | Lowest memory  | Balanced           | Fastest      | Most accurate |
| Cons                       | Least accurate | Slowest (but fast) | High memory  | Highest memory |
| File size                  | 3 MB           | 10 MB              | 32 MB        | 71 MB         |
| Memory usage               | 76 MB          | 280 MB             | 977 MB       | 2083 MB       |
| Memory usage Cached        | 0.4 MB + OP    | 0.4 MB + OP        | 0.4 MB + OP  | 0.4 MB + OP   |
| OPcache used memory        | 21 MB          | 69 MB              | 244 MB       | 539 MB        |
| OPcache used interned      | 4 MB           | 10 MB              | 45 MB        | 98 MB         |
| Load time Uncached         | 0.14 sec       | 0.5 sec            | 1.5 sec      | 3.4 sec       |
| Load time Cached           | 0.0002 sec     | 0.0002 sec         | 0.0002 sec   | 0.0002 sec    |
| **Settings (Recommended)** |                |                    |              |               |
| memory_limit               | >= 128         | >= 340             | >= 1060      | >= 2200       |
| opcache.interned...*       | >= 8 (16)      | >= 16 (32)         | >= 60 (70)   | >= 116 (128)  |
| opcache.memory             | >= 64 (128)    | >= 128 (230)       | >= 360 (450) | >= 750 (820)  |
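
When memory is the deciding factor, the uncached "Memory usage" row can drive the choice of database. A rough sketch follows; `ELD_MEMORY_USAGE_MB` and `largestDatabaseWithin()` are hypothetical names, not part of ELD, and the budget is whatever amount of RAM you are willing to give the detector.

```php
<?php
// Uncached memory usage in MB per database, copied from the table above,
// ordered from smallest to largest.
const ELD_MEMORY_USAGE_MB = [
    'small'      => 76,
    'medium'     => 280,
    'large'      => 977,
    'extralarge' => 2083,
];

/**
 * Returns the name of the largest database whose uncached memory usage
 * fits within $budgetMb, or null if even the small database does not fit.
 */
function largestDatabaseWithin(int $budgetMb): ?string
{
    $choice = null;
    foreach (ELD_MEMORY_USAGE_MB as $name => $usageMb) {
        if ($usageMb <= $budgetMb) {
            $choice = $name; // keep the last (largest) database that fits
        }
    }
    return $choice;
}

echo largestDatabaseWithin(512), PHP_EOL; // medium
```

The returned name matches the database strings accepted by the LanguageDetector constructor ('small', 'medium', 'large', 'extralarge'). Note that memory_limit needs headroom above these figures, as the settings rows in the table indicate.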

Testing

Default composer install might not include these files. Use --prefer-source to include them.

new Nitotm\Eld\Tests\TestsAutoload();
$ php efficient-language-detector/tests/tests.php # Update path

Languages

ISO 639-1 codes (format 'ISO639_1'):

am, ar, az, be, bg, bn, ca, cs, da, de, el, en, es, et, eu, fa, fi, fr, gu, he, hi, hr, hu, hy, is, it, ja, ka, kn, ko, ku, lo, lt, lv, ml, mr, ms, nl, no, or, pa, pl, pt, ro, ru, sk, sl, sq, sr, sv, ta, te, th, tl, tr, uk, ur, vi, yo, zh

Full language names (format 'FULL_TEXT'):

Amharic, Arabic, Azerbaijani (Latin), Belarusian, Bulgarian, Bengali, Catalan, Czech, Danish, German, Greek, English, Spanish, Estonian, Basque, Persian, Finnish, French, Gujarati, Hebrew, Hindi, Croatian, Hungarian, Armenian, Icelandic, Italian, Japanese, Georgian, Kannada, Korean, Kurdish (Arabic), Lao, Lithuanian, Latvian, Malayalam, Marathi, Malay (Latin), Dutch, Norwegian, Oriya, Punjabi, Polish, Portuguese, Romanian, Russian, Slovak, Slovene, Albanian, Serbian (Cyrillic), Swedish, Tamil, Telugu, Thai, Tagalog, Turkish, Ukrainian, Urdu, Vietnamese, Yoruba, Chinese

ISO 639-1 codes with BCP 47 script tags (format 'ISO639_1_BCP47'):

am, ar, az-Latn, be, bg, bn, ca, cs, da, de, el, en, es, et, eu, fa, fi, fr, gu, he, hi, hr, hu, hy, is, it, ja, ka, kn, ko, ku-Arab, lo, lt, lv, ml, mr, ms-Latn, nl, no, or, pa, pl, pt, ro, ru, sk, sl, sq, sr-Cyrl, sv, ta, te, th, tl, tr, uk, ur, vi, yo, zh

ISO 639-2/T codes (format 'ISO639_2T'):

amh, ara, aze, bel, bul, ben, cat, ces, dan, deu, ell, eng, spa, est, eus, fas, fin, fra, guj, heb, hin, hrv, hun, hye, isl, ita, jpn, kat, kan, kor, kur, lao, lit, lav, mal, mar, msa, nld, nor, ori, pan, pol, por, ron, rus, slk, slv, sqi, srp, swe, tam, tel, tha, tgl, tur, ukr, urd, vie, yor, zho


Donations and suggestions

If you wish to donate for open source improvements, hire me for private modifications, request alternative dataset training, or contact me, please use the following link: https://linktr.ee/nitotm