E4MT

An essential text processor for machine translation tasks.

A Python wrapper for E4MT is also available (https://github.com/Targoman/E4MTPy).

Motivation

Text preparation is an essential step in training NLP models. Depending on the task, it can include one or more of the following: parsing, POS tagging, lemmatizing, tokenizing, etc. Persian text requires a series of basic preparations, and character normalization is one of them. Because the Persian alphabet, like that of many other Middle Eastern languages, is based on the Arabic alphabet, many non-Persian characters have found their way into written Persian text. As a result, the same Persian entity can have several written forms, which enlarges the vocabulary size of the Persian language.
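To make the normalization problem concrete, here is a minimal Python sketch that maps two Arabic code points commonly found in Persian text to their Persian counterparts. The mapping table and the function name `normalize_chars` are assumptions made for illustration only; they are not E4MT's rule set or API.

```python
# Illustrative only: a tiny subset of Arabic-to-Persian character mappings.
# This table is an assumption for demonstration, not E4MT's actual rule set.
ARABIC_TO_PERSIAN = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}

_TABLE = str.maketrans(ARABIC_TO_PERSIAN)


def normalize_chars(text: str) -> str:
    """Replace known Arabic code points with their Persian equivalents."""
    return text.translate(_TABLE)


# The word for "book" typed with Arabic KAF and with Persian KEHEH:
# two spellings of the same word.
arabic_spelling = "\u0643\u062A\u0627\u0628"
persian_spelling = "\u06A9\u062A\u0627\u0628"

vocab_before = {arabic_spelling, persian_spelling}
vocab_after = {normalize_chars(word) for word in vocab_before}
print(len(vocab_before), len(vocab_after))
```

Before normalization the two spellings count as separate vocabulary entries; after it they collapse into one.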

Another preparation is related to the polymorphic nature of the Persian language. Persian is an agglutinative language and contains a large number of compound words that consist of a base and one or more affixes. Authors write these words in different ways: some use the ZWNJ character (zero-width non-joiner, U+200C) as the delimiter between base and affix, some use a space, and some attach the affixes directly to the base. Again, this causes the same Persian entity to have different written forms. E4MT handles such issues. More details can be found in our paper.
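The delimiter variation can be sketched in a few lines of Python. The example below rewrites "prefix, space, stem" as "prefix, ZWNJ, stem" for a single verb prefix. The prefix list and the function name `join_prefix_with_zwnj` are hypothetical and chosen for illustration; this is not how E4MT is implemented, and the fully concatenated form is deliberately left alone because splitting it reliably would need a lexicon.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner

# Hypothetical, one-entry prefix list for demonstration; not E4MT's rules.
PREFIXES = ("\u0645\u06CC",)  # the verb prefix "mi"


def join_prefix_with_zwnj(text: str) -> str:
    """Rewrite 'prefix<space>stem' as 'prefix<ZWNJ>stem' for known prefixes."""
    for prefix in PREFIXES:
        # The prefix must be a standalone token (start of text or preceded by
        # whitespace) followed by exactly one space and a non-space character.
        pattern = r"(?<!\S)" + re.escape(prefix) + r" (?=\S)"
        text = re.sub(pattern, prefix + ZWNJ, text)
    return text


stem = "\u0631\u0648\u0645"
spaced = "\u0645\u06CC" + " " + stem   # prefix and stem separated by a space
joined = "\u0645\u06CC" + ZWNJ + stem  # prefix and stem separated by ZWNJ

print(join_prefix_with_zwnj(spaced) == joined)
```

Text that already uses ZWNJ passes through unchanged, so the function can be applied repeatedly without side effects.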


Sample code

After building E4MT as described in Setup, run it on Linux with the following commands:

export E4MTBinPath=E4MT/out/bin
export LD_LIBRARY_PATH=$E4MTBinPath/../lib64
$E4MTBinPath/E4MT -f /path/to/input/file -o /path/to/output/file -l fa -c scripts/E4MT.conf

The available options can be listed with:

$E4MTBinPath/E4MT -h
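For scripted use, the invocation above can be wrapped from Python. The sketch below only assembles the command line and environment from the flags shown in this README (`-f`, `-o`, `-l`, `-c`). The helper names `build_e4mt_command` and `run_e4mt` are hypothetical, and this is not the API of the E4MTPy wrapper.

```python
import os
import subprocess
from pathlib import Path


def build_e4mt_command(bin_dir, input_file, output_file, lang="fa",
                       config="scripts/E4MT.conf"):
    """Assemble the E4MT command line using only flags shown in this README."""
    return [
        str(Path(bin_dir) / "E4MT"),
        "-f", str(input_file),
        "-o", str(output_file),
        "-l", lang,
        "-c", str(config),
    ]


def run_e4mt(bin_dir, input_file, output_file, **kwargs):
    """Run E4MT with LD_LIBRARY_PATH pointing at the sibling lib64 directory."""
    env = dict(os.environ)
    # Mirrors `export LD_LIBRARY_PATH=$E4MTBinPath/../lib64` from the commands above.
    env["LD_LIBRARY_PATH"] = str(Path(bin_dir).parent / "lib64")
    cmd = build_e4mt_command(bin_dir, input_file, output_file, **kwargs)
    # check=True raises CalledProcessError if E4MT exits with a non-zero status.
    return subprocess.run(cmd, env=env, check=True)


print(build_e4mt_command("E4MT/out/bin", "in.txt", "out.txt"))
```

Keeping command construction separate from execution makes it possible to inspect or log the exact command before anything is run.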

Features

Setup

Follow the steps below to compile the project:

  1. Clone a fresh copy from GitHub:
    git clone --recursive https://github.com/Targoman/E4MT.git
    
  2. Install Qt5, libxml2 and zlib. On a fresh Ubuntu 20.04 installation, the following command does the job:
    apt install -y build-essential git libqt5core5a libqt5network5 cmake qt5-qmake qtbase5-dev qt5-default libxml2-dev zlib1g-dev
    
    Or, on openSUSE Leap 15, use the following command:
    zypper install -y which libxml2-devel zlib-devel libQt5Core-devel libQt5Network-devel
    
  3. The project is a standard QMake out-of-source build, which can be compiled on Linux with the following commands:
    qmake
    make -j
    
    To compile in debug mode:
    qmake CONFIG+=debug
    make -j
    
    If you don't need server mode:
    qmake QJsonRPC=0
    make -j
    

License

E4MT is published under the terms of the LGPLv3 license.