<!-- - The README.md should correspond to M#2 = let's stick to information about the production release (currently MILESTONE#2); everything beyond that should go into the [Work in progress](#work-in-progress) section (organised by future MILESTONEs). B) Or let's create a link to the appropriate commit that shows the README.md of MILESTONE#X. That way we avoid documenting things twice, which would be silly. C) We develop in a branch and merge at the MILESTONEs. -->

e-magyar text processing system (emtsv)

17 Sep 2019 MILESTONE#4 (production) = xtsv and dummyTagger separated, UDPipe and Hunspell added, many breaking API changes compared to the previous milestone, runnable Docker image, DepToolPy dropped, and many more changes.

If you find a bug, please report it with the exact details.

Citing and License

See docs/cite.bib for BibTeX entries.

If you use the emtsv system, please cite:

Simon Eszter, Indig Balázs, Kalivoda Ágnes, Mittelholcz Iván, Sass Bálint, Vadász Noémi. Újabb fejlemények az e-magyar háza táján. In: Berend Gábor, Gosztolya Gábor, Vincze Veronika (szerk.): MSZNY 2020, XVI. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2020). Szeged: Szegedi Tudományegyetem Informatikai Tanszékcsoport, 29-42.

Balázs Indig, Bálint Sass, Eszter Simon, Iván Mittelholcz, Noémi Vadász, and Márton Makrai: One format to rule them all – The emtsv pipeline for Hungarian. In: Proceedings of the 13th Linguistic Annotation Workshop. Association for Computational Linguistics, 2019, 155-165.

Indig Balázs, Sass Bálint, Simon Eszter, Mittelholcz Iván, Kundráth Péter, Vadász Noémi. emtsv – Egy formátum mind felett. In: Berend Gábor, Gosztolya Gábor, Vincze Veronika (szerk.): MSZNY 2019, XV. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2019). Szeged: Szegedi Tudományegyetem Informatikai Tanszékcsoport, 235-247.

Váradi Tamás, Simon Eszter, Sass Bálint, Mittelholcz Iván, Novák Attila, Indig Balázs: e-magyar – A Digital Language Processing System. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 1307-1312.

Váradi Tamás, Simon Eszter, Sass Bálint, Gerőcs Mátyás, Mittelholcz Iván, Novák Attila, Indig Balázs, Prószéky Gábor, Farkas Richárd, Vincze Veronika: Az e-magyar digitális nyelvfeldolgozó rendszer. In: MSZNY 2017, XIII. Magyar Számítógépes Nyelvészeti Konferencia, Szeged: Szegedi Tudományegyetem Informatikai Tanszékcsoport, 49-60.

If you use the emMorph module, please cite:

Attila Novák, Borbála Siklósi, Csaba Oravecz: A New Integrated Open-source Morphological Analyzer for Hungarian. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris: European Language Resources Association (ELRA).

Attila Novák: A New Form of Humor -- Mapping Constraint-Based Computational Morphologies to a Finite-State Representation. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), Paris: European Language Resources Association (ELRA).

If you use the emTag (PurePos) module, please cite:

Orosz György, Novák Attila: PurePos 2.0: a hybrid tool for morphological disambiguation. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2013), 539-545.

If you use the emBERT module, please cite:

Nemeskey Dávid Márk: Egy emBERT próbáló feladat. In: MSZNY 2020, XVI. Magyar Számítógépes Nyelvészeti Konferencia. Szeged: Szegedi Tudományegyetem Informatikai Tanszékcsoport, 409-418.

If you use the emPreverb or emCompound module, please cite:

Pethő Gergely, Sass Bálint, Kalivoda Ágnes, Simon László, Lipp Veronika: Igekötő-kapcsolás. In: MSZNY 2022, XVIII. Magyar Számítógépes Nyelvészeti Konferencia. Szeged: Szegedi Tudományegyetem Informatikai Intézet, 77-91.

The emtsv system is a replacement for the original https://github.com/nytud/hunlp-GATE system.

emtsv is licensed under LGPL 3.0. The submodules have their own licenses.

Requirements

Installation

Usage

Here we present the usage scenarios. The individual modules are documented in detail below.

To extend the toolchain, add the new modules to config.py.

#1 – by REST API

Start the emtsv server

Use the emtsv server

Given that the server is running:

#2 – directly by CLI

#3 – as Python Library

  1. Install emtsv into the emtsv directory, or make sure the emtsv installation is on the PYTHONPATH environment variable
  2. import emtsv
  3. Now the full API is accessible (see the example below):
    import sys
    from emtsv import build_pipeline, jnius_config, tools, presets, process, pipeline_rest_api, singleton_store_factory
    
    jnius_config.classpath_show_warning = False  # To suppress warning
    
    # Imports end here. Must do only once per Python session
    
    # Set input from any stream or iterable and output stream...
    input_data = sys.stdin
    output_iterator = sys.stdout
    # Raw, or processed TSV input list and output file...
    # input_data = iter(['A kutya', 'elment sétálni.'])  # Raw text line by line
    # Processed data: header and the token POS-tag pairs line by line
    # input_data = iter([['form', 'xpostag'], ['A', '[/Det|Art.Def]'], ['kutya', '[/N][Nom]'], ['elment', '[/V][Pst.NDef.3Sg]'], ['sétálni', '[/V][Inf]'], ['.', '.']])
    # output_iterator = open('output.txt', 'w', encoding='UTF-8')  # File
    # input_data = 'A kutya elment sétálni.'  # Or raw string in any acceptable format.
    
    # Select a predefined task to do or provide your own list of pipeline elements
    used_tools = ['tok', 'morph', 'pos']
    
    conll_comments = True  # Enable the usage of CoNLL comments
    
    # Run the pipeline on input and write result to the output...
    output_iterator.writelines(build_pipeline(input_data, used_tools, tools, presets, conll_comments))
    
    # Alternative: Run a specific tool on an input stream (still in emtsv format).
    # Useful for training a module (see Huntag3 for details).
    # an_inited_tool is a placeholder for an already initialised tool instance:
    # output_iterator.writelines(process(sys.stdin, an_inited_tool))
    
    # Or process individual tokens further... WARNING: The header will be the first item in the iterator!
    for tok in build_pipeline(input_data, used_tools, tools, presets, conll_comments):
        if len(tok) > 1:  # Empty line (='\n') means end of sentence
            form, xpostag, *rest = tok.strip().split('\t')  # Split to the expected columns
    
    # Alternative2: Flask application (REST API)
    singleton_store = singleton_store_factory()
    app = application = pipeline_rest_api(name='e-magyar-tsv', available_tools=tools, presets=presets,
                                conll_comments=conll_comments, singleton_store=singleton_store,
                                form_title='e-magyar text processing system',
                                doc_link='https://github.com/nytud/emtsv')
    # And run the Flask debug server separately
    app.run()
    

The public API is equivalent to the xtsv API.

Data format

For the detailed specification, see the xtsv documentation.

For examples, see the files in the tests/test_input and tests/test_output directories.
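
As a rough illustration of the format, the sketch below parses emtsv-style TSV into sentences. It assumes only what the library example above relies on: the first line is the header with the column names, each token is one tab-separated line, and an empty line ends a sentence. The helper name read_sentences and the sample data are made up for illustration, CoNLL-style comment lines are not handled, and the xtsv specification remains authoritative.

```python
from typing import Dict, Iterable, Iterator, List


def read_sentences(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dicts (hypothetical helper)."""
    lines = iter(lines)
    header_line = next(lines, None)
    if header_line is None:
        return  # Empty input: no header, no sentences
    header = header_line.rstrip('\n').split('\t')
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip('\n')
        if not line:  # Empty line = end of sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(header, line.split('\t'))))
    if sentence:  # Last sentence without a trailing empty line
        yield sentence


# Sample data modelled on the form/xpostag example above (assumed, not real output)
sample = [
    'form\txpostag\n',
    'A\t[/Det|Art.Def]\n',
    'kutya\t[/N][Nom]\n',
    '\n',
    'Elment\t[/V][Pst.NDef.3Sg]\n',
    '.\t.\n',
    '\n',
]

sentences = list(read_sentences(sample))
for sent in sentences:
    print(' '.join(tok['form'] for tok in sent))
```

Mapping each token to a dictionary keyed by the header keeps the code independent of column order, which matters because different module chains add different columns.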

Modules

Modules are defined in config.py. The current toolchain consists of the following modules which can be called by their names (or using their shorthand names in brackets):

For an overview, see the topology of the current toolchain.

The following presets are defined as shorthands for the common tasks:

Example

The examples presented above simply call main.py with the parameters tok,spell,morph,pos,conv-morph,dep,chunk,ner, take the input from STDIN and write the output to STDOUT. Here we use a tokenizer, a spellchecker, a morphological analyzer, a POS tagger, a morphology converter, a dependency parser, a chunker and a named entity recognizer. (The converter is needed because the POS tagger and the dependency parser work with different morphological coding systems.)
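
The same kind of call can be scripted from Python. The sketch below assembles the command line described here (main.py plus a comma-separated module list) and wraps it in a function that feeds text on STDIN and returns what arrives on STDOUT. The helper names emtsv_command and run_emtsv, the python3 interpreter name and the relative path main.py are assumptions for illustration; adjust them to the local installation.

```python
import subprocess
from typing import List, Sequence


def emtsv_command(modules: Sequence[str], main_py: str = 'main.py') -> List[str]:
    """Build the argument list for a CLI run (hypothetical helper)."""
    if not modules:
        raise ValueError('At least one module name is required')
    return ['python3', main_py, ','.join(modules)]


def run_emtsv(text: str, modules: Sequence[str], main_py: str = 'main.py') -> str:
    """Send text to main.py on STDIN and return its STDOUT (needs a working emtsv install)."""
    result = subprocess.run(emtsv_command(modules, main_py), input=text,
                            capture_output=True, encoding='UTF-8', check=True)
    return result.stdout


modules = ['tok', 'spell', 'morph', 'pos', 'conv-morph', 'dep', 'chunk', 'ner']
cmd = emtsv_command(modules)
print(cmd)

# With emtsv installed and main.py in the working directory, one could then call:
# print(run_emtsv('A kutya elment sétálni.\n', modules))
```

Passing the arguments as a list avoids shell quoting issues, and check=True turns a failing pipeline run into an exception instead of silently returning empty output.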

Creating a module

<!-- ## Work in progress _WARNING:_ Everything below is at most in beta (or just a plan which may be realized or not). Things below may break without further notice! for __SOMEDAY__: - `emCons` (works but rather slow) - `CoNLL-U importer` -->

Further documentation