xtsv – A generic TSV-style format based intermodular communication framework and REST API implemented in Python

If you find a bug, please report it with the exact details needed to reproduce it.

Citing and License

xtsv is licensed under the LGPL 3.0 license. The submodules have their own licenses.

If you use this library, please cite the following paper:

Indig, Balázs, Bálint Sass, and Iván Mittelholcz. "The xtsv Framework and the Twelve Virtues of Pipelines." Proceedings of The 12th Language Resources and Evaluation Conference. 2020.

@inproceedings{indig-etal-2020-xtsv,
    title = "The xtsv Framework and the Twelve Virtues of Pipelines",
    author = "Indig, Bal{\'a}zs  and
      Sass, B{\'a}lint  and
      Mittelholcz, Iv{\'a}n",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.871",
    pages = "7044--7052",
    abstract = "We present xtsv, an abstract framework for building NLP pipelines. It covers several kinds of functionalities which can be implemented at an abstract level. We survey these features and argue that all are desired in a modern pipeline. The framework has a simple yet powerful internal communication format which is essentially tsv (tab separated values) with header plus some additional features. We put emphasis on the capabilities of the presented framework, for example its ability to allow new modules to be easily integrated or replaced, or the variety of its usage options. When a module is put into xtsv, all functionalities of the system are immediately available for that module, and the module can be be a part of an xtsv pipeline. The design also allows convenient investigation and manual correction of the data flow from one module to another. We demonstrate the power of our framework with a successful application: a concrete NLP pipeline for Hungarian called e-magyar text processing system (emtsv) which integrates Hungarian NLP tools in xtsv. All the advantages of the pipeline come from the inherent properties of the xtsv framework.",
    language = "English",
    ISBN = "979-10-95546-34-4",
}

Requirements

API documentation

To be defined by the actual pipeline:

Data format

The input and output can be one of the following:

The TSV files are formatted as follows (closely resembling the CoNLL-U, vertical format):

The fields (represented by TSV columns) are identified by the header in the first line of the input. Each module can (but does not necessarily have to) define:

The following types of modules can be defined by their input and output format requirements:
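
To illustrate the header-driven format described in this section, the sketch below maps field names to column indices from the header line and extends the header with a module's target fields. The helper names `index_fields` and `extend_header` are hypothetical and not part of xtsv; the sample data reuses the `form` and `xpostag` columns that appear elsewhere in this README.

```python
def index_fields(header_line):
    """Map each field name in a tab-separated header line to its column index."""
    names = header_line.rstrip('\n').split('\t')
    return {name: i for i, name in enumerate(names)}


def extend_header(header_line, target_fields):
    """Return a new header line with the target field names appended as columns."""
    names = header_line.rstrip('\n').split('\t')
    return '\t'.join(names + list(target_fields)) + '\n'


header = 'form\txpostag\n'
columns = index_fields(header)

# Look a column up by its name instead of relying on a fixed position
token = 'kutya\t[/N][Nom]\n'.rstrip('\n').split('\t')
form = token[columns['form']]

# A module producing a 'star' field would add one column to the header
new_header = extend_header(header, ['star'])
```

Addressing columns by name rather than by position is what lets modules be reordered or replaced without touching each other's code.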

Creating a module that can be used with xtsv

We strive to be a welcoming open source community. In agreement with the license, everybody is free to create a new compatible module without asking for permission.

The following requirements apply for a new module:

  1. It must provide (at least) the mandatory API (see emDummy for a well-documented example)
  2. It must conform to the (to be defined) field-name conventions and the format conventions
  3. It must have an LGPL 3.0 compatible license (as all modules communicate through the thin xtsv API, the xtsv license places no further restriction or obligation on the module's own license. This is not legal advice!)
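
As a rough sketch of what such a module might look like, the class below is loosely modelled on emDummy. The method names (`prepare_fields`, `process_sentence`) and the `pass_header` attribute reflect one reading of emDummy and should be treated as assumptions; emDummy itself remains the authoritative reference. The class name `StarTagger` is hypothetical.

```python
class StarTagger:
    """Hypothetical module: copies a source field, prefixed with '*', into a new column."""

    pass_header = True  # Assumption: tells the framework to pass the header on

    def __init__(self, *_, source_fields=None, target_fields=None):
        # Names of the fields this module reads and the fields it adds
        self.source_fields = source_fields if source_fields is not None else set()
        self.target_fields = target_fields if target_fields is not None else []

    def prepare_fields(self, field_names):
        """Translate the source field names into column indices."""
        return [field_names[name] for name in sorted(self.source_fields)]

    def process_sentence(self, sen, field_indices):
        """Append one new value to every token (each token is a list of field values)."""
        for tok in sen:
            tok.append('*' + tok[field_indices[0]])
        return sen


tagger = StarTagger(source_fields={'form'}, target_fields=['star'])
indices = tagger.prepare_fields({'form': 0, 'xpostag': 1})
sentence = [['A', '[/Det|Art.Def]'], ['kutya', '[/N][Nom]']]
tagged = tagger.process_sentence(sentence, indices)
```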

The following technical steps are needed to insert the new module into the pipeline:

  1. Add the new module package as a requirement to the requirements.txt of the pipeline's main repository (e.g. emtsv)

  2. Insert the configuration in config.py:

    # Setup the tuple:
    #   module name,
    #   class,
    #   friendly name,
    #   args (tuple),
    #   kwargs (dict)
    em_dummy = (
        'emdummy',
        'EmDummy',
        'EXAMPLE (The friendly name of DummyTagger used in REST API form)',
        ('Params', 'goes', 'here'),
        {
            'source_fields': {'Source field names'},
            'target_fields': ['Target field names'],
            'other': 'kwargs as needed',
        }
    )
    
  3. Add the new module to the tools list in config.py, and optionally also to the presets dictionary:

    tools = [
        ...,
        (em_dummy, ('dummy-tagger', 'emDummy')),
    ]
    
  4. Update README.md with a short description of the newly added module and add the necessary documentation (e.g. extra installation instructions)

  5. Test, commit and push (create a pull request if you want to include your module in someone else's pipeline)
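
To make the role of the `tools` list more concrete, the sketch below resolves a tool alias to its configuration tuple. The `resolve_tool` helper is hypothetical and only illustrates the data structure shown in the steps above; xtsv's own lookup logic may differ.

```python
em_dummy = (
    'emdummy',                   # module name
    'EmDummy',                   # class
    'EXAMPLE',                   # friendly name
    ('Params', 'goes', 'here'),  # args (tuple)
    {'source_fields': {'form'}, 'target_fields': ['star']},  # kwargs (dict)
)
tools = [(em_dummy, ('dummy-tagger', 'emDummy'))]


def resolve_tool(alias, available_tools):
    """Return the configuration tuple registered under the given alias."""
    for config, aliases in available_tools:
        if alias in aliases:
            return config
    raise KeyError(f'Unknown tool: {alias}')


module_name, class_name, friendly_name, args, kwargs = resolve_tool('emDummy', tools)
```

Registering several aliases per tool means the same module can be requested under whichever name is most convenient on the command line or in a REST URL.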

Installation

Usage

Here we present the usage scenarios.

To extend the toolchain with new modules, just add them to config.py.

Some examples of the realised applications:

Command-line interface

Docker image

With the appropriate Dockerfile, xtsv can be used as follows:

As a service through the REST API (docker container)

docker run --rm -p5000:5000 -it xtsv-docker  # REST API listening on http://0.0.0.0:5000

REST API

Server

Client

As Python Library

  1. Install the xtsv package, or make sure the main pipeline's installation is listed in the PYTHONPATH environment variable.

  2. import xtsv

  3. Example:

    import sys
    from xtsv import build_pipeline, parser_skeleton, jnius_config, process, pipeline_rest_api, singleton_store_factory
    # Imports end here. Must do only once per Python session
    
    argparser = parser_skeleton(description='An example pipeline for xtsv')
    opts = argparser.parse_args()
    
    jnius_config.classpath_show_warning = opts.verbose  #  False to suppress warning
    
    # Set input from any stream, iterator or raw string in any acceptable format
    if opts.input_text is not None:
        # Raw, or processed TSV input list and output file...
        # input_data = ['A kutya', 'elment sétálni.']  # Raw text line by line
        # Processed data: header and the token POS-tag pairs line by line
        # input_data = [['form', 'xpostag'], ['A', '[/Det|Art.Def]'], ['kutya', '[/N][Nom]'], ['elment', '[/V][Pst.NDef.3Sg]'], ['sétálni', '[/V][Inf]'], ['.', '.']]
        input_data = opts.input_text
    else:
        # Set input from any stream or iterable and output stream...
        input_data = opts.input_stream
    
    # Set output iterator: e.g. output_iterator = open('output.txt', 'w', encoding='UTF-8')  # File
    output_iterator = opts.output_stream
    
    # Select a predefined task to do or provide your own list of pipeline elements
    # i.e. set the tagger name as in the tools list in config.py, e.g. used_tools = ['dummy']
    used_tools = ['tools', 'in', 'a', 'list']
    presets = []
    
    # The relevant part of config.py
    # from emdummy import EmDummy
    em_dummy = ('emdummy', 'EmDummy', 'EXAMPLE (The friendly name of EmDummy used in REST API form)',
                ('Params', 'goes', 'here'), {'source_fields': {'form'},  # Source field names
                                             'target_fields': ['star']})  # Target field names
    tools = [(em_dummy, ('dummy', 'dummy-tagger', 'emDummy'))]
    
    
    # Run the pipeline on input and write result to the output...
    # You can enable or disable CoNLL-U style comments here (default: disabled)
    output_iterator.writelines(build_pipeline(input_data, used_tools, tools, presets, opts.conllu_comments,
                                              opts.output_header))
    
    # Alternative: Run a specific tool on an input stream (still in emtsv format).
    # Useful for training a module (see Huntag3 for details):
    # e.g. output_iterator.writelines(process(input_data, EmDummy(*em_dummy[3], **em_dummy[4])))
    # NOTE: Module is a placeholder here; substitute your own module class and its parameters
    output_iterator.writelines(process(sys.stdin, Module('with', 'params')))
    
    # Or process individual tokens further... WARNING: The header will be the
    # first item in the iterator!
    for tok in build_pipeline(input_data, used_tools, tools, presets, opts.conllu_comments, opts.output_header):
        if len(tok) > 1:  # Empty line (='\n') means end of sentence
            form, xpostag, *rest = tok.strip().split('\t')  # Split to the expected columns
    
    # Alternative 2: Run the REST API debug server
    app = pipeline_rest_api(name='TEST', available_tools=tools, presets=presets,
                            conll_comments=opts.conllu_comments, singleton_store=singleton_store_factory(),
                            form_title='TEST TITLE', doc_link='https://github.com/nytud/xtsv',
                            output_header=opts.output_header)
    # And run the Flask debug server separately
    app.run()
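
The token loop in the example above can be extended into a small generator that groups output lines into sentences. This is only a sketch: it relies on the conventions stated in the comments of the example (the header is the first item, an empty line ends a sentence), the function name `iter_sentences` is hypothetical, and the sample lines stand in for real pipeline output.

```python
def iter_sentences(lines):
    """Yield sentences as lists of {field name: value} dicts from TSV output lines."""
    lines = iter(lines)
    header_line = next(lines, None)  # The first item is the header
    if header_line is None:
        return
    header = header_line.rstrip('\n').split('\t')
    sentence = []
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('# '):  # Assumption: CoNLL-U style comments start with '# '
            continue
        if not line:  # An empty line marks the end of a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(header, line.split('\t'))))
    if sentence:  # Last sentence without a trailing empty line
        yield sentence


sample = ['form\txpostag\n', 'A\t[/Det|Art.Def]\n', 'kutya\t[/N][Nom]\n', '\n',
          'elment\t[/V][Pst.NDef.3Sg]\n', '.\t.\n']
sentences = list(iter_sentences(sample))
```

Because the function is a generator, it can consume the pipeline output lazily, one sentence at a time, without holding the whole document in memory.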