Software Documentation Data Set for Machine Translation

A parallel evaluation data set of SAP software documentation with document structure annotation

Overview

The data in this data set originates from the SAP Help Portal, which contains documentation for SAP products and user assistance for product-related questions. The current language scope is English to Hindi, Indonesian, Japanese, Korean, Malay, Thai, Vietnamese, Simplified Chinese and Traditional Chinese. The data may be used for research purposes only.

The data has been processed to make it suitable as development and test data for machine translation. For each language pair, about 4,000 segments are available, split into development and test data. The segments are provided in their document context and are annotated with additional metadata from the document. The details are described below. More information can be found in Buschbeck and Exel (2020).

The data for the language pairs English to Japanese, Korean, Simplified and Traditional Chinese is special in two ways:

The software documentation data set for machine translation was initially released by SAP for the 7th Workshop on Asian Translation (WAT 2020). It was also used in the 8th Workshop on Asian Translation (WAT 2021). It was extended with additional language pairs and structured documents for the 9th Workshop on Asian Translation (WAT 2022). Note that for this extension currently only the development data has been released. The test data will follow shortly before the deadline of the shared task.

Language scope

Language pair                  Abbreviation  Structured documents
English - Hindi                enhi          -
English - Indonesian           enid          -
English - Malay                enms          -
English - Thai                 enth          -
English - Vietnamese           envi          -
English - Japanese             enja          yes
English - Korean               enko          yes
English - Simplified Chinese   enzh          yes
English - Traditional Chinese  enzf          yes

Data Format

The plain-text data is represented in three text files that are aligned at segment level: source, target and metadata. In the file names, xx stands for the respective target language. All files are UTF-8 encoded.

File name                                    Content
software_documentation.[dev|test].enxx.en    source segments of development/test set
software_documentation.[dev|test].enxx.xx    target segments of development/test set
software_documentation.[dev|test].enxx.meta  metadata of the source-target pairs of the development/test set (tab separated)
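
Because the three files are aligned line by line, they can be read in parallel. The following is a minimal sketch, assuming all three files sit in one folder; the function names and the data_dir argument are illustrative and not part of the data set.

```python
from pathlib import Path


def read_lines(path):
    """Return the lines of a UTF-8 text file without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def load_split(data_dir, split, pair):
    """Load aligned (source, target, metadata) triples for one split.

    data_dir is a hypothetical folder that holds the three files,
    split is "dev" or "test", pair is an abbreviation such as "enja".
    """
    target_lang = pair[2:]  # "enja" -> "ja"
    stem = Path(data_dir) / f"software_documentation.{split}.{pair}"
    source = read_lines(f"{stem}.en")
    target = read_lines(f"{stem}.{target_lang}")
    meta = read_lines(f"{stem}.meta")
    # The files are aligned on segment level, so the line counts must match.
    if not (len(source) == len(target) == len(meta)):
        raise ValueError("source, target and metadata files are not aligned")
    return list(zip(source, target, meta))
```

A call such as load_split("data", "dev", "enja") would then return one tuple per segment, in file order.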

The data that is available as full structured documents is provided in XLIFF format, one file per document, in a documents subfolder per language pair. For convenience, we also provide all translatable segments concatenated, with inline tags in the original DITA format.

File name                                                                 Content
software_documentation.source-text-dita-translatables.[dev|test].enxx.en  source segments of development/test set with DITA inline markup (if applicable)
software_documentation.target-text-dita-translatables.[dev|test].enxx.xx  target segments of development/test set with DITA inline markup (if applicable)

Document context metadata

For each plain-text segment pair, positional metadata was recorded to provide context information. It is available in the *.meta file, aligned with the source and target segments, and contains the following five columns:

Column  Description of column content
1       Document ID
2       Segment ID in the document that indicates the contextual order (restarts from 1 in each document)
3       Text Unit ID in the document that indicates segments that occur in consecutive order (starts from 1 in each document). Segments with the same Text Unit ID make up one text block consisting of multiple sentences, for example a paragraph.
4       Segment ID in Text Unit (starts from 1 in each Text Unit)
5       Textual element that describes the structural type of the segment. Values are title, section, table_element, list_element, example, unspecified
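
The five columns can be used to rebuild text blocks from individual segments. The sketch below assumes that columns 2 to 4 are plain integers and that the document ID is an opaque string; the SegmentMeta type and both function names are made up for illustration.

```python
from typing import NamedTuple


class SegmentMeta(NamedTuple):
    doc_id: str           # column 1
    segment_id: int       # column 2
    text_unit_id: int     # column 3
    segment_in_unit: int  # column 4
    element_type: str     # column 5, e.g. "title" or "list_element"


def parse_meta_line(line):
    """Parse one tab-separated metadata line into a SegmentMeta record."""
    doc_id, seg, unit, seg_in_unit, element = line.rstrip("\n").split("\t")
    return SegmentMeta(doc_id, int(seg), int(unit), int(seg_in_unit), element)


def group_text_units(meta_lines, segments):
    """Group segments by (document ID, Text Unit ID), keeping file order."""
    units = {}
    for line, segment in zip(meta_lines, segments):
        meta = parse_meta_line(line)
        units.setdefault((meta.doc_id, meta.text_unit_id), []).append(segment)
    return units
```

Joining the segments of one group gives the multi-sentence block (for example a paragraph) that a document-level system could translate as a whole.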

Structured documents

For enja, enko, enzf and enzh, the source and translated data is also provided as complete structured documents including inline markup in XLIFF (.xlf) format. File names correspond to document IDs (column 1 in the metadata file). XLIFF is an XML-based format for storing bitext which was created to standardize the way localizable data is passed between tools in a localization process.

This data has been created by converting it from the original DITA format. Much of the original DITA format can be restored by literally using the DITA tags masked by XLIFF tags (ph, bpt, ept).
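
As a rough illustration of this unmasking, the following sketch reads an XLIFF string with the Python standard library and concatenates all text inside source and target, including the literal content of ph, bpt and ept. It assumes XLIFF 1.2 (suggested by the XLF12 prefix of the provided stylesheet names); the sample document and the function names are invented and not taken from the data set.

```python
import xml.etree.ElementTree as ET

# Namespace of XLIFF 1.2 (assumed version).
XLIFF_NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}

# Hypothetical miniature document for illustration only.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="doc1" source-language="en" target-language="ja" datatype="xml">
    <body>
      <trans-unit id="1">
        <source>Choose <bpt id="1">&lt;uicontrol&gt;</bpt>Save<ept id="1">&lt;/uicontrol&gt;</ept>.</source>
        <target><bpt id="1">&lt;uicontrol&gt;</bpt>保存<ept id="1">&lt;/uicontrol&gt;</ept>を選択します。</target>
      </trans-unit>
    </body>
  </file>
</xliff>"""


def unmask_inline(element):
    """Concatenate all text below a <source> or <target> element.

    The content of <ph>, <bpt> and <ept> is kept literally, which brings
    the masked DITA tags back into the segment text.
    """
    if element is None:
        return ""
    return "".join(element.itertext())


def iter_segments(xliff_text):
    """Yield (trans-unit id, source text, target text) with DITA tags unmasked."""
    root = ET.fromstring(xliff_text)
    for unit in root.iterfind(".//x:trans-unit", XLIFF_NS):
        source = unit.find("x:source", XLIFF_NS)
        target = unit.find("x:target", XLIFF_NS)
        yield unit.get("id"), unmask_inline(source), unmask_inline(target)


for unit_id, source_text, target_text in iter_segments(SAMPLE):
    print(unit_id, source_text, target_text, sep="\t")
```

Escaped characters in ordinary text are unescaped as well, so the result is not guaranteed to be well-formed DITA. For files on disk, ET.parse(path).getroot() can replace ET.fromstring.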

Context Structure

The textual element information in column 5 of the metadata files can also be found in the XLIFF files inside the context-group/context element of the nearest group. In addition to the types in the .meta files, the following types can also appear for "non-text" trans-units: concept, code, related, prolog.
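
One possible way to look up this information is sketched below. It reads "nearest group" as the closest enclosing group element that has a context-group/context child, and it assumes that the type is stored as the text of the context element. Both points are assumptions, as are the sample document and its attribute values.

```python
import xml.etree.ElementTree as ET

NS_URI = "urn:oasis:names:tc:xliff:document:1.2"  # XLIFF 1.2, assumed
XLIFF_NS = {"x": NS_URI}

# Hypothetical structure for illustration only.
SAMPLE = """<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
  <file original="doc1" source-language="en" datatype="xml">
    <body>
      <group id="g1">
        <context-group name="structure">
          <context context-type="x-type">section</context>
        </context-group>
        <trans-unit id="1"><source>Intro</source></trans-unit>
        <group id="g2">
          <context-group name="structure">
            <context context-type="x-type">list_element</context>
          </context-group>
          <trans-unit id="2"><source>Item</source></trans-unit>
        </group>
        <group id="g3">
          <trans-unit id="3"><source>More</source></trans-unit>
        </group>
      </group>
    </body>
  </file>
</xliff>"""


def element_types(root):
    """Map each trans-unit id to the context text of its nearest group."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    result = {}
    for unit in root.iter(f"{{{NS_URI}}}trans-unit"):
        label = None
        node = parents.get(unit)
        # Walk upwards until a group with a context element is found.
        while node is not None and label is None:
            if node.tag == f"{{{NS_URI}}}group":
                context = node.find("x:context-group/x:context", XLIFF_NS)
                if context is not None:
                    label = (context.text or "").strip()
            node = parents.get(node)
        result[unit.get("id")] = label
    return result


print(element_types(ET.fromstring(SAMPLE)))
```

In the sample, trans-unit 3 sits in a group without its own context and therefore inherits the type of the enclosing group.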

Locked references

The documents contain certain placeholders that reference textual content inside or outside the document. In the plain-text data, they have been replaced by <locked-ref> as just removing them would render the segments incomplete and ungrammatical.
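
When evaluating machine translation output on the plain-text data, it can be useful to check whether the placeholders survive translation. The following sketch counts <locked-ref> occurrences and reports segment pairs whose counts differ. The helper names are illustrative, and whether source and reference always contain the same number of placeholders is an assumption that has not been verified against the data.

```python
LOCKED_REF = "<locked-ref>"  # placeholder string used in the plain-text data


def count_locked_refs(segment):
    """Count the locked-reference placeholders in one segment."""
    return segment.count(LOCKED_REF)


def mismatched_pairs(sources, translations):
    """Return the indices of pairs whose placeholder counts differ."""
    return [
        index
        for index, (source, translation) in enumerate(zip(sources, translations))
        if count_locked_refs(source) != count_locked_refs(translation)
    ]
```

A mismatch is only a hint that a placeholder was dropped, duplicated or altered, not proof of a translation error.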

In the structured documents, they are represented by <mrk mtype="protected"> tags and the "hidden" information has been re-inserted inline.
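
To illustrate the relation between the two representations, the sketch below collapses every protected mrk element into the <locked-ref> placeholder and, separately, lists the re-inserted text. It assumes XLIFF 1.2 namespaces; the sample segment and function names are invented, and the output has not been checked to match the plain-text files exactly (other inline markup is kept as is).

```python
import xml.etree.ElementTree as ET

NS_URI = "urn:oasis:names:tc:xliff:document:1.2"  # XLIFF 1.2, assumed
MRK = f"{{{NS_URI}}}mrk"

# Hypothetical segment for illustration only.
SAMPLE_SOURCE = (
    '<source xmlns="urn:oasis:names:tc:xliff:document:1.2">'
    'For details, see <mrk mtype="protected">Creating Users</mrk>.'
    "</source>"
)


def is_protected(element):
    """True for <mrk mtype="protected"> elements."""
    return element.tag == MRK and element.get("mtype") == "protected"


def collapse_protected(element, placeholder="<locked-ref>"):
    """Return the segment text with protected spans replaced by a placeholder."""
    parts = [element.text or ""]
    for child in element:
        if is_protected(child):
            parts.append(placeholder)
        else:
            parts.append(collapse_protected(child, placeholder))
        parts.append(child.tail or "")
    return "".join(parts)


def protected_spans(element):
    """Return the text of all protected spans in document order."""
    return ["".join(m.itertext()) for m in element.iter(MRK) if is_protected(m)]


source = ET.fromstring(SAMPLE_SOURCE)
print(collapse_protected(source))
print(protected_spans(source))
```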

Transforming XLIFF files

The XLIFF files can be transformed to more suitable representations for different purposes. As an example, a set of XSL stylesheets is provided to transform the XLIFF files to simpler formats. They are located in the tools folder and can be applied automatically by running SAXONJAR=/the/path/to/saxon-he.jar ./tools/apply-all.sh.

The provided stylesheets perform the following transformations:

For convenience, the results of applying XLF12_to_source_text-dita-translatables.xsl and XLF12_to_target_text-dita-translatables.xsl and then concatenating all source/target documents are provided as software_documentation.source-text-dita-translatables.[dev|test].enxx.en and software_documentation.target-text-dita-translatables.[dev|test].enxx.xx respectively.

Prerequisites: Applying the stylesheets requires an installation of Java and a copy of Saxon.

There also exist multiple open source libraries to process XLIFF files. When working in Java, the Okapi Framework provides good support for handling XLIFF files.

Particularities

License

This project is licensed under Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) except as noted otherwise in the LICENSE file.

Please cite

Bianka Buschbeck and Miriam Exel (2020). "A parallel evaluation data set of software documentation with document structure annotation".

when you use this data set.

Detailed information including third-party components and their licensing/copyright information is available via the REUSE tool.

Contributing

We welcome contributions to this project. Please see the contribution guidelines for more details on how to contribute.