REMEDI Tools
This project contains a number of tools that can be used in conjunction with REMEDI, for example tools that pre-process text before translation and post-process the translated output.
Filter
Takes a directory of JSON files and filters them based on certain properties. The expected format of the JSON is as follows:
{
"source": "The source text in a foreign language",
"translation": "The translated text in the target language",
"rating": 4, # The rating of the translation
"correctedTranslation": "Optional corrected translation (null if translation hasn't been corrected)",
"sourceLanguage": "The source language of the text",
"targetLanguage": "The target language of the text",
"author": "The person providing the rating and/or translation",
"timestamp": 1531304578240 # The timestamp that the rating/correction was provided
}
The command to filter the data is as follows:
java -cp filter-1.0-SNAPSHOT.jar uk.gov.nca.remedi.filter.JsonFilter \
-i [Input Directory] \
-o [Output file root, excluding the language extension] \
-l [Language code for the foreign language] \
-n [Name of the foreign language] \
-c [Cut off (minimum rating) - optional]
The filter outputs two files, one English and one in the foreign language, containing sentence pairs (one per line). A pair is included if it matches the specified language and either has a corrected translation or has a rating equal to or greater than the cut off.
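The selection rule described above can be sketched in Java as follows. This is a minimal illustration, not the actual JsonFilter source: the FilterSketch class, the Entry type, and the keep and englishSide methods are hypothetical names. It assumes the language is matched against the sourceLanguage field, and that the corrected translation, when present, is the text written to the English file; the description above confirms neither.

```java
final class FilterSketch {

    /** Hypothetical in-memory form of one JSON record (field names follow the format above). */
    static final class Entry {
        final String source;
        final String translation;
        final int rating;
        final String correctedTranslation; // null when no correction was provided
        final String sourceLanguage;

        Entry(String source, String translation, int rating,
              String correctedTranslation, String sourceLanguage) {
            this.source = source;
            this.translation = translation;
            this.rating = rating;
            this.correctedTranslation = correctedTranslation;
            this.sourceLanguage = sourceLanguage;
        }
    }

    /**
     * Keep an entry when its language matches AND it either has a corrected
     * translation or a rating of at least cutOff.
     * Assumption: the language is compared against sourceLanguage.
     */
    static boolean keep(Entry e, String language, int cutOff) {
        if (!language.equalsIgnoreCase(e.sourceLanguage)) {
            return false;
        }
        return e.correctedTranslation != null || e.rating >= cutOff;
    }

    /** Assumption: a corrected translation, when present, replaces the original one. */
    static String englishSide(Entry e) {
        return e.correctedTranslation != null ? e.correctedTranslation : e.translation;
    }

    public static void main(String[] args) {
        Entry corrected = new Entry("privet", "hi", 2, "Hello", "ru");
        Entry lowRated = new Entry("poka", "by", 2, null, "ru");
        System.out.println(keep(corrected, "ru", 3) + " " + englishSide(corrected));
        System.out.println(keep(lowRated, "ru", 3));
    }
}
```

An entry with a low rating is still kept when it carries a correction, because the two conditions are joined by "or"; only the language match is mandatory.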
This tool is licensed under ASLv2.
Processor
Performs pre- and post-processing tasks on text. It is intended primarily for preparing training data and for use by the REMEDI processor server.
Contains 5 separate classes which can be run individually:
CleanParallelCorpus
- Takes a parallel corpus and performs some cleaning tasks on it, such as deduplication and removing empty lines
PostProcessor
- Takes a translated string, detokenizes it and performs true casing on it
PrepareTrainingData
- Tokenizes text from a file, and transliterates Cyrillic characters using the GOST 2002b standard
PreProcessor
- Tokenizes and transliterates text, and detects language if required
ShrinkParallelCorpus
- Randomly selects a subset of sentences from a parallel corpus to reduce the size of a dataset
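To illustrate what cleaning and shrinking a parallel corpus involve, here is a minimal Java sketch. It is not the code of CleanParallelCorpus or ShrinkParallelCorpus: the CorpusSketch class and its clean and shrink methods are hypothetical. It assumes the corpus is held as two equally long lists of lines, that a pair is dropped when either side is empty after trimming, and that duplicates are detected on the full (source, target) pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

final class CorpusSketch {

    /** Removes pairs with an empty side and duplicate pairs, keeping first occurrences in order. */
    static List<String[]> clean(List<String> source, List<String> target) {
        if (source.size() != target.size()) {
            throw new IllegalArgumentException("Both sides must have the same number of lines");
        }
        Set<List<String>> seen = new HashSet<>();
        List<String[]> cleaned = new ArrayList<>();
        for (int i = 0; i < source.size(); i++) {
            String s = source.get(i).trim();
            String t = target.get(i).trim();
            if (s.isEmpty() || t.isEmpty()) {
                continue; // an empty line on either side breaks the pair
            }
            if (seen.add(Arrays.asList(s, t))) {
                cleaned.add(new String[] {s, t});
            }
        }
        return cleaned;
    }

    /** Randomly selects at most n pairs; source and target stay aligned because pairs move together. */
    static List<String[]> shrink(List<String[]> pairs, int n, long seed) {
        List<String[]> shuffled = new ArrayList<>(pairs);
        Collections.shuffle(shuffled, new Random(seed));
        int size = Math.max(0, Math.min(n, shuffled.size()));
        return new ArrayList<>(shuffled.subList(0, size));
    }

    public static void main(String[] args) {
        List<String[]> cleaned = clean(
                Arrays.asList("a", "", "a", "b "),
                Arrays.asList("x", "y", "x", "z"));
        for (String[] pair : shrink(cleaned, 1, 42L)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Both steps operate on whole pairs rather than on each file separately, so the two sides of the corpus cannot drift out of alignment.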
This tool is licensed under ASLv2.
Stanford Truecaser
Uses Stanford CoreNLP to perform true casing of lower case text.
It implements the uk.gov.nca.remedi.utils.TrueCaser interface, allowing it to be used as part of the post processor for correcting the case of translation output.
This tool is licensed under GPLv3.