# Ersatz

Ersatz is a simple, language-agnostic toolkit for training sentence segmentation models, and it provides pretrained, high-performing models for sentence segmentation in a multilingual setting.
For more information, please see:
- Rachel Wicks and Matt Post (2021): "A unified approach to sentence segmentation of punctuated text in many languages". In Proceedings of ACL.
## Quick Start
### Install

Install the Python (3.7+) module via pip:

```
pip install ersatz
```

or from source:

```
python setup.py install
```
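To sanity-check either install, the package should be importable from Python (a quick smoke test; it assumes the installed package exposes an `ersatz` Python module matching the CLI name):

```python
# Quick smoke test: the package should be importable after installation.
import ersatz
print("ersatz imported from", ersatz.__file__)
```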
### Splitting

Ersatz accepts input either from standard input or from a file path, and it produces output in the same manner:

```
cat raw.txt | ersatz > output.txt
ersatz --input raw.txt --output output.txt
```
To use a specific model (rather than the default), you can pass a name via `--model_name`, or a path via `--model_path`.
### Scoring

Ersatz also provides a simple scoring script which computes F1 against a given segmented reference file:

```
ersatz_score GOLD_STANDARD_FILE FILE_TO_SCORE
```

The above prints all errors as well as additional metrics at the bottom. The accompanying test suite can be found here.
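For intuition, segmentation F1 can be framed as precision and recall over predicted sentence boundaries. The sketch below is one way to compute such a score; it is not the actual `ersatz_score` implementation, and defining a boundary as a character offset into the whitespace-stripped text is an assumption made for illustration.

```python
# Minimal sketch of segmentation F1 over boundary positions.
# NOT the actual ersatz_score implementation: here a "boundary" is a
# character offset into the concatenated, whitespace-free text.

def boundaries(lines):
    """Return the set of offsets at which some sentence ends."""
    offsets, pos = set(), 0
    for line in lines:
        pos += len(line.replace(" ", ""))
        offsets.add(pos)
    return offsets

def segmentation_f1(gold_lines, pred_lines):
    gold, pred = boundaries(gold_lines), boundaries(pred_lines)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["Hello there.", "How are you?"]
pred = ["Hello there.", "How are", "you?"]
print(segmentation_f1(gold, pred))  # 0.8
```

Note that counting the final offset credits the trivially correct end-of-text boundary; a stricter scorer would exclude it.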
## Training a Model

### Data Preprocessing

#### Vocabulary
Requires a pretrained sentencepiece model in which `--eos_piece` has been replaced with `<eos>` and `--bos_piece` with `<mos>`:
```
spm_train --input $TRAIN_DATA_PATH \
          --model_prefix ersatz \
          --bos_piece "<mos>" \
          --eos_piece "<eos>"
```
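After training, it is worth checking that both special pieces made it into the vocabulary. A quick check with the sentencepiece Python API (the `ersatz.model` filename follows from `--model_prefix ersatz` above):

```python
import sentencepiece as spm

# Load the model written by spm_train (--model_prefix ersatz -> ersatz.model).
sp = spm.SentencePieceProcessor(model_file="ersatz.model")

# Both special pieces should resolve to real vocabulary ids.
for piece in ("<eos>", "<mos>"):
    print(piece, "->", sp.piece_to_id(piece))
```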
#### Create training data
This pipeline takes a raw text file with one sentence per line (to use as labels) and creates a new raw text file with the appropriate left/right context and labels. One line is one training example. The user is expected to shuffle this file manually (e.g., via `shuf`) after creation.
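To make the format concrete, here is a conceptual sketch of the windowing: every inter-token gap becomes one example with up to `LEFT_SIZE` tokens of left context, up to `RIGHT_SIZE` tokens of right context, and an `<eos>`/`<mos>` label. This is an illustration only, using whitespace tokens; `dataset.py` tokenizes with the sentencepiece model, and its exact output format may differ.

```python
# Conceptual sketch of what one training example looks like: a gap
# between tokens, its left/right context, and an <eos>/<mos> label.
# Illustration only, using whitespace tokens; dataset.py tokenizes with
# the sentencepiece model and its exact output format may differ.

LEFT_SIZE, RIGHT_SIZE = 4, 2

def windowed_examples(sentences):
    tokens, sentence_ends = [], set()
    for sentence in sentences:
        tokens.extend(sentence.split())
        sentence_ends.add(len(tokens))   # a true boundary follows here
    for i in range(1, len(tokens)):      # every inter-token gap is a candidate
        left = tokens[max(0, i - LEFT_SIZE):i]
        right = tokens[i:i + RIGHT_SIZE]
        label = "<eos>" if i in sentence_ends else "<mos>"
        yield " ".join(left), " ".join(right), label

for left, right, label in windowed_examples(["Hello there.", "How are you?"]):
    print(f"{left}\t{right}\t{label}")
```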
- To create the training data:

```
python dataset.py \
    --sentencepiece_path $SPM_PATH \
    --left-size $LEFT_SIZE \
    --right-size $RIGHT_SIZE \
    --output_path $OUTPUT_PATH \
    $INPUT_TRAIN_FILE_PATHS

shuf $OUTPUT_PATH > $SHUFFLED_TRAIN_OUTPUT_PATH
```
- Repeat for the validation data:

```
python dataset.py \
    --sentencepiece_path $SPM_PATH \
    --left-size $LEFT_SIZE \
    --right-size $RIGHT_SIZE \
    --output_path $VALIDATION_OUTPUT_PATH \
    $INPUT_DEV_FILE_PATHS
```
### Training

Something like:
```
python trainer.py \
    --sentencepiece_path=$vocab_path \
    --left_size=$left_size \
    --right_size=$right_size \
    --output_path=$out \
    --transformer_nlayers=$transformer_nlayers \
    --activation_type=$activation_type \
    --linear_nlayers=$linear_nlayers \
    --min-epochs=$min_epochs \
    --max-epochs=$max_epochs \
    --lr=$lr \
    --dropout=$dropout \
    --embed_size=$embed_size \
    --factor_embed_size=$factor_embed_size \
    --source_factors \
    --nhead=$nhead \
    --log_interval=$log_interval \
    --validation_interval=$validation_interval \
    --eos_weight=$eos_weight \
    --early_stopping=$early_stopping \
    --tb_dir=$LOGDIR \
    $train_path \
    $valid_path
```
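For orientation, the flags above suggest a context-window classifier: embed the window around each candidate split point, encode it with a small transformer (`--transformer_nlayers`, `--nhead`, `--embed_size`, `--dropout`), and classify the point as `<eos>` or `<mos>`. The following is a minimal PyTorch sketch of that shape, not the actual `trainer.py` model; all sizes are placeholders.

```python
import torch
import torch.nn as nn

# Minimal sketch of the classifier shape implied by the flags above;
# NOT the actual trainer.py model. All sizes are placeholders.
class ContextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_size=128, nhead=8,
                 nlayers=2, dropout=0.1, window=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_size, nhead=nhead,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=nlayers)
        # Two classes: <eos> (split here) versus <mos> (do not split).
        self.classify = nn.Linear(embed_size * window, 2)

    def forward(self, window_ids):               # (batch, window)
        hidden = self.encoder(self.embed(window_ids))
        return self.classify(hidden.flatten(1))  # (batch, 2)

model = ContextClassifier(vocab_size=8000)
logits = model(torch.randint(0, 8000, (32, 6)))  # batch of 32 windows
print(logits.shape)  # torch.Size([32, 2])
```

`--eos_weight` likewise suggests a class-weighted loss (for example, `nn.CrossEntropyLoss(weight=...)`) to offset the fact that most candidate points are not sentence ends; that too is an inference from the flag name, not a description of `trainer.py`.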
## Splitting with a Pre-Trained Model
- Expects a `model_path` (should probably change to a default in an expected folder location...)
- `ersatz` reads from either stdin or a file path (via `--input`).
- `ersatz` writes to either stdout or a file path (via `--output`).
- An alternate candidate set for splitting may be given using `--determiner_type`:
  - `multilingual` (default) is as described in the paper
  - `en` requires a space following punctuation
  - `all` treats a space between any two characters as a candidate
  - A custom determiner can be written using the `determiner.Split()` base class (see the sketch after this list)
- By default, expects raw sentences. Splitting a `.tsv` file is also supported:
  - `--text_ids` expects a comma-separated list of column indices to split
  - `--delim` changes the delimiter character (default is `\t`)
- Uses the GPU if available; to force the CPU, use `--cpu`
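As a rough illustration of the custom-determiner hook, the sketch below subclasses the base class named above. The import path, method name, and signature are all assumptions made for illustration; consult the actual `determiner.Split` interface in the ersatz source before writing one.

```python
# Hypothetical sketch of a custom determiner. The import path, method
# name, and signature are assumptions; check the real determiner.Split
# base class in the ersatz source for the actual API.
from ersatz.determiner import Split  # import path is an assumption

class PunctuationOnlySplit(Split):
    """Only treat a space after ., !, ?, or ; as a candidate split."""

    def __call__(self, left_context: str, right_context: str) -> bool:
        # Candidate only if the left context ends in terminal punctuation.
        return left_context.rstrip().endswith((".", "!", "?", ";"))
```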
### Example usage

Typical Python usage:

```
python split.py --input unsegmented.txt --output sentences.txt ersatz.model
```

std[in,out] usage:

```
cat unsegmented.txt | split.py ersatz.model > sentences.txt
```

To split a `.tsv` file:

```
cat unsegmented.tsv | split.py ersatz.model --text_ids 1 > sentences.txt
```
## Scoring a Model's Output

```
python score.py [gold_standard_file_path] [file_to_score]
```

(There are legacy arguments, but they're not used.)
## Changelog

- 1.0.0: original release