summarus

Abstractive and extractive summarization models, mostly for the Russian language, built on top of AllenNLP.

You can also check out the MBART-based Russian summarization model on Hugging Face: mbart_ru_sum_gazeta

Based on the following papers:

Contacts

Prerequisites

pip install -r requirements.txt

Commands

train.sh

Script for training a model, based on the AllenNLP 'train' command.

| Argument | Required | Description |
|----------|----------|-------------|
| -c | true | path to file with configuration |
| -s | true | path to directory where model will be saved |
| -t | true | path to train dataset |
| -v | true | path to val dataset |
| -r | false | recover from checkpoint |
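A typical invocation, using the same placeholder style as the prediction examples below (all paths are placeholders to be filled in by the user):

```shell
./train.sh -c <path_to_config> -s <path_to_model_dir> -t <path_to_train_dataset> -v <path_to_val_dataset>
```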

predict.sh

Script for model evaluation. The test dataset should have the same format as the train dataset.

| Argument | Required | Default | Description |
|----------|----------|---------|-------------|
| -t | true | | path to test dataset |
| -m | true | | path to tar.gz archive with model |
| -p | true | | name of Predictor |
| -c | false | 0 | CUDA device |
| -L | true | | language ("ru" or "en") |
| -b | false | 32 | size of a batch with test examples to run simultaneously |
| -M | false | | path to meteor.jar for Meteor metric |
| -T | false | | tokenize gold and predicted summaries before metrics calculation |
| -D | false | | save temporary files with gold and predicted summaries |

summarus.util.train_subword_model

Script for subword model training.

| Argument | Default | Description |
|----------|---------|-------------|
| --train-path | | path to train dataset |
| --model-path | | path to directory where generated subword model will be saved |
| --model-type | bpe | type of subword model, see sentencepiece |
| --vocab-size | 50000 | size of the resulting subword model vocabulary |
| --config-path | | path to file with configuration for DatasetReader (with parse_set) |
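A sketch invocation, assuming the script is run as a Python module (paths are placeholders):

```shell
python -m summarus.util.train_subword_model --train-path <path_to_train_dataset> --model-path <path_to_model_dir> --model-type bpe --vocab-size 50000
```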

Headline generation

Dataset splits:

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m ria_pgn_24kk.tar.gz -p subwords_summary -L ru 

Results

Train dataset: RIA, test dataset: RIA

| Model | R-1-f | R-2-f | R-L-f | BLEU |
|-------|-------|-------|-------|------|
| ria_copynet_10kk | 40.0 | 23.3 | 37.5 | - |
| ria_pgn_24kk | 42.3 | 25.1 | 39.6 | - |
| ria_mbart | 42.8 | 25.5 | 39.9 | - |
| First Sentence | 24.1 | 10.6 | 16.7 | - |
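The R-1-f, R-2-f, and R-L-f columns are ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. As a rough illustration of what these metrics measure (a simplified sketch, not the exact scorer used in these experiments), ROUGE-1 F1 over whitespace tokens can be computed as:

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between candidate and reference."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Clipped overlap: each token counted at most as often as it appears in the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("a b c", "a b d"))  # 2 of 3 tokens overlap on each side: F1 ≈ 0.667
```

ROUGE-2 applies the same idea to bigrams, and ROUGE-L uses the longest common subsequence instead of n-gram overlap.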

Train dataset: RIA, eval dataset: Lenta

| Model | R-1-f | R-2-f | R-L-f | BLEU |
|-------|-------|-------|-------|------|
| ria_copynet_10kk | 25.6 | 12.3 | 23.0 | - |
| ria_pgn_24kk | 26.4 | 12.3 | 24.0 | - |
| ria_mbart | 30.3 | 14.5 | 27.1 | - |
| First Sentence | 25.5 | 11.2 | 19.2 | - |

Summarization - CNN/DailyMail

Dataset splits:

Models:

Prediction script:

./predict.sh -t <path_to_test_dataset> -m cnndm_pgn_25kk.tar.gz -p words_summary -L en -R

Results:

| Model | R-1-f | R-2-f | R-L-f | METEOR | BLEU |
|-------|-------|-------|-------|--------|------|
| cnndm_pgn_25kk | 38.5 | 16.5 | 33.4 | 17.6 | - |

Summarization - Gazeta, a Russian news dataset

Models:

Prediction scripts:

./predict.sh -t <path_to_test_dataset> -m gazeta_pgn_7kk.tar.gz -p subwords_summary -L ru -T
./predict.sh -t <path_to_test_dataset> -m gazeta_summarunner_3kk.tar.gz -p subwords_summary_sentences -L ru -T

External models:

Results:

| Model | R-1-f | R-2-f | R-L-f | METEOR | BLEU |
|-------|-------|-------|-------|--------|------|
| gazeta_pgn_7kk | 29.4 | 12.7 | 24.6 | 21.2 | 9.0 |
| gazeta_pgn_7kk_cov | 29.8 | 12.8 | 25.4 | 22.1 | 10.1 |
| gazeta_pgn_25kk | 29.6 | 12.8 | 24.6 | 21.5 | 9.3 |
| gazeta_pgn_words_13kk | 29.4 | 12.6 | 24.4 | 20.9 | 8.9 |
| gazeta_summarunner_3kk | 31.6 | 13.7 | 27.1 | 26.0 | 11.5 |
| gazeta_mbart | 32.6 | 14.6 | 28.2 | 25.7 | 12.4 |
| gazeta_mbart_lower | 32.7 | 14.7 | 28.3 | 25.8 | 12.5 |

Demo

python demo/server.py --include-package summarus --model-dir <model_dir> --host <host> --port <port>

Citations

Headline generation (PGN):

@article{Gusev2019headlines,
    author={Gusev, I.O.},
    title={Importance of copying mechanism for news headline generation},
    journal={Komp'juternaja Lingvistika i Intellektual'nye Tehnologii},
    year={2019},
    volume={2019-May},
    number={18},
    pages={229--236}
}

Headline generation (transformers):

@InProceedings{Bukhtiyarov2020headlines,
    author={Bukhtiyarov, Alexey and Gusev, Ilya},
    title="Advances of Transformer-Based Models for News Headline Generation",
    booktitle="Artificial Intelligence and Natural Language",
    year="2020",
    publisher="Springer International Publishing",
    address="Cham",
    pages={54--61},
    isbn="978-3-030-59082-6",
    doi={10.1007/978-3-030-59082-6_4}
}

Summarization:

@InProceedings{Gusev2020gazeta,
    author="Gusev, Ilya",
    title="Dataset for Automatic Summarization of Russian News",
    booktitle="Artificial Intelligence and Natural Language",
    year="2020",
    publisher="Springer International Publishing",
    address="Cham",
    pages="122--134",
    isbn="978-3-030-59082-6",
    doi={10.1007/978-3-030-59082-6_9}
}