GenCompareSum

This is the repository associated with the paper 'GenCompareSum: a hybrid unsupervised summarization method using salience'.

The method is as follows:

- (a) We split the document into sentences.
- (b) We combine these sentences into sections of several sentences.
- (c) We feed each section into the generative text model and generate several textual fragments per section.
- (d) We aggregate the generated fragments, removing redundant ones using n-gram blocking. Where aggregation occurs, we record a count of how many fragments were combined and use this count as a weighting going forwards. The highest-weighted textual fragments are then selected to guide the summary.
- (e) The similarity between each sentence from the source document and each selected textual fragment is calculated using BERTScore.
- (f) We create a similarity matrix from the scores calculated in the previous step. These are then summed over the textual fragments, weighted by the counts from step (d), to give a score per sentence.
- (g) The highest-scoring sentences are selected to form the summary.

<img src="./Images/GenCompareSum.png" width="800">
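As a rough illustration of steps (e)-(g), the sketch below scores each source sentence against the selected textual fragments with BERTScore and takes a weighted sum. It is a minimal sketch, not the repo's implementation: it calls the public `bert_score` package directly (rather than the adapted copy under `src/bert_score`), ignores n-gram blocking, and the function and variable names are illustrative only.

```python
# Minimal sketch of steps (e)-(g): score source sentences against the
# selected textual fragments and pick the top-scoring sentences.
import numpy as np
from bert_score import score

def rank_sentences(sentences, fragments, fragment_weights, num_sentences=9):
    # (e)/(f) Build a |fragments| x |sentences| similarity matrix of BERTScore F1 values.
    sim_matrix = np.zeros((len(fragments), len(sentences)))
    for i, fragment in enumerate(fragments):
        _, _, f1 = score(sentences, [fragment] * len(sentences),
                         model_type="bert-base-uncased", verbose=False)
        sim_matrix[i] = f1.cpu().numpy()
    # (f) Sum over fragments, weighted by the aggregation counts from step (d).
    sentence_scores = (np.array(fragment_weights)[:, None] * sim_matrix).sum(axis=0)
    # (g) Select the highest-scoring sentences, keeping original document order.
    top_idx = sorted(np.argsort(sentence_scores)[::-1][:num_sentences])
    return [sentences[i] for i in top_idx]
```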

Data

Sample data set

An example of how data should be formatted is included in ./data/preprocessed_pubmed_sample.csv. GenCompareSum.py expects to be pointed to a CSV file containing the columns described below.

In our example data, we have a column article_text, which contains the full articles split into sentences using the StanfordNLP software package, and a column short_article, which contains the same data but truncated at the end of the sentence which contains the 512th token.
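As a quick way to inspect the expected format, the snippet below loads the sample CSV with pandas. The column names follow the description above, but the exact encoding of sentences inside each cell may differ from what GenCompareSum.py expects, so treat this as illustrative only.

```python
# Illustrative only: inspect the sample data set with pandas.
import pandas as pd

df = pd.read_csv("./data/preprocessed_pubmed_sample.csv")
print(df.columns.tolist())               # expect columns such as 'article_text' and 'short_article'
print(df.loc[0, "article_text"][:500])   # first 500 characters of the first article
```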

To prepare the data sets used in the publication, follow this README.md.

Set Up

This repo requires python>=3.7. To install the dependencies for this repo:

pip install -r ./requirements.txt

To get pyrouge to work:

git clone https://github.com/andersjo/pyrouge.git ./src/rouge
pyrouge_set_rouge_path <PATH_TO_THIS_REPO>/src/rouge/tools/ROUGE-1.5.5
cd <PATH_TO_THIS_REPO>/src/rouge/tools/ROUGE-1.5.5/RELEASE-1.5.5/data/
rm WordNet-2.0.exc.db
./WordNet-2.0-Exceptions/buildExeptionDB.pl ./WordNet-2.0-Exceptions ./smart_common_words.txt ./WordNet-2.0.exc.db
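As an optional sanity check that ROUGE-1.5.5 is wired up correctly, something like the following can be run. It assumes the pyrouge Python package (which provides pyrouge_set_rouge_path) is installed via requirements.txt; the directories and filename patterns are placeholders for your own system and reference summaries.

```python
# Sanity check that ROUGE-1.5.5 can be invoked through pyrouge.
from pyrouge import Rouge155

r = Rouge155()
r.system_dir = "path/to/generated_summaries"       # placeholder
r.model_dir = "path/to/reference_summaries"        # placeholder
r.system_filename_pattern = r"(\d+)_summary.txt"   # placeholder pattern
r.model_filename_pattern = "#ID#_reference.txt"    # placeholder pattern

output = r.convert_and_evaluate()
print(r.output_to_dict(output))
```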

Note that we adapt the original BERTScore code slightly, and include it within our repo under src/bert_score.

Run GenCompareSum

This code can be run on CPU or GPU; however, it runs significantly slower on CPU.
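A quick way to confirm that a GPU is visible before setting --visible_device (assuming PyTorch, which the models used here rely on):

```python
# Check whether CUDA is available and how many devices are visible.
import torch
print(torch.cuda.is_available(), torch.cuda.device_count())
```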

To evaluate the method with ROUGE metrics, run:

python GenCompareSum.py \
    --num_generated_texts 10 \
    --block_n_gram_generated_texts 4 \
    --col_name article_text \
    --summary_len_metric sentences \
    --num_sentences 9 \
    --block_n_gram_sum 4 \
    --visible_device 0 \
    --texts_per_section 3 \
    --temperature 0.5 \
    --stride 4 \
    --gen_text_weights 1 \
    --data_path ./data/sample_preprocessed/pubmed.csv \
    --generative_model_path doc2query/S2ORC-t5-base-v1 \
    --similarity_model_name bert_score \
    --similarity_model_path bert-base-uncased \
    --inference_only False \
    --save_predictions False

The generative models used in our paper were:

The config params are as follows:

Comparative methods for summarization

This folder contains code to run the comparative methods which were referenced in the paper 'GenCompareSum: a hybrid unsupervised summarization method using salience'.

  1. BERTExtSum can be implemented directly.
  2. For ORACLE calculations, run:
python comparative_methods/oracle.py --data_csv ./data/sample_preprocessed/pubmed.csv --col_name article_text --num_sentences 9
  3. For other methods, run:
python ./comparative_methods/run_comparative_summary_method.py --data_csv ./data/sample_preprocessed/pubmed.csv --col_name article_text --num_sentences 9 --method <method>

method options include: