Syntax-Guided Controlled Generation of Paraphrases

Source code for TACL 2020 paper: Syntax-Guided Controlled Generation of Paraphrases

<p align="center"> <img align="center" src="https://github.com/malllabiisc/SGCP/blob/master/images/SGCP.png" alt="Image" height="420" > </p>

Dependencies

The code runs on Python 3; all required packages are listed in requirements.txt and are installed during Setup below.

Setup

To get the project's source code, clone the GitHub repository:

$ git clone https://github.com/malllabiisc/SGCP

Install virtualenv using the following command (optional):

$ [sudo] pip install virtualenv

Create and activate your virtual environment (optional):

$ virtualenv -p python3 venv
$ source venv/bin/activate

Install all the required packages:

$ pip install -r requirements.txt

Create essential folders in the repository using:

$ chmod a+x setup.sh
$ ./setup.sh

Resources

Dataset

Path: SGCP/data/<dataset-folder-name>.

A sample dataset folder might look like this:

data/QQPPos/<train/test/val>/<src.txt/tgt.txt/refs.txt/src.txt-corenlp-opti/tgt.txt-corenlp-opti/refs.txt-corenlp-opti>
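Since corresponding lines of src.txt, tgt.txt, and refs.txt are expected to belong to the same example, a quick sanity check on a dataset is to confirm that the files have matching line counts. A minimal sketch, assuming the QQPPos layout above:

$ wc -l data/QQPPos/test/src.txt data/QQPPos/test/tgt.txt data/QQPPos/test/refs.txt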

Pre-trained Models

Path: SGCP/Models/<dataset_Models>

Evaluation Essentials

Path: SGCP/src/evaluation/<apps/data/ParaphraseDetection>

This contains all the files needed to evaluate the model, including the paraphrase detection scoring models used for model-based evaluation.

Training the model

Generation and Evaluation

Custom Dataset Processing

Preprocess and parse the data using the following steps.

  1. Move the contents of your custom dataset into the data/ directory, with the files arranged as follows:

    • data
      • Custom_Dataset
        • train
          • src.txt
          • tgt.txt
        • val
          • src.txt
          • tgt.txt
          • ref.txt
        • test
          • src.txt
          • tgt.txt
          • ref.txt

    Here, src.txt contains the source sentences, tgt.txt contains the exemplars, and ref.txt contains the reference paraphrases.

  2. Construct a byte-pair codes file, which will be used to generate byte-pair encodings of the dataset. From the main directory of this repo, run:

     subword-nmt learn-bpe < data/Custom_Dataset/train/src.txt > data/Custom_Dataset/train/codes.txt

     Note: [Optional] Codes can also be learned from both src.txt and tgt.txt. To do so, first concatenate the two files and replace src.txt in the command above with the name of the concatenated file (see the sketch after this list).

  3. Parse the data files using Stanford CoreNLP. First, start a CoreNLP server by executing the following commands:

cd src/evaluation/apps/stanford-corenlp-full-2018-10-05
java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -preload tokenize,ssplit,pos,lemma,ner,parse -parse.model /edu/stanford/nlp/models/srparser/englishSR.ser.gz -status_port <PORT_NUMBER> -port <PORT_NUMBER> -timeout 15000
  4. Finally, run the parser on the text files:
cd <PATH_TO_THIS_REPO>
python -m src.utils.con_parser -infile data/Custom_Dataset/train/src.txt -codefile data/Custom_Dataset/train/codes.txt -port <PORT_NUMBER> -host localhost

Here <PORT_NUMBER> is the port on which the CoreNLP server from step 3 is running.

This will generate a file called src.txt-corenlp-opti in the train folder. Run the parser on all the other files as well, i.e. tgt.txt in the train folder; src.txt, tgt.txt, and ref.txt in the val folder; and likewise for the files in the test folder. A loop that automates this is sketched below.
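The following is a minimal bash sketch (not part of the repo) that automates the optional concatenation from step 2 and the per-file parsing from step 4. The scratch filename src_tgt.txt is hypothetical, and <PORT_NUMBER> must be replaced with the port of the CoreNLP server from step 3.

# [Optional, step 2] Learn BPE codes over sources and exemplars together.
# src_tgt.txt is a hypothetical scratch file.
cat data/Custom_Dataset/train/src.txt data/Custom_Dataset/train/tgt.txt > data/Custom_Dataset/train/src_tgt.txt
subword-nmt learn-bpe < data/Custom_Dataset/train/src_tgt.txt > data/Custom_Dataset/train/codes.txt

# [Step 4] Parse every data file; note that train/ has no ref.txt.
# Replace <PORT_NUMBER> with the port from step 3 before running.
for f in data/Custom_Dataset/train/{src,tgt}.txt \
         data/Custom_Dataset/{val,test}/{src,tgt,ref}.txt; do
    python -m src.utils.con_parser -infile "$f" \
        -codefile data/Custom_Dataset/train/codes.txt \
        -port <PORT_NUMBER> -host localhost
done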

Citing

Please cite the following paper if you use this code in your work.

@article{sgcp2020,
    author = {Kumar, Ashutosh and Ahuja, Kabir and Vadapalli, Raghuram and Talukdar, Partha},
    title = {Syntax-Guided Controlled Generation of Paraphrases},
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {8},
    pages = {330--345},
    year = {2020},
    doi = {10.1162/tacl\_a\_00318},
    url = {https://doi.org/10.1162/tacl_a_00318}
}

For any clarification, comments, or suggestions, please create an issue or contact ashutosh@iisc.ac.in.