
ConCoRD: Consistency Correction through Relation Detection

This repository contains a high-level implementation of ConCoRD, the system proposed in the EMNLP 2022 paper Enhancing Self-Consistency and Performance of Pretrained Language Models with NLI, as well as the steps to reproduce the results in the paper. See the project website for an overview of what ConCoRD does and how it works.

Data

ConCoRD doesn't perform training or fine-tuning, so it doesn't use any training data. However, it does have hyperparameters, so we provide the small validation sets used for hyperparameter tuning in addition to the datasets used for evaluation in this Google Drive folder.

Pre-trained NLI Models

ConCoRD uses off-the-shelf NLI models to perform relation detection and ultimately enhance model self-consistency & accuracy. We use the following NLI models, all from the wonderful HuggingFace library:
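Regardless of which checkpoint is chosen, relation detection boils down to scoring a premise-hypothesis pair with an NLI classifier. Here is a minimal sketch using the transformers library and one of the checkpoints referenced later in this README; the label order in the comment is the one documented for that checkpoint, and other checkpoints may order their classes differently:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# NLI checkpoint also used in the commands below; other checkpoints work the same way.
name = "ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

premise = "A poodle is a dog."
hypothesis = "A poodle is an animal."
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# For this checkpoint the classes are (entailment, neutral, contradiction).
entailment, neutral, contradiction = probs.tolist()
print(f"entailment={entailment:.3f}  neutral={neutral:.3f}  contradiction={contradiction:.3f}")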

Closed-Book Question-Answering (Section 4.1)

The BeliefBank dataset contains:

QA models evaluated include:

Since ConCoRD does not modify the QA or NLI models, for efficiency we cache the inference results from the models on BeliefBank data. The following sections walk through our full pipeline for generating results, but we have also uploaded our cached inference results to the Drive folder if you would like to directly experiment with those instead. All file paths are given relative to the top-level nli-consistency/ directory.

Preprocess BeliefBank

Preprocess calibration and silver facts by using pre-written templates to create question and answer pairs.

python cbqa/preprocess.py -f data/cbqa/beliefbank-data-sep2021/calibration_facts.json -o {output file path}

Repeat for silver facts.

Cached file paths: data/cbqa/calibration_facts_preprocessed_by_entity.json, data/cbqa/silver_facts_preprocessed_by_entity.json
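For intuition, the templating step amounts to turning each BeliefBank fact into a yes/no question. The sketch below is illustrative only; the real fact format and templates used by cbqa/preprocess.py may differ:

def fact_to_qa(entity: str, predicate: str, truth: str):
    # Illustrative BeliefBank-style predicate such as "IsA,dog".
    relation, obj = predicate.split(",")
    templates = {
        "IsA": "Is a {entity} a {obj}?",
        "HasPart": "Does a {entity} have a {obj}?",
    }
    question = templates[relation].format(entity=entity, obj=obj)
    return question, truth  # truth is "yes" or "no"

print(fact_to_qa("poodle", "IsA,dog", "yes"))  # ('Is a poodle a dog?', 'yes')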

QA Inference

For each of Macaw large and Macaw 3B, generate a cache of QA results for both the preprocessed calibration facts and the preprocessed silver facts.

For example, for Macaw large and calibration facts:

python -m cbqa.qa_score_dataset -m allenai/macaw-large -f data/cbqa/calibration_facts_preprocessed_by_entity.json -o {output file path}

Cached QA results are under data/cbqa/qa-cache
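For reference, querying Macaw directly with transformers looks roughly like the following, using the slot-based prompt format from the Macaw repository; the cbqa.qa_score_dataset script presumably also records the model's confidence in each answer, which ConCoRD needs downstream:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "allenai/macaw-large"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name).eval()

# Macaw's slot-based prompt: ask for an answer given a question.
prompt = "$answer$ ; $question$ = Is a poodle a dog?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # e.g. "$answer$ = yes"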

NLI Inference

For each of the NLI models, run NLI inference between question-answer pairs.

For example, for RoBERTa large ANLI:

python -m cbqa.nli_score_dataset -m ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli -f data/cbqa/qa-cache/macaw-large/calibration-facts-qa-scored.json -o {output file path}

Cached NLI results are under data/cbqa/nli-cache

Hyperparameter Tuning (Appendix H.1.1)

Use the cached QA and NLI results on the calibration facts to tune the MaxSAT solver's hyperparameters with hyperopt. Each QA-NLI model combination, along with each QA-oracle combination, is evaluated. Results are stored in files under cbqa/tuned_hparams, where you can also find our original runs.

For example, to tune Macaw large with RoBERTa large ANLI:

python -m cbqa.main -m hparam -qa allenai/macaw-large --qa_scores_cached_path data/cbqa/qa-cache/macaw-large/calibration-facts-qa-scored.json -nli ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli --nli_scores_cached_path data/cbqa/nli-cache/roberta-large-snli/nli-scored-calibration-facts.csv
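For intuition about the tuning loop, here is a minimal hyperopt sketch; it is not the repository's code, and concord_f1 is a hypothetical stand-in for a function that runs ConCoRD on the cached calibration-set scores with the given hyperparameters and returns the resulting F1:

from hyperopt import Trials, fmin, hp, tpe

def concord_f1(lam: float, beta: float) -> float:
    # Hypothetical placeholder: in practice this would run ConCoRD's MaxSAT
    # inference on the cached calibration-fact QA/NLI scores and score the output.
    return 1.0 - (lam - 0.6) ** 2 - (beta - 0.4) ** 2

def objective(params):
    return -concord_f1(params["lam"], params["beta"])  # hyperopt minimizes

space = {
    "lam": hp.uniform("lam", 0.0, 1.0),   # trade-off between model beliefs and NLI relations
    "beta": hp.uniform("beta", 0.0, 1.0), # the paper's other tuned hyperparameter
}

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=100, trials=trials)
print(best)  # e.g. {'beta': 0.41, 'lam': 0.63}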

Inference

Let's put it all together.

Evaluate each QA-NLI model combination using tuned hyperparameters on the silver facts.

For example, to evaluate Macaw large with RoBERTa large ANLI:

python -m cbqa.main -qa allenai/macaw-large --qa_scores_cached_path data/cbqa/qa-cache/macaw-large/silver-facts-qa-scored.json -nli ynie/roberta-large-snli_mnli_fever_anli_R1_R2_R3-nli --nli_scores_cached_path data/cbqa/nli-cache/roberta-large-snli/nli-scored-silver-facts.csv -v 

The -v flag enables verbose output showing exactly which beliefs were flipped and which were left untouched.

The oracle (gold constraints used as the NLI relations) can be run on Macaw large as follows:

python -m cbqa.main -qa allenai/macaw-large --qa_scores_cached_path data/cbqa/qa-cache/macaw-large/silver-facts-qa-scored.json --oracle -v

By using the cached QA and NLI results we included under data/cbqa, you can reproduce the numbers we report in Tables 1, 6, and 9 in the paper.
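For intuition about what cbqa.main is solving under the hood, here is an illustrative MaxSAT sketch of the general technique, assuming the python-sat package; the repository's actual formulation (the paper's λ weighting and entailment correction) is more involved:

from pysat.examples.rc2 import RC2
from pysat.formula import WCNF

wcnf = WCNF()
# Boolean variables 1 and 2 stand for two yes/no beliefs.
# Soft clauses encode the QA model's (integer-scaled) confidence in each answer.
wcnf.append([1], weight=90)    # fairly confident belief 1 is true
wcnf.append([-2], weight=60)   # leans toward belief 2 being false

# The NLI model detects that belief 1 entails belief 2, i.e. 1 -> 2,
# encoded as the clause (NOT 1 OR 2) and weighted by the NLI confidence.
wcnf.append([-1, 2], weight=80)

# RC2 returns the assignment minimizing the total weight of violated soft clauses.
solver = RC2(wcnf)
print(solver.compute())  # [1, 2]: keep belief 1 and flip belief 2 to "yes"
solver.delete()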

Ablations

Relation Type

Add either --ablation_keep_relation contradiction or --ablation_keep_relation entailment when running python -m cbqa.main (as shown above) for hyperparameter tuning and inference. Our results on the best NLI model for CBQA (RoBERTa ANLI) are reported in Table 5.

Entailment Correction

Pass --disable_ec when running python -m cbqa.main (as shown above) for hyperparameter tuning and inference. Our results on the best NLI model for CBQA (RoBERTa ANLI) are reported in Table 8.

Visual Question-Answering (Section 4.2)

This experiment evaluates ConCoRD on related questions from ConVQA about images from the Visual Genome.

QA models evaluated include:

Parameters (e.g., paths to datasets, CPU/GPU, etc.) are set by variables within each notebook. Please make sure all paths are set appropriately for your environment in the sections marked as:

### INSTRUCTION FOR USERS : INDICATE APPROPRIATE PATH
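Such a marked section is typically just a cell that assigns a few variables, roughly like the following; all names and paths here are illustrative, not the notebooks' actual ones:

# Illustrative example only -- the actual variable names live in each notebook
# beneath the marker shown above.
DATA_DIR = "/path/to/visual_genome"             # Visual Genome images and metadata
CONVQA_PATH = "/path/to/convqa/questions.json"  # ConVQA question file (hypothetical name)
OUTPUT_DIR = "/path/to/outputs"                 # where predictions/caches are written
DEVICE = "cuda"                                 # or "cpu"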

QA Inference

For LXMERT, in lieu of the tokenizer provided by HuggingFace, we use the token-to-text mapping from the LXMERT GitHub repository.

The QA conversion model that we use has a checkpoint available as a cached model, and the cached data listed throughout this section are available online as well.

Use the notebook vg-data-selection.ipynb to sample images and questions from ConVQA for the 'train' set.

QA inference is then performed in the following notebooks:

Cached data from the data sampling and QA inference are available on Google Drive:

NLI Inference

Evaluate the train/test set with the various NLI models.

Within the first_run directory, evaluate using the ANLI model with:

Within the second_run directory, evaluate using the MNLI and XXLARGE models with:

Cached data from NLI inference are available on Google Drive:

Hyperparameter tuning

Tune the hyperparameters on the train set, searching over the choice of NLI model, the use of entailment correction, and the λ and β values.

The main script that performs the hyperparameter optimization is visual_tune_mod.py.

Here is an example of running visual_tune_mod.py:

python3 visual_tune_mod.py -f vilt-run-train-10000im-3pred-40token-1seed_predictions_nli-mnli.json -o vilt-table6-mnli-nwe.trials -t 100 > vilt-table6-mnli-nwe.log

The optimal hyperparameters were manually noted and used for the next (final) step.

ConCoRD Evaluation

Evaluate on the test set using the hyperparameters determined in the previous (tuning) step.

The first main cell in the notebook <a href="./vqa/4-final-eval/20221019 vqa solve only test with opt_with timeout counter_with ablation and perfect consistency.ipynb">20221019 vqa solve only test with opt_with timeout counter_with ablation and perfect consistency.ipynb</a> contains the function for the final evaluation on the test set based on given hyperparameters.

The subsequent four cells contain the outputs for the main results in Section 4.2.

Test-time Information Injection (Section 4.3)

QA Models

Before you start

In the semantic_filtering directory:

mkdir hyperparam_search
mkdir eval_results

Run Base Results

export CACHE_DIR=<directory where you want to store cached datasets, this is for huggingface caching>
export STORE_DIR=<your root directory for downloaded data files>/nq/
python3 eval_retrieve.py --mode=base --split={test, val} --model={t5-small, t5-large, t5-3b} --cache_dir=$CACHE_DIR --store_dir=$STORE_DIR

These should give you the baseline results reported in Section 4.3.

Run Oracle Results

Our intermediate data files are stored under the name cache_{val, test}_0.8_4_t5-{small, large, 3b}.jsonl. 0.8 and 4 correspond to the temperature and the number of responses we asked the QA model to generate, respectively.
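For reference, sampling multiple candidate answers at a given temperature with transformers looks roughly like this; the base t5-small checkpoint, the prompt format, and all decoding settings other than the temperature and sample count are placeholders, not necessarily what eval_retrieve.py uses:

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

# Placeholder prompt; the actual input format is defined by eval_retrieve.py.
inputs = tokenizer("question: who wrote the declaration of independence?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.8,         # matches the 0.8 in the cache file names above
    num_return_sequences=4,  # matches the 4 in the cache file names above
    max_new_tokens=32,
)
for candidate in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(candidate)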

To obtain the oracle results (upper bound of our results), run the following:

export CACHE_ID=<path to the intermediate data file of choice>
export RESULT_FILE=<filename for your result file>
python3 eval_answers.py --cache_id=$CACHE_ID --result_file=$RESULT_FILE

Run ConCoRD Results

export CACHE_DIR=<directory where you want to store cached datasets, this is for huggingface caching>
export STORE_DIR=<your root directory for downloaded data files>/nq/
python3 eval_retrieve.py --mode=gold --split={test, val} --model={t5-small, t5-large, t5-3b} --cache_dir=$CACHE_DIR --store_dir=$STORE_DIR

Run Ablations on Relation Types

Add the entail_only flag to the commands above for entailment-only results, or the contradiction_only flag for contradiction-only results.

Run Hyperparameter Search

export CACHE_DIR=<directory where you want to store cached datasets, this is for huggingface caching>
export STORE_DIR=<your root directory for downloaded data files>/nq/
python3 eval_retrieve.py --model={t5-small, t5-large, t5-3b} --cache_dir=$CACHE_DIR --store_dir=$STORE_DIR

The hyperparameter search might take 3 hours or longer, depending on the amount of compute available. The results are printed, and can also be found in the hyperparam_search directory.

BibTeX

If ConCoRD is useful for your own research, you can cite our work with the following BibTeX entry:

@inproceedings{mitchell2022enhancing,
    title={Enhancing Self-Consistency and Performance of
        Pretrained Language Models with NLI},
    author={Mitchell, Eric and Noh, Joseph J. and Li, Siyan and
            Armstrong, William S. and Agarwal, Ananth and
            Liu, Patrick and Finn, Chelsea and Manning, Christopher D.},
    booktitle={Proceedings of the 2022 Conference on Empirical
            Methods in Natural Language Processing (EMNLP)},
    url={https://ericmitchell.ai/concord.pdf},
    year={2022},
    publisher={Association for Computational Linguistics}
}