Generating Natural Language Proofs with Verifier-Guided Search

Code for the paper:

Generating Natural Language Proofs with Verifier-Guided Search
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Kaiyu Yang, Jia Deng, and Danqi Chen

Requirements

  1. Download and install Miniconda Python 3 (Anaconda should also work).
  2. Clone this repo and cd into its root.
  3. Install Python dependencies: conda env create -f nlproofs.yaml. You may need to edit nlproofs.yaml to match your system, e.g., to use a different CUDA version. If the installation command fails, you can instead install the packages listed in nlproofs.yaml manually, in whatever way works for you.
  4. Activate the conda environment: conda activate nlproofs, and prepend the root of this repo to the PYTHONPATH environment variable.
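
As a minimal sketch of the PYTHONPATH step above, the following helper builds an environment with the repo root prepended, for example to pass to a subprocess. The function name `env_with_repo_on_pythonpath` is hypothetical and not part of this repo.

```python
import os
from pathlib import Path


def env_with_repo_on_pythonpath(repo_root, base_env=None):
    """Return a copy of the environment with repo_root prepended to PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    root = str(Path(repo_root).resolve())
    existing = env.get("PYTHONPATH", "")
    # os.pathsep is ":" on Linux/macOS and ";" on Windows.
    env["PYTHONPATH"] = root + os.pathsep + existing if existing else root
    return env


if __name__ == "__main__":
    # "." assumes this is run from the root of the cloned repo.
    print(env_with_repo_on_pythonpath(".")["PYTHONPATH"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the training scripts from another Python process.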

Data Preprocessing

  1. Download the v3_May6_2022 version of EntailmentBank (MD5: 9cb91896325157cee1f35616be0be179) and unzip it as ./data/entailment_trees_emnlp2021_data_v3/.
  2. Download the OWA version of RuleTaker (MD5: bf490364bca241bb5ff9f0ab0c78b71a) and unzip it as ./data/proofwriter-dataset-V2020.12.3/.
  3. Run python check_data.py to check.
  4. Run python preprocess_ruletaker.py to preprocess the RuleTaker dataset.
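
The MD5 digests listed above can be checked before unzipping. The sketch below assumes the digests refer to the downloaded archive files. The helper names and the archive filename in the usage note are hypothetical. It is independent of check_data.py, which remains the repo's own check.

```python
import hashlib

# MD5 digests quoted in the download steps above.
EXPECTED_MD5 = {
    "entailmentbank": "9cb91896325157cee1f35616be0be179",
    "ruletaker": "bf490364bca241bb5ff9f0ab0c78b71a",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path, expected_md5):
    """Return True if the file's MD5 digest equals expected_md5 (case-insensitive)."""
    return md5_of_file(path) == expected_md5.lower()
```

For example, `md5_matches("entailmentbank.zip", EXPECTED_MD5["entailmentbank"])` should return True for an intact download, where `entailmentbank.zip` stands in for whatever name the archive was saved under.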

EntailmentBank Experiments

We use Lightning CLI to create scripts for training, validation, and testing: prover/main.py and verifier/main.py for the prover and the verifier, respectively. They take arguments from the command line as well as YAML configuration files. Please run python main.py --help or refer to the documentation of Lightning CLI for details.

We provide YAML files for our hyperparameters and experimental settings in ./prover/ and ./verifier/. We ran all experiments on a single NVIDIA A6000 GPU with 48GB memory. To run them on GPUs with less memory, you may have to change batch_size and accumulate_grad_batches. On newer GPUs, --trainer.precision bf16 may lead to significant speedup and memory savings. We have not tested these features thoroughly, so please use them at your own discretion. Note that pretrained T5 models do not play well with fp16.
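
When lowering batch_size to fit a smaller GPU, the effective batch size (batch_size × accumulate_grad_batches × number of devices) can be held roughly constant by raising accumulate_grad_batches. The sketch below only does this arithmetic. The function name is hypothetical and the numbers in the example are illustrative, not values taken from the provided YAML files.

```python
import math


def accumulation_steps(target_effective_batch, batch_size, num_devices=1):
    """Smallest accumulate_grad_batches that reaches the target effective batch size."""
    if target_effective_batch <= 0 or batch_size <= 0 or num_devices <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(target_effective_batch / (batch_size * num_devices))


if __name__ == "__main__":
    # Illustrative numbers: keep an effective batch of 32 with 8 examples per step.
    print(accumulation_steps(target_effective_batch=32, batch_size=8))  # prints 4
```

When the target is not divisible by the per-step batch, the result rounds up, so the effective batch size ends up slightly larger than the target.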

Training

Prover

First, cd into ./prover/. Then run python main.py fit --help to see how to use the training script. Below are example commands used in our experiments:

python main.py fit --config cli_task1_single_shot_t5-large.yaml  # Train a single-shot prover on Task 1 of EntailmentBank.
python main.py fit --config cli_task1_stepwise_t5-large.yaml     # Train a stepwise prover on Task 1 of EntailmentBank.
python main.py fit --config cli_task2_single_shot_t5-large.yaml  # Train a single-shot prover on Task 2 of EntailmentBank.
python main.py fit --config cli_task2_stepwise_t5-large.yaml     # Train a stepwise prover on Task 2 of EntailmentBank.

The training script saves hyperparameters, model checkpoints, and other information to ./prover/lightning_logs/EXP_ID/, where EXP_ID is an arbitrary experiment ID that will be printed by the training script.

Verifier

First, cd into ./verifier/. Then run python main.py fit --help to see how to use the training script. Below are example commands used in our experiments:

python main.py fit --config cli_entailmentbank_task1.yaml  # Train a verifier on Task 1 of EntailmentBank.
python main.py fit --config cli_entailmentbank_task2.yaml  # Train a verifier on Task 2 of EntailmentBank.

The training script saves hyperparameters, model checkpoints, and other information to ./verifier/lightning_logs/EXP_ID/.
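
To pass a checkpoint to the validation and testing commands, it helps to locate the newest one under a log directory. The following is a sketch that assumes checkpoints are stored as `*.ckpt` files somewhere below `lightning_logs/EXP_ID/`. The exact subdirectory layout is not stated here, so the search is recursive. The function name is hypothetical.

```python
from pathlib import Path


def latest_checkpoint(log_dir):
    """Return the most recently modified *.ckpt file under log_dir, or None."""
    ckpts = list(Path(log_dir).rglob("*.ckpt"))
    if not ckpts:
        return None
    return max(ckpts, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    # "./lightning_logs" assumes this is run from ./prover/ or ./verifier/.
    print(latest_checkpoint("./lightning_logs"))
```

Note that the most recently written checkpoint is not necessarily the best one on validation data. If the run keeps several checkpoints, pick the one you intend to evaluate explicitly.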

Validation and Testing

Once training completes, we use the model checkpoint to predict on the validation and test data. cd into ./prover/ and run python main.py validate --help and python main.py test --help to see how to use the script for validation and testing. Assuming we have a prover checkpoint PATH_TO_PROVER_CKPT and a verifier checkpoint PATH_TO_VERIFIER_CKPT, below are example commands:

python main.py validate --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT                                                                                                     # Validate the stepwise prover without verifier-guided search on Task 2 of EntailmentBank.
python main.py validate --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true   # Validate NLProofS (stepwise prover + verifier-guided search).
python main.py validate --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 1.0 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true   # Validate NLProofS w/o prover score.
python main.py test --config cli_task2_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true       # Test NLProofS (stepwise prover + verifier-guided search).
python main.py test --config cli_task1_single_shot_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT                                                                                                   # Test the single-shot prover on Task 1 of EntailmentBank.
python main.py test --config cli_task2_single_shot_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --data.path_test ../data/entailment_trees_emnlp2021_data_v3/dataset/task_3/test.jsonl  # Test the single-shot prover (trained on Task 2) on Task 3 of EntailmentBank.
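
The comments on the commands above suggest that --model.verifier_weight trades off the verifier's score against the prover's score: 0.5 uses both, and 1.0 drops the prover score. One reading consistent with that, stated here as an assumption and not as the repo's actual implementation, is a linear interpolation of the two scores. The function name is hypothetical.

```python
def combined_step_score(prover_score, verifier_score, verifier_weight):
    """Linearly interpolate between prover and verifier scores (assumed form)."""
    if not 0.0 <= verifier_weight <= 1.0:
        raise ValueError("verifier_weight must be in [0, 1]")
    return verifier_weight * verifier_score + (1.0 - verifier_weight) * prover_score


if __name__ == "__main__":
    # Illustrative scores only.
    for w in (0.0, 0.5, 1.0):
        print(w, combined_step_score(prover_score=0.8, verifier_score=0.4, verifier_weight=w))
```

Under this reading, a weight of 0.0 reduces to the prover score alone and 1.0 to the verifier score alone. The code in ./prover/ is the authoritative reference for how the scores are actually combined.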

Validation and testing results are saved as ./prover/lightning_logs/EXP_ID/results_val.tsv and ./prover/lightning_logs/EXP_ID/results_test.tsv. They serve as input to EntailmentBank's official evaluation code, which calculates the evaluation metrics.

Test Results and Model Checkpoints

Task 1

| Model | Leaves-F1 | Leaves-AllCorrect | Steps-F1 | Steps-AllCorrect | Intermediates-F1 | Intermediates-AllCorrect | Overall-AllCorrect | Model checkpoints | Validation predictions | Test predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NLProofS | 97.6 | 90.0 | 54.8 | 41.8 | 72.0 | 39.7 | 38.2 | prover, verifier | results_val.tsv | results_test.tsv |
| Stepwise prover | 98.8 | 98.5 | 54.8 | 41.5 | 71.9 | 38.5 | 36.8 | The prover above | results_val.tsv | results_test.tsv |
| Single-shot prover | 98.2 | 82.7 | 51.8 | 40.9 | 66.7 | 36.5 | 34.7 | prover | results_val.tsv | results_test.tsv |

Task 2

| Model | Leaves-F1 | Leaves-AllCorrect | Steps-F1 | Steps-AllCorrect | Intermediates-F1 | Intermediates-AllCorrect | Overall-AllCorrect | Model checkpoints | Validation predictions | Test predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NLProofS | 90.3 | 60.6 | 48.6 | 35.6 | 70.3 | 39.4 | 34.4 | prover, verifier | results_val.tsv | results_test.tsv |
| Stepwise prover | 90.3 | 57.1 | 48.6 | 35.6 | 70.1 | 38.5 | 33.8 | The prover above | results_val.tsv | results_test.tsv |
| Single-shot prover | 85.9 | 44.7 | 41.3 | 29.1 | 62.5 | 31.5 | 27.7 | prover | results_val.tsv | results_test.tsv |

Task 3

Results on Task 3 are produced by evaluating Task 2 models zero-shot on Task 3 data (by changing --data.path_val and --data.path_test).

| Model | Leaves-F1 | Leaves-AllCorrect | Steps-F1 | Steps-AllCorrect | Intermediates-F1 | Intermediates-AllCorrect | Overall-AllCorrect | Model checkpoints | Validation predictions | Test predictions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NLProofS | 43.9 | 9.1 | 10.6 | 6.8 | 42.4 | 15.9 | 6.8 | Same as Task 2 | results_val.tsv | results_test.tsv |
| Stepwise prover | 42.8 | 7.4 | 9.3 | 5.9 | 42.1 | 15.0 | 5.9 | Same as Task 2 | results_val.tsv | results_test.tsv |
| Single-shot prover | 40.5 | 4.4 | 9.1 | 3.8 | 35.3 | 7.9 | 3.8 | Same as Task 2 | results_val.tsv | results_test.tsv |

Students in Princeton's COS484 (Emre Onal, Max Gonzalez Saez-Diez, and Maria Khartchenko) have conducted a comprehensive ablation study and improved our results on EntailmentBank (code available here).

RuleTaker Experiments

Training

Prover

Training on RuleTaker is similar to training on EntailmentBank but with different configuration files. Run the following commands in ./prover/:

python main.py fit --config cli_ruletaker_single_shot_t5-large.yaml  # Train a single-shot prover on D0–D3 of RuleTaker (OWA).
python main.py fit --config cli_ruletaker_stepwise_t5-large.yaml     # Train a stepwise prover on D0–D3 of RuleTaker (OWA).

Verifier

Training the verifier is also similar. Run the following commands in ./verifier/:

python main.py fit --config cli_ruletaker.yaml  # Train a verifier on D0–D3 of RuleTaker (OWA).

Validation and Testing

cd into ./prover/. Assume we have a prover checkpoint PATH_TO_PROVER_CKPT and a verifier checkpoint PATH_TO_VERIFIER_CKPT.

python main.py validate --config cli_ruletaker_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true --trainer.limit_val_batches 1.0  # Validate NLProofS on D0–D3 of RuleTaker (OWA).
python main.py test --config cli_ruletaker_stepwise_t5-large.yaml --ckpt_path PATH_TO_PROVER_CKPT --model.verifier_weight 0.5 --model.verifier_ckpt PATH_TO_VERIFIER_CKPT --model.proof_search true  # Test NLProofS on D0–D3 of RuleTaker (OWA).

Note the --trainer.limit_val_batches 1.0 above. By default, we use only 200 batches for RuleTaker validation (see ./prover/cli_ruletaker_stepwise_t5-large.yaml and ./prover/cli_ruletaker_single_shot_t5-large.yaml), but here we want to use all batches.
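
In Lightning, limit_val_batches is interpreted as a number of batches when it is an integer and as a fraction of the validation set when it is a float, which is why 1.0 (and not 1) is used above. The following is a sketch of that interpretation, not Lightning's own code. The function name and the total of 1000 batches in the example are hypothetical.

```python
def num_val_batches(limit_val_batches, total_batches):
    """Approximate how many validation batches run for a given limit."""
    if isinstance(limit_val_batches, bool):
        raise TypeError("limit_val_batches must be an int or a float")
    if isinstance(limit_val_batches, int):
        # An integer caps the number of batches.
        return min(limit_val_batches, total_batches)
    if isinstance(limit_val_batches, float):
        # A float in [0, 1] selects a fraction of all batches.
        if not 0.0 <= limit_val_batches <= 1.0:
            raise ValueError("a float limit must be in [0, 1]")
        return int(total_batches * limit_val_batches)
    raise TypeError("limit_val_batches must be an int or a float")


if __name__ == "__main__":
    print(num_val_batches(200, 1000))  # the default of 200 batches mentioned above
    print(num_val_batches(1.0, 1000))  # the full validation set
    print(num_val_batches(1, 1000))    # integer 1 means a single batch
```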

Validation and testing results are saved as ./prover/lightning_logs/EXP_ID/results_val.json and ./prover/lightning_logs/EXP_ID/results_test.json. Run the following command for final evaluation:

python evaluate.py ruletaker --path-val PATH_TO_VAL_RESULTS --path-test PATH_TO_TEST_RESULTS

Test Results and Model Checkpoints

| Model | Answer accuracy | Proof accuracy | Model checkpoints | Validation predictions | Test predictions |
| --- | --- | --- | --- | --- | --- |
| NLProofS | 99.3 | 99.2 | prover, verifier | results_val.json | results_test.json |
| Stepwise prover | 68.7 | 91.3 | The prover above | results_val.json | results_test.json |
| Single-shot prover | 56.3 | 72.6 | prover | results_val.json | results_test.json |

Bugs or Questions

If you have any questions related to the code or the paper, feel free to email Kaiyu. If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please describe the problem in detail so we can help you better and faster!

Citation

@inproceedings{yang2022nlproofs,
  title={Generating Natural Language Proofs with Verifier-Guided Search},
  author={Yang, Kaiyu and Deng, Jia and Chen, Danqi},
  booktitle={Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2022}
}