FLD-Prover

This repository includes the code to train and evaluate language models on FLD corpora.

See the entry-point repository for an overview of the whole FLD project.

Releases (READ CAREFULLY to determine which branch suits you)

Other Framework

FLD training is also accessible through a logical reasoning framework called LogiTorch (LogiTorch/logitorch). Specifically, LogiTorch enables the training of an "all-at-once" prover, which generates an entire logical proof in a single pass. This prover differs from the original stepwise prover used in the paper and delivers slightly better performance.

Installation

The code has been tested on Python 3.8.5.

$ pip install -r ./requirements.txt
$ git clone https://github.com/hitachi-nlp/FLD-task.git && pip install -e ./FLD-task
$ export PYTHONPATH=`pwd -P`:$PYTHONPATH
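
To quickly check that the corpus is reachable before training, you can load a few examples directly from the Hub. This is only a sanity-check sketch; it assumes the 🤗 datasets library is available in your environment (the training script below loads the corpus the same way):

    # Sanity-check sketch: load one FLD example from the Hugging Face Hub.
    # Assumes the `datasets` library is installed.
    from datasets import load_dataset

    # "hitachi-nlp/FLD.v2" with the "default" config is FLD (FLD.3); use "star" for FLD★ (FLD.4).
    fld = load_dataset("hitachi-nlp/FLD.v2", "default", split="train")
    print(fld.column_names)  # the fields available to the prover
    print(fld[0])            # one raw training instance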

How to train a prover

  1. Run the training script. We use the FLD (FLD.3 in the paper) corpus hosted on the 🤗 Hugging Face Hub:

    $ python\
        ./run_prover.py\
        --dataset_name hitachi-nlp/FLD.v2\
        --dataset_config_name default\
        --output_dir outputs/\
        --logging_dir outputs/tensorboard/\
        --file_type json\
        --predict_with_generate True\
        --remove_unused_columns False\
        --do_train True\
        --do_eval True\
        --do_predict False\
        --seed 0\
        --max_grad_norm 0.5\
        --max_steps 20000\
        --gradient_accumulation_steps 16\
        --max_eval_samples 500\
        --proof_sampling stepwise\
        --learning_rate 0.0001\
        --warmup_steps 1000\
        --model_name_or_path t5-base\
        --source_prefix "Solve FLD task: "\
        --generation_num_beams 10\
        --generation_top_k 10\
        --generation_max_proof_steps 20\
        --max_source_length 1700\
        --max_target_length 100\
        --logging_strategy steps\
        --logging_steps 25\
        --overwrite_output_dir True\
        --log_generation True\
        --sample_negative_proof True\
        --per_device_train_batch_size 1\
        --per_device_eval_batch_size 1\
        --dataloader_num_workers 0    \
        --log_examples True\
        --evaluation_strategy steps\
        --save_strategy steps \
        --max_predict_samples 1000\
        --eval_steps 5000\
        --tokenizer_padding longest
    

    or, if you want to use FLD★ (FLD.4 in the paper), specify --dataset_config_name star.

    If you have the datasets on your local filesystem, replace the --dataset_name option with the following (see the sketch after these steps for one way to export the Hub corpus to such files):

        --train_file ./data/FLD.v2/FLD.v2/train.jsonl\
        --validation_file ./data/FLD.v2/FLD.v2/valid.jsonl\
        --test_file ./data/FLD.v2/FLD.v2/test.jsonl\
    
  2. Check the results using tensorboard:

    $ tensorboard --port 6006 --logdir ./outputs/tensorboard/
    
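If you prefer the local-file options from step 1 but only have the Hub version of the corpus, the following sketch shows one possible way to export it to JSONL files. The paths and split names (train / validation / test) are assumptions, and whether the exported files match exactly what run_prover.py expects with --file_type json has not been verified here:

    # Illustrative sketch (not part of this repository): export the Hub corpus to
    # local JSONL files usable with --train_file / --validation_file / --test_file.
    import os
    from datasets import load_dataset

    fld = load_dataset("hitachi-nlp/FLD.v2", "default")  # or "star" for FLD★ (FLD.4)
    for split, path in [
        ("train", "./data/FLD.v2/FLD.v2/train.jsonl"),
        ("validation", "./data/FLD.v2/FLD.v2/valid.jsonl"),
        ("test", "./data/FLD.v2/FLD.v2/test.jsonl"),
    ]:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        fld[split].to_json(path, lines=True)  # one JSON object per line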

Prover performance and metrics

A prover trained for 20000 steps on each corpus should perform as follows:

corpus         extr_stps.D-all.proof_accuracy   strct.D-all.proof_accuracy   D-all.answer_accuracy
FLD (FLD.3)    85.2                             75.8                         91.6
FLD★ (FLD.4)   60.6                             44.4                         72.2

As seen above, we have defined two types of proof accuracy metrics: strct (strict) and extr_stps (extra_steps).

The difference between the two metrics is most noticeable for a dataset instance with an unknown label: the strict metric allows the model to output only the __UNKNOWN__ marker, while the extra_steps metric also allows the model to output some logical steps that investigate whether the hypothesis can be (dis)proved.
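
As a toy illustration only (this is not the repository's actual evaluation code, and the real metrics are computed over full proofs and are more involved), the difference for an unknown-labeled instance could be sketched as follows:

    # Toy sketch of the two labeling policies described above; NOT the actual FLD evaluation code.
    from typing import List

    def strict_ok_for_unknown(proof_steps: List[str]) -> bool:
        # strict: for an unknown-labeled instance, the prediction must be the
        # __UNKNOWN__ marker and nothing else.
        return proof_steps == ["__UNKNOWN__"]

    def extra_steps_ok_for_unknown(proof_steps: List[str]) -> bool:
        # extra_steps: the prover may first emit some exploratory logical steps,
        # as long as it finally concludes __UNKNOWN__ (an assumption of this sketch).
        return len(proof_steps) > 0 and proof_steps[-1] == "__UNKNOWN__"

    prediction = ["<some exploratory logical step>", "__UNKNOWN__"]
    print(strict_ok_for_unknown(prediction))       # False
    print(extra_steps_ok_for_unknown(prediction))  # True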

We think both metrics have their pros and cons, and either is fine to use as long as the two are not conflated. Note that previous studies have used a metric close to extra_steps with respect to unknown labels.