Complex Claim Verification with Evidence Retrieved in the Wild

Getting started

Clone the repository and install the requirements:

pip install -r requirements.txt

Download the data to the local directory via: https://drive.google.com/file/d/1YMpr5hnJqzrXcp3kBwpVsAK0uv9q_Xf3/view?usp=sharing

Data format

The data files are formatted as jsonlines. The description of each field is as follows:

| Field | Type | Description |
| --- | --- | --- |
| example_id | string | Example ID |
| claim | string | Claim |
| label | string | Label: pants-fire, false, barely-true, half-true, mostly-true, true |
| person | string | Person who made the claim |
| venue | string | Date and venue of the claim |
| url | string | Politifact URL of the claim |
| justification | List[string] | Justification paragraph written by the fact-checkers |
| qg-output | List[string] | Sub-questions generated by claim decomposition |
| search_results | List[dict] | Bing search results without timestamp |
| search_results_timestamp | List[dict] | Bing search results with timestamp |
| summary | string | Summary generated by synthesizing the results from second-stage retrieval |
| summarization_prompt | string | Prompt used to generate the claim-focused summary |
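
As a quick sanity check after downloading, the sketch below reads a JSON Lines file and counts the examples per label. It relies only on the `label` field described above; the helper names `read_jsonl` and `label_counts` are illustrative and not part of the repository.

```python
import json
from collections import Counter


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def label_counts(examples):
    """Count how many examples carry each veracity label."""
    return Counter(ex["label"] for ex in examples)
```

For example, `label_counts(read_jsonl("./data/train.jsonl"))` would give the label distribution of the training split, assuming the file sits at that path.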

Each item in search_results is a dictionary with the following fields:

search_results = {
    "entities_info": [
        {"name": name of the entity included in the search results,
         "decsription": description of the entity
        }
        ...
    ],
    "pages_info": [
        {"page_name": name of the page included in the search results,
         "page_url": url of the page,
         "page_timestamp": timestamp of the page,
         "page_snippet": snippet of the page
        }
        ...
    ]
}
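
The sketch below walks this structure and yields the name, URL, and snippet of every retrieved page for one example. It assumes that `search_results` is a list of dictionaries shaped as shown above; `iter_pages` is a hypothetical helper, not a function from the repository.

```python
def iter_pages(example, key="search_results"):
    """Yield (page_name, page_url, page_snippet) for every page in an example.

    Assumes example[key] is a list of dicts, each with a "pages_info" list.
    Missing or empty fields are tolerated and produce no pages.
    """
    for result in example.get(key) or []:
        for page in result.get("pages_info") or []:
            yield (
                page.get("page_name"),
                page.get("page_url"),
                page.get("page_snippet"),
            )
```

Passing `key="search_results_timestamp"` would iterate over the timestamped results instead.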

Generate sub-questions using claim

To decompose the claim into a set of sub-questions, you can run the following command:

python generate_subquestions.py --input_file ./data/dev-site-restricted.jsonl --output_file OUTPUT_FILE_PATH

Use Bing API to retrieve evidence

To use the Bing API, you first need to register for a Bing API key. Then, you can run the following command to retrieve evidence using Bing:

python evidence_retrieval.py \
--input_url data/train.jsonl \
--output_file OUTPUT_FILE_PATH \
--use_time_stamp 1 \
--sites_constrain 1 \
--use_annotation 0 \
--use_claim 0 \
--question_num 10 \
--answer_count 10 \
--chunk_size 50 \
--time_offset 1

Check the argument description in evidence_retrieval.py for more details.

Second stage retrieval + summarization

To generate the claim-focused summary, you can run the following command:

python generate_summarization.py \
--input_path ./data/train-site-restricted.jsonl \
--corpus_path ./data/corpus/train.json \
--output_path OUTPUT_FILE_PATH \
--num 5 \
--window_size 30 \
--topk_units 10 \
--topk_docs 5 \
--stride 15 \
--use_claim 0 \
--use_annotation 0 \
--use_justification 0 \
--text_unit span \
--time_restricted 1 \
--time_window 1 \
--summ_type multi_summs \
--use_ChatGPT 0 \
--use_fewshot 0 \
--filter_politifact 0

Check the argument description in generate_summarization.py for more details.
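
To give a feel for the `--window_size` and `--stride` options, here is a minimal sketch of splitting a document into overlapping spans. It assumes that both values are counted in whitespace-separated tokens, which may differ from what generate_summarization.py actually does; `sliding_spans` is an illustrative name, not the script's function.

```python
def sliding_spans(text, window_size=30, stride=15):
    """Split text into overlapping spans of up to `window_size` tokens.

    Assumption: tokens are whitespace-separated words, and consecutive
    spans start `stride` tokens apart.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    tokens = text.split()
    spans = []
    for start in range(0, len(tokens), stride):
        spans.append(" ".join(tokens[start:start + window_size]))
        if start + window_size >= len(tokens):
            break  # the last span already reaches the end of the text
    return spans
```

With the values from the command above (30 and 15), each span shares half of its tokens with the next one, so a sentence cut by one span boundary is likely to appear intact in a neighbouring span.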

Veracity prediction

To predict the veracity of the claim, you can run the following command:

bash run_veracity_classification.sh
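
If you want to train your own classifier on the six labels listed in the data format, a label-to-index mapping along the following lines may be useful. This is only a sketch: the ordering (least to most truthful) and the names `LABEL2ID`, `ID2LABEL`, and `encode_labels` are assumptions, not necessarily what run_veracity_classification.sh uses.

```python
# Labels as listed in the data format, ordered from least to most truthful.
# The ordering is an assumption made for illustration.
LABELS = ["pants-fire", "false", "barely-true", "half-true", "mostly-true", "true"]

LABEL2ID = {label: i for i, label in enumerate(LABELS)}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}


def encode_labels(examples):
    """Map each example's string label to its integer class index."""
    return [LABEL2ID[ex["label"]] for ex in examples]
```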