# Complex Claim Verification with Evidence Retrieved in the Wild
## Getting started

Clone the repository and install the requirements:

```bash
pip install -r requirements.txt
```
Download the data to a local directory from Google Drive: https://drive.google.com/file/d/1YMpr5hnJqzrXcp3kBwpVsAK0uv9q_Xf3/view?usp=sharing
## Data format

The data files are in jsonlines format. Each field is described below:
| Field | Type | Description |
|---|---|---|
| example_id | string | Example ID |
| claim | string | Claim |
| label | string | Label: pants-fire, false, barely-true, half-true, mostly-true, true |
| person | string | Person who made the claim |
| venue | string | Date and venue of the claim |
| url | string | PolitiFact URL of the claim |
| justification | List[string] | Justification paragraph written by the fact-checkers |
| qg-output | List[string] | Sub-questions generated by claim decomposition |
| search_results | List[dict] | Bing search results without timestamp |
| search_results_timestamp | List[dict] | Bing search results with timestamp |
| summary | string | Summary generated by synthesizing the results from second-stage retrieval |
| summarization_prompt | string | Prompt used to generate the claim-focused summary |
Each entry in `search_results` is a dictionary with the following fields:
```python
search_results = {
    "entities_info": [
        {
            "name": ...,         # name of the entity included in the search results
            "description": ...,  # description of the entity
        },
        # ...
    ],
    "pages_info": [
        {
            "page_name": ...,       # name of the page included in the search results
            "page_url": ...,        # URL of the page
            "page_timestamp": ...,  # timestamp of the page
            "page_snippet": ...,    # snippet of the page
        },
        # ...
    ],
}
```
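As a quick sanity check, the jsonlines files can be loaded and the retrieved pages inspected with a few lines of Python (a minimal sketch; the helper names are ours, and the file path is just an example following the download step above):

```python
import json

def load_jsonl(path):
    """Read a jsonlines file into a list of dicts, one dict per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def iter_pages(example):
    """Yield (page_name, page_url, page_snippet) for every retrieved page
    in an example's search_results, following the structure shown above."""
    for result in example.get("search_results", []):
        for page in result.get("pages_info", []):
            yield page.get("page_name"), page.get("page_url"), page.get("page_snippet")

# Example usage (path is illustrative):
# examples = load_jsonl("./data/dev-site-restricted.jsonl")
# print(examples[0]["claim"], examples[0]["label"])
# for name, url, snippet in iter_pages(examples[0]):
#     print(name, url)
```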
## Generate sub-questions from the claim

To decompose the claim into a set of sub-questions, run:

```bash
python generate_subquestions.py --input_file ./data/dev-site-restricted.jsonl --output_file OUTPUT_FILE_PATH
```
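Claim decomposition of this kind is typically done by prompting a language model. The sketch below shows one hypothetical way to build such a prompt; the template and helper name are illustrative assumptions, not the repository's actual code (which lives in `generate_subquestions.py`):

```python
def build_decomposition_prompt(claim, num_questions=3):
    """Build a simple prompt asking a model to split a claim into
    sub-questions. The template is illustrative only."""
    return (
        f"Decompose the following claim into {num_questions} "
        f"sub-questions that, taken together, would verify the claim.\n"
        f"Claim: {claim}\n"
        f"Sub-questions:"
    )

# Example usage: send the prompt to your LLM of choice and parse
# one sub-question per output line.
# prompt = build_decomposition_prompt("The U.S. unemployment rate fell in 2021.")
```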
## Use the Bing API to retrieve evidence

To use the Bing API, you first need to register for a Bing API key. Then run the following command to retrieve evidence with Bing:
```bash
python evidence_retrieval.py \
    --input_url data/train.jsonl \
    --output_file OUTPUT_FILE_PATH \
    --use_time_stamp 1 \
    --sites_constrain 1 \
    --use_annotation 0 \
    --use_claim 0 \
    --question_num 10 \
    --answer_count 10 \
    --chunk_size 50 \
    --time_offset 1
```
Check the argument descriptions in `evidence_retrieval.py` for more details.
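The script presumably wraps the Bing Web Search v7 API (endpoint `https://api.bing.microsoft.com/v7.0/search`, authenticated with the `Ocp-Apim-Subscription-Key` header). As a rough sketch of how a raw v7 response maps onto the `pages_info` format shown earlier — the field mapping is an assumption based on the public v7 response schema, not the repository's code:

```python
def parse_bing_response(resp):
    """Convert a Bing Web Search v7 JSON response (a dict) into the
    pages_info shape used in this dataset. Mapping assumed from the
    public schema: webPages.value -> name / url / snippet / dateLastCrawled."""
    pages = resp.get("webPages", {}).get("value", [])
    return [
        {
            "page_name": p.get("name"),
            "page_url": p.get("url"),
            "page_timestamp": p.get("dateLastCrawled"),
            "page_snippet": p.get("snippet"),
        }
        for p in pages
    ]
```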
## Second-stage retrieval and summarization

To generate the claim-focused summary, run:
```bash
python generate_summarization.py \
    --input_path ./data/train-site-restricted.jsonl \
    --corpus_path ./data/corpus/train.json \
    --output_path OUTPUT_FILE_PATH \
    --num 5 \
    --window_size 30 \
    --topk_units 10 \
    --topk_docs 5 \
    --stride 15 \
    --use_claim 0 \
    --use_annotation 0 \
    --use_justification 0 \
    --text_unit span \
    --time_restricted 1 \
    --time_window 1 \
    --summ_type multi_summs \
    --use_ChatGPT 0 \
    --use_fewshot 0 \
    --filter_politifact 0
```
Check the argument descriptions in `generate_summarization.py` for more details.
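The `--window_size 30 --stride 15` arguments suggest that documents are split into overlapping windows of text units before second-stage retrieval ranks the top-k units. A hypothetical sketch of that windowing step (the helper itself is illustrative, not the repository's implementation):

```python
def sliding_windows(tokens, window_size=30, stride=15):
    """Split a token list into overlapping windows (candidate text units).
    With stride < window_size, consecutive windows overlap, so no span
    is cut off at a window boundary."""
    windows = []
    for start in range(0, max(len(tokens) - window_size, 0) + 1, stride):
        windows.append(tokens[start:start + window_size])
    return windows

# Example usage:
# units = sliding_windows("some long document".split(), window_size=30, stride=15)
```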
## Veracity prediction

To predict the veracity of a claim, run:

```bash
bash run_veracity_classification.sh
```
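For reference, the six-way label set from the data format table can be encoded as ordered integer classes for a classifier; the ordering and helper below are a minimal sketch, not necessarily what `run_veracity_classification.sh` uses internally:

```python
# The six PolitiFact veracity labels, ordered from least to most truthful.
LABELS = ["pants-fire", "false", "barely-true", "half-true", "mostly-true", "true"]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}

def encode_label(label):
    """Map a gold label string to an integer class id for classification."""
    return LABEL2ID[label]
```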