Awesome
Data and Code for "Prompt-and-Align: Prompt-Based Social Alignment for Few-Shot Fake News Detection" (CIKM 2023)
This repo contains the data and code for the following paper:
Jiaying Wu, Shen Li, Ailin Deng, Miao Xiong, Bryan Hooi. Prompt-and-Align: Prompt-Based Social Alignment for Few-Shot Fake News Detection, ACM Conference on Information and Knowledge Management (CIKM) 2023.
Abstract
Despite considerable advances in automated fake news detection, due to the timely nature of news, it remains a critical open question how to effectively predict the veracity of news articles based on limited fact-checks. Existing approaches typically follow a “Train-from-Scratch” paradigm, which is fundamentally bounded by the availability of large-scale annotated data. While expressive pre-trained language models (PLMs) have been adapted in a “Pre-Train-and-Fine-Tune” manner, the inconsistency between pre-training and downstream objectives also requires costly task-specific supervision. In this paper, we propose “Prompt-and-Align” (P&A), a novel prompt-based paradigm for few-shot fake news detection that jointly leverages the pre-trained knowledge in PLMs and the social context topology. Our approach mitigates label scarcity by wrapping the news article in a task-related textual prompt, which is then processed by the PLM to directly elicit task-specific knowledge. To supplement the PLM with social context without inducing additional training overheads, motivated by empirical observation on user veracity consistency (i.e., social users tend to consume news of the same veracity type), we further construct a news proximity graph among news articles to capture the veracity-consistent signals in shared readerships, and align the prompting predictions along the graph edges in a confidence-informed manner. Extensive experiments on three real-world benchmarks demonstrate that P&A sets new states-of-the-art for few-shot fake news detection performance by significant margins.
Requirements
We implement our proposed ''Prompt-and-Align'' (P&A) model and its variants based on PyTorch 1.8.0 with CUDA 11.1, and train them on a server running Ubuntu 18.04 with NVIDIA RTX 3090 GPU and Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz.
Main dependencies include:
python==3.7.0
numpy==1.21.4
torch==1.8.0+cu111
transformers==4.13.0
Data
Our work is based on three benchmark datasets, including the PolitiFact
and GossipCop
datasets from the FakeNewsNet benchmark, and the FANG
dataset (Nguyen et al., 2020).
Extract the files in data.tar.gz
to obtain an unzipped data/
folder. The resultant data/
should contain a list of .csv files and the following three folders: adjs/
, news_article_raw/
and social_context_raw/
.
News Articles
The .csv files under data/
contain the news articles and their ground-truth labels. For each news article, label 0
represents real news, whereas label 1
represents fake news.
For each dataset, data/[dataset_name]_train_[train_size].csv
is the training set given train_size
annotated samples, and data/[dataset_name]_test.csv
is the test set.
Social Context
The .pkl files under data/user_news_graph/
each contain the following list: [train_conf, adj]. train_conf
of shape [train_size, 2]
refers to the ground-truth confidence matrix of the training samples (i.e., each row being [1., 0.]
or [0., 1.]
), and adj_matrix
refers to the adjacency matrix of the news proximity graph.
The adjacency matrix of the news proximity graph is constructed via the following steps:
- Collect the user-news repost records for all samples (including
train_size
training samples and the test samples), specifically in the form of[sid, tid, uid]
for each line, which means that useruid
has reposted news articlessid
and the repost has Tweet ID oftid
. - Filter the records with threshold $t_u$ to focus on the set of active social users. In our method, $t_u$ is set to 5, i.e., we only consider the social users with at least 5 repost records.
- Construct a user engagement matrix $\mathbf{B}$ of size
[num_active_users, num_news]
where the value of element [k, m] in the matrix represents the number of reposts between active user $u_k$ and news article $T_m$. - Formulate the adjacency matrix as $\mathbf{A}_ {n}=\mathbf{B}^\top \mathbf{B}$, and conduct normalization to derive the final normalized adjacency matrix as $\mathbf{A}_ {\mathcal{T}} = \mathbf{D}_ {n}^{-\frac{1}{2}} \mathbf{A}_ {n} \mathbf{D}_ {n}^{-\frac{1}{2}}$.
Raw Data
We provide the raw user-news engagement records used to construct the above-mentioned news proximity graph from scratch, at data/social_context_raw/[dataset_name]_socialcontext_train[train_size].csv
. There, each line is given as: [sid,label,tid,uid]
, meaning that user uid
has reposted news articles sid
of veracity label (0: real; 1: fake), and the repost has Tweet ID of tid
.
We also provide the raw news articles and metadata in the format of [news_id,label,title,content]
at data/news_articles_raw/[dataset_name]_full_train[train_size].csv
. Here, news_id
refer to the article's ID in the FakeNewsNet / FANG dataset. In each file, the [train_size] training samples are placed first, followed by the test samples.
Constructing the News Proximity Graph from the Raw Social Context
As an alternative to using our pre-processed adjacency matrices under data/adjs/
, we provide a pre-processing script at Process/adj_matrix_fewshot.py
to construct the matrices from scratch.
Construct the adjacency matrices with the following command:
mkdir data/adjs_from_scratch
python Process/adj_matrix_fewshot.py
Obtaining Raw Social Context via FakeNewsNet Scripts
The datasets used in our work are in line with the FakeNewsNet format. To retrieve the social engagements and auxiliary features related to social context, please follow the instructions and scripts given in the FakeNewsNet GitHub repo.
Run Prompt-and-Align
The P&A source code is provided in prompt_and_align.py
.
Start training with the following command:
sh run.sh
Results will be saved under the logs/
directory. We also provide our original experiment logs under logs/logs_archive/
.
Contact
jiayingwu [at] u.nus.edu
Citation
If you find this repo or our work useful for your research, please consider citing our paper
@inproceedings{wu2023prompt,
author = {Wu, Jiaying and Li, Shen and Deng, Ailin and Xiong, Miao and Hooi, Bryan},
title = {Prompt-and-Align: Prompt-Based Social Alignment for Few-Shot Fake News Detection},
year = {2023},
booktitle = {Proceedings of the 32nd ACM International Conference on Information and Knowledge Management},
pages = {2726–2736}
}