# Med-MMHL

This is the repository of the dataset corresponding to the article *Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain*. The data can be found here.

## Dataset Description

The data are already split into train/dev/test sets.

The table below summarizes each task and its data location; dataset statistics are given in Tab 2 of our paper.

| Task | Benchmarked Results | Data Location |
| --- | --- | --- |
| Fake news detection | Tab 3 in paper | `fakenews_article` |
| LLM-generated fake sentence detection | Tab 3 in paper | `sentence` |
| Multimodal fake news detection | Tab 3 in paper | `image_article` |
| Fake tweet detection | Tab 4 in paper | `fakenews_tweet` |
| Multimodal tweet detection | Tab 4 in paper | `image_tweet` |

For multimodal tasks, the paths to the images are stored in the column `image`. A news path looks like `/images/2023-05-09_fakenews/LeadStories/551_32.png`. You do not need to modify these paths if the `images` folder is placed in the root directory of your project.
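
As a minimal sketch of how those stored paths can be resolved against your project root (the helper and root location are illustrative, not part of the repository; adjust to your layout):

```python
from pathlib import Path

# Hypothetical project root; the dataset stores image paths that look
# absolute, e.g. /images/2023-05-09_fakenews/LeadStories/551_32.png.
PROJECT_ROOT = Path(".")

def resolve_image_path(stored_path: str, root: Path = PROJECT_ROOT) -> Path:
    # Strip the leading "/" so the path is joined relative to the project
    # root, where the images/ folder is expected to live.
    return root / stored_path.lstrip("/")

p = resolve_image_path("/images/2023-05-09_fakenews/LeadStories/551_32.png")
```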

Given the tweet IDs, the content and images of tweets can be crawled with the code `collect_by_tweetid_tweepy_clean.py` or any other legitimate Twitter extraction tool.
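
The hydration step itself depends on your Twitter API access, but as an illustrative sketch (the helper is hypothetical, not part of the repository), tweet IDs can be grouped into batches of at most 100, the per-request limit of the v1.1 `statuses/lookup` endpoint, before being passed to Tweepy or another tool:

```python
from typing import Iterable, List

def batch_tweet_ids(ids: Iterable[str], batch_size: int = 100) -> List[List[str]]:
    """Group tweet IDs into batches of at most `batch_size` for bulk lookup."""
    cleaned = [i.strip() for i in ids if i.strip()]
    return [cleaned[k:k + batch_size] for k in range(0, len(cleaned), batch_size)]

batches = batch_tweet_ids(str(n) for n in range(250))
```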

## Environment Configuration

```shell
conda env create -f clip_env.yaml
conda activate clip_env
```

## Running Baselines

Most of our baselines are drawn from Hugging Face, so you need to provide the model name for the code to run. The Hugging Face models included in our baseline experiments are listed below.

| Model Name | Hugging Face Name |
| --- | --- |
| BERT | `bert-base-cased` |
| BioBERT | `pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb` |
| Funnel Transformer | `funnel-transformer/medium-base` |
| FN-BERT | `ungjus/Fake_News_BERT_Classifier` |
| SentenceBERT | `sentence-transformers/all-MiniLM-L6-v2` |
| DistilBERT | `sentence-transformers/msmarco-distilbert-base-tas-b` |
| CLIP | `openai/clip-vit-base-patch32` |
| VisualBERT | `uclanlp/visualbert-vqa-coco-pre` |
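
To avoid retyping the full Hugging Face identifiers, a small helper (purely illustrative; the alias dict and function are not part of the repository) can map the short model names above to their Hugging Face names and assemble the training command:

```python
# Short aliases for the Hugging Face model names listed above.
HF_MODELS = {
    "BERT": "bert-base-cased",
    "BioBERT": "pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb",
    "Funnel Transformer": "funnel-transformer/medium-base",
    "FN-BERT": "ungjus/Fake_News_BERT_Classifier",
    "SentenceBERT": "sentence-transformers/all-MiniLM-L6-v2",
    "DistilBERT": "sentence-transformers/msmarco-distilbert-base-tas-b",
    "CLIP": "openai/clip-vit-base-patch32",
    "VisualBERT": "uclanlp/visualbert-vqa-coco-pre",
}

def build_train_command(model: str, data_path: str, dataset_type: str,
                        device: int = 0, batch_size: int = 4) -> list:
    """Assemble the argument list for fake_news_detection_main.py."""
    return [
        "python", "fake_news_detection_main.py",
        "-bert-type", HF_MODELS[model],
        "-device", str(device),
        "-batch-size", str(batch_size),
        "-benchmark-path", data_path,
        "-dataset-type", dataset_type,
    ]

cmd = build_train_command("BioBERT", "path/to/your/data", "fakenews_article")
```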

Below are some examples of training and testing the Hugging Face models. Please refer to the code for additional configurable arguments.

To train a fine-tuned version of BioBERT, the command looks like this:

```shell
python fake_news_detection_main.py \
    -bert-type pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb \
    -device 0 \
    -batch-size 4 \
    -benchmark-path path/to/your/data \
    -dataset-type fakenews_article
```

To test an existing model, the command is:

```shell
python fake_news_detection_main.py \
    -bert-type pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb \
    -device 0 \
    -batch-size 4 \
    -benchmark-path path/to/your/data \
    -dataset-type fakenews_article \
    -snapshot path/to/your/model \
    -test
```

Similarly, to train and test a multimodal model, the commands are:

```shell
python fake_news_detection_multimodal_main.py \
    -clip-type uclanlp/visualbert-vqa-coco-pre \
    -device 0 \
    -batch-size 4 \
    -benchmark-path path/to/your/data \
    -dataset-type image_article
```

and

```shell
python fake_news_detection_multimodal_main.py \
    -clip-type uclanlp/visualbert-vqa-coco-pre \
    -device 0 \
    -batch-size 4 \
    -benchmark-path path/to/your/data \
    -dataset-type image_article \
    -snapshot path/to/your/model \
    -test
```

If you find the dataset helpful, please cite:

```bibtex
@article{sun2023med,
  title={Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain},
  author={Sun, Yanshen and He, Jianfeng and Lei, Shuo and Cui, Limeng and Lu, Chang-Tien},
  journal={arXiv preprint arXiv:2306.08871},
  year={2023}
}
```

or

Sun, Yanshen, et al. "Med-MMHL: A Multi-Modal Dataset for Detecting Human- and LLM-Generated Misinformation in the Medical Domain." arXiv preprint arXiv:2306.08871 (2023).