
NewsCLIPpings Dataset

NewsCLIPpings is a dataset of automatically generated out-of-context image-caption pairs in the news media. For inquiries and requests, please contact graceluo@berkeley.edu.

Requirements

Make sure you are running Python 3.6+.

Getting Started

  1. Request the VisualNews Dataset and place the files under the visual_news folder.
  2. Run ./download.sh to download our matches and place them in news_clippings/data/.
  3. Optionally, use the embeddings we provide (place them in news_clippings/embeddings/) for analyses of your own.

All of the ids and image paths provided in our data/ folder exactly correspond to those listed in the data.json file in VisualNews.
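
As a quick sanity check after downloading, you can confirm this correspondence yourself. A minimal sketch, using the merged_balanced validation split shown later in this README:

    import json
    
    # Collect the set of ids defined by VisualNews.
    visual_news_data = json.load(open("visual_news/origin/data.json"))
    visual_news_ids = {ann["id"] for ann in visual_news_data}
    
    # Every id referenced by our matches should already exist in VisualNews.
    data = json.load(open("news_clippings/data/merged_balanced/val.json"))
    missing = [ann["id"] for ann in data["annotations"] if ann["id"] not in visual_news_ids]
    print(len(missing), "ids missing from VisualNews")  # expect 0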


Your file structure should look like this:

    news_clippings/
    ├── data/
    └── embeddings/

    visual_news/
    ├── origin/
    │   ├── data.json
    │   └── ...
    └── ...

Data Format

The data is ordered such that every even-indexed sample (0-based) is pristine, and the sample immediately following it is its associated falsified counterpart.

Here's an example of how you can start using our matches:

    import json

    # Index the VisualNews annotations by id.
    visual_news_data = json.load(open("visual_news/origin/data.json"))
    visual_news_data_mapping = {ann["id"]: ann for ann in visual_news_data}

    data = json.load(open("news_clippings/data/merged_balanced/val.json"))
    annotations = data["annotations"]
    ann = annotations[0]

    # "id" keys the caption's source article; "image_id" keys the paired image.
    caption = visual_news_data_mapping[ann["id"]]["caption"]
    image_path = visual_news_data_mapping[ann["image_id"]]["image_path"]

    print("Caption: ", caption)
    print("Image Path: ", image_path)
    print("Is Falsified: ", ann["falsified"])

Embeddings

We include two sets of precomputed embeddings: the embedding types used in the construction of our dataset, and additional embedding types that were not used in construction but that you may find useful.

All embeddings are dictionaries of {id: numpy array} stored in pickle files for train / val / test. You can access the features for each image / caption by its id like so:

    import pickle

    # Each pickle file maps a VisualNews id to its numpy feature vector.
    clip_image_embeddings = pickle.load(open("news_clippings/embeddings/clip_image_embeddings/test.pkl", "rb"))
    sample_id = 701864
    print(clip_image_embeddings[sample_id])
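
For a joint image-caption feature, you can look up both sides of an annotation and concatenate them. A minimal sketch, reusing ann from the Data Format example; the clip_text_embeddings path is an assumption, mirroring the image embedding path above:

    import pickle
    import numpy as np

    # Assumption: caption embeddings live under a parallel clip_text_embeddings/
    # directory, keyed by the same ids as the image embeddings.
    clip_image_val = pickle.load(open("news_clippings/embeddings/clip_image_embeddings/val.pkl", "rb"))
    clip_text_val = pickle.load(open("news_clippings/embeddings/clip_text_embeddings/val.pkl", "rb"))

    # Concatenate the caption embedding with its (possibly mismatched) image
    # embedding; ann comes from the val split in the Data Format example.
    feature = np.concatenate([
        clip_text_val[ann["id"]],          # caption side
        clip_image_val[ann["image_id"]],   # image side
    ])
    print(feature.shape)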

Metadata

We also provide additional metadata, such as the spaCy and REL named entities, and the timestamp and location of the original article content.

Training

To run the benchmark experiments reported in our paper, see the README in news_clippings_training/.

Citing

If you find our dataset useful for your research, please cite the following paper:

    @article{luo2021newsclippings,
      title={NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media},
      author={Luo, Grace and Darrell, Trevor and Rohrbach, Anna},
      journal={arXiv preprint arXiv:2104.05893},
      year={2021}
    }