RSIVQA (Remote Sensing Image Visual Question Answering)

RSIVQA is a remote sensing visual question answering dataset proposed in the paper "Mutual Attention Inception Network for Remote Sensing Visual Question Answering". If you use this dataset in your work, please cite our paper.

[1] X. Zheng, B. Wang, X. Du, and X. Lu, “Mutual Attention Inception Network for Remote Sensing Visual Question Answering,” IEEE Transactions on Geoscience and Remote Sensing, 2021.

@article{ 9444570,
author = {Zheng, Xiangtao and Wang, Binqiang and Du, Xingqian and Lu, Xiaoqiang},
doi = {10.1109/TGRS.2021.3079918},
issn = {0196-2892},
journal = {IEEE Transactions on Geoscience and Remote Sensing},
title = {{Mutual Attention Inception Network for Remote Sensing Visual Question Answering}},
year = {2021}
}

Overview

The RSIVQA dataset is derived from existing remote sensing image (RSI) datasets using a specially designed generation method.

Currently, the images in RSIVQA come from three RSI classification datasets (UCM, Sydney and AID) and two RSI object detection datasets (HRRSD and DOTA). Questions and answers are generated from the images to form image-question-answer triplets. The questions, the answers and their correspondence to the images are stored in txt files in this repository. The images themselves can be downloaded from the link of each source dataset.

Details

There are 37264 images and 111134 image-question-answer triplets in the dataset. Detailed information is summarized in the table below.

| Item | Amount |
| --- | --- |
| The number of images | 37264 |
| The number of unique questions | 91 |
| The number of unique answers | 574 |
| The total number of VQA triplets | 111134 |
| The number of yes or no VQA triplets | 64610 |
| The number of number VQA triplets | 32331 |
| The number of others VQA triplets | 14193 |

A small part of RSIVQA is annotated by humans. The rest is generated automatically from existing scene classification and object detection datasets. For more details on the generation method, please refer to the paper.
Note that 559 triplets have been added in the current version, which brings the total to 111693 VQA triplets.
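
As a quick consistency check on the figures above, the three answer-type counts should sum to the stated total, and adding the 559 new triplets should give the updated total. The variable names below are only illustrative.

```python
# Counts copied from the table above.
yes_no, number, others = 64610, 32331, 14193

# The three answer types should account for every triplet.
total = yes_no + number + others
print(total)          # 111134

# Triplets added in the current version.
updated_total = total + 559
print(updated_total)  # 111693
```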

Files

The questions and answers are saved in txt files. Each line holds one image-question-answer (IQA) triplet: the image file name, a colon, the question ending in a question mark, and then the answer.
For example, in sydney_vqa.txt, the first line is

1.tif:what is the theme of this picture?residential

which means that the question for the image "1.tif" is "what is the theme of this picture?" and the answer to that question is "residential". "1.tif" is the file name of the image.
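
The line format described above can be split with a few lines of Python. This is a minimal sketch, not code shipped with the repository: `Triplet` and `parse_triplet` are hypothetical names, and it assumes that the first colon ends the image file name and that the first question mark after it ends the question.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    image: str
    question: str
    answer: str


def parse_triplet(line: str) -> Triplet:
    """Split one 'image:question?answer' line into its three parts."""
    line = line.strip()
    # Assumption: the first colon separates the image file name from the rest.
    image, colon, rest = line.partition(":")
    # Assumption: the first question mark closes the question.
    question, mark, answer = rest.partition("?")
    if not colon or not mark:
        raise ValueError(f"not an image:question?answer line: {line!r}")
    # Put the question mark back so the question reads as written in the file.
    return Triplet(image.strip(), question.strip() + "?", answer.strip())


example = parse_triplet("1.tif:what is the theme of this picture?residential")
print(example.image, "|", example.question, "|", example.answer)
```

Lines that lack either separator raise `ValueError` instead of being parsed silently, which makes malformed entries easy to spot.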

Image-question-answer triplets for each original scene classification or object detection dataset are saved in the corresponding folder.
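
To gather the triplets from every folder at once, something like the following could be used. It is a sketch under assumptions: `load_triplets` is a hypothetical helper, the folder names are not assumed (every `*.txt` file under the given root is read), the files are assumed to be UTF-8 text, and lines that do not match the `image:question?answer` format are skipped.

```python
from pathlib import Path


def load_triplets(root):
    """Return (source_file, image, question, answer) tuples for every *.txt under root."""
    rows = []
    for txt in sorted(Path(root).rglob("*.txt")):
        # Assumption: the annotation files are UTF-8 encoded.
        with open(txt, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                image, colon, rest = line.partition(":")
                question, mark, answer = rest.partition("?")
                if not colon or not mark:
                    continue  # skip lines that do not match the documented format
                rows.append((txt.name, image, question + "?", answer))
    return rows
```

Calling `load_triplets(".")` from the repository root would then return one tuple per triplet, tagged with the txt file it came from.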