

European Flood 2013 Dataset (v1.0)

Examples from the dataset for each of the 3 tasks

This repository contains metadata and annotations of a flood dataset used in the context of interactive content-based image retrieval. The goal is to retrieve, from a larger amount of data, those images that are useful for deriving a certain type of information.

For this particular dataset, three such information objectives have been defined:

The majority of the 3,710 images in the dataset relate to the central European floods in May/June 2013. They were fetched in July 2017 from the Wikimedia Commons category "Central Europe floods, May/June 2013" and its sub-categories. The category "Transport during 2013 Vltava flood in Prague" was excluded because it relates to public transportation during the flood but does not actually show flooding. A total of 3,435 images come from this source, 890 of which contain metadata about the geographical location where the photo was taken.

As can be seen from the following map, the majority of images for which geolocation information is available are located in the areas of Dresden (Germany) and Prague (Czech Republic).

<p align="center"><img alt="Map" src="map.png"></p>

275 additional images showing water pollution have been harvested manually by querying online image search engines for the major oil spill events of the past few years. While the images from Wikimedia Commons are identified by their page ID, the pollution images are numbered consecutively from 1 to 275 and their identifiers are prefixed with "pollution_".

We recently released a related and similarly annotated dataset of flood-related images posted on Twitter, together with two classification models trained on the European Flood 2013 dataset and evaluated on the Twitter images. You can find both the Twitter dataset and the models in this repository.


The following paper describes the dataset in detail and conducts initial experiments for interactive flood image retrieval:

Björn Barz, Kai Schröter, Moritz Münch, Bin Yang, Andrea Unger, Doris Dransch, and Joachim Denzler.
"Enhancing Flood Impact Analysis using Interactive Image Retrieval of Social Media Images."
Archives of Data Science, Series A, 5.1, 2018.

If you use the dataset, please cite this paper.

Obtaining the Images

We provide two variants of the images in this dataset. In the first variant, the images are resized so that the smaller side is at most 512 pixels. In the second variant, the smaller side is limited to 1280 pixels.

To create a realistic image retrieval scenario, distractor images from the Flickr100k dataset, which can be obtained here, are usually added to this dataset.

Relevance Annotations

All images in the dataset have been annotated by hydrologists regarding their relevance for each of the three tasks mentioned above. Naturally, an image can be relevant for more than one task or for no task at all.

The relevance annotations are provided in the directory relevance, which contains one text file for each task. Each text file contains a list of identifiers of images relevant for this task. Additionally, the file irrelevant.txt lists all images that are not relevant for any task.
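The layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset tooling: it assumes that each text file holds one identifier per line, and the helper names `load_relevance` and `tasks_for_image` are hypothetical.

```python
from pathlib import Path


def load_relevance(relevance_dir):
    """Map each file stem (e.g. "irrelevant") to the set of image IDs it lists."""
    labels = {}
    for path in sorted(Path(relevance_dir).glob("*.txt")):
        with open(path, encoding="utf-8") as f:
            # Assumption: one identifier per line; blank lines are skipped.
            labels[path.stem] = {line.strip() for line in f if line.strip()}
    return labels


def tasks_for_image(labels, image_id):
    """Return the sorted names of all lists that contain image_id."""
    return sorted(name for name, ids in labels.items() if image_id in ids)
```

Calling `load_relevance("relevance")` would then give one set of identifiers per file. Sets are used so that membership tests and overlaps between tasks are cheap to compute.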

The following Venn diagram illustrates the number of images assigned to each task:

Number of images relevant for each task

Selected Queries

For each task, we selected some images that are particularly suitable as query images, i.e., as a starting point for a content-based image retrieval system. The identifiers of these images are listed in the text files in the queries directory.

Important Image Regions

Some of the images have been annotated with bounding boxes denoting regions which are particularly important for the relevance of the image. For example, traffic signs or humans standing in the water could be helpful for determining inundation depth.

These annotations are provided in the directory important_regions, which contains a JSON file for each task. Each JSON file contains a dictionary mapping image identifiers to a set of region groups. Each group is identified by a number and contains a list of bounding boxes, given by the coordinates of the top left corner and the width and height of the box. Both the coordinates and the dimensions of the bounding boxes are given relative to the dimensions of the image. Thus, to obtain the actual pixel values, left and width must be multiplied by the width of the image, and top and height must be multiplied by its height.

A group consisting of more than one region indicates that the regions in this group have to appear together in an image for it to be relevant.
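The conversion from relative to pixel coordinates could look roughly like the sketch below. It assumes that every bounding box is a JSON object with the keys `left`, `top`, `width` and `height`, and that each image identifier maps to an object keyed by group number. Both are guesses based on the description above, so check them against the actual files. The function names and the example identifier `12345` are hypothetical.

```python
import json


def to_pixel_box(box, image_width, image_height):
    """Convert one relative box to integer pixel values (left, top, width, height)."""
    # Assumption: each box is a dict with the keys "left", "top", "width", "height".
    return (
        round(box["left"] * image_width),
        round(box["top"] * image_height),
        round(box["width"] * image_width),
        round(box["height"] * image_height),
    )


def pixel_regions(annotations, image_id, image_width, image_height):
    """Return {group number: [pixel boxes]} for one image, or {} if it has no boxes."""
    # Assumption: annotations[image_id] maps group numbers to lists of boxes.
    groups = annotations.get(image_id, {})
    return {
        group: [to_pixel_box(b, image_width, image_height) for b in boxes]
        for group, boxes in groups.items()
    }


# Made-up annotation in the assumed structure, for a 640x480 image.
example = json.loads(
    '{"12345": {"0": [{"left": 0.25, "top": 0.5, "width": 0.5, "height": 0.25}]}}'
)
print(pixel_regions(example, "12345", 640, 480))
```

Keeping the group numbers in the result preserves the information about which regions have to appear together.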

Wikimedia Metadata

The metadata for the images fetched from Wikimedia Commons is provided in the file metadata.json. It contains an array of objects, each one describing a particular image using the following attributes:

Pre-computed Features

We provide two sets of pre-computed features: one containing features of the images in this dataset only and the other one containing features for the images from Flickr100k in addition.

Both sets contain 4 pickle files for different types of features:

Each pickle file contains a dictionary mapping image identifiers to feature vectors. The identifiers of the images from Flickr100k are prefixed with "Flickr100k_".
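As an illustration, the sketch below loads one such dictionary and separates the distractor features from those of this dataset using the identifier prefix. The function names are hypothetical, and the path passed to `load_features` is whichever pickle file you downloaded. If a file was written by Python 2, passing `encoding="latin1"` to `pickle.load` may be necessary; that is an assumption, not something stated here.

```python
import pickle

FLICKR_PREFIX = "Flickr100k_"


def load_features(pickle_path):
    """Load a dictionary mapping image identifiers to feature vectors."""
    # Only unpickle files from a source you trust: pickle can run code on load.
    with open(pickle_path, "rb") as f:
        return pickle.load(f)


def split_by_origin(features):
    """Separate features of this dataset from those of Flickr100k distractors."""
    dataset, distractors = {}, {}
    for image_id, vector in features.items():
        # Identifiers are converted to str in case they were stored as integers.
        is_distractor = str(image_id).startswith(FLICKR_PREFIX)
        (distractors if is_distractor else dataset)[image_id] = vector
    return dataset, distractors
```

Splitting by prefix makes it possible to evaluate retrieval with and without distractors from the same feature file.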

For the images from this dataset only, we also provide local features extracted from the last convolutional layer of VGG16 without pooling: VGG16_relu5_3.pickle (240 MB)
The feature matrices are stored with the channels along the first axis.