Sapsucker Woods 60 Audiovisual Dataset

We present a new benchmark dataset, Sapsucker Woods 60 (SSW60), for advancing research on audiovisual fine-grained categorization. The dataset covers 60 species of birds that all occur in a specific geographic location: Sapsucker Woods, Ithaca, NY. It comprises images from existing datasets and brand-new, expert-curated audio and video data. These species have a high probability of being seen or heard on the live FeederWatch Cam hosted at the Cornell Lab of Ornithology. The entire dataset is packaged into a single tar file; see below for the download link. For questions, clarifications, or problems, please open an Issue on this repository.

You can find additional information about the dataset and detailed experimental results in our ECCV 2022 paper.

Dataset Components

Download Link

This dataset was compiled solely for use by computer vision researchers. The media in the SSW60 dataset are not to be redistributed or used for non-research purposes. Please read the Terms of Use included with the dataset. The dataset can be downloaded here [31.1GB].

Running md5sum ssw60.tar.gz should produce af0a54ea1a897d130d91be8ffe0de81c ssw60.tar.gz. The dataset is approximately 32GB untarred.
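
If md5sum is not available on your platform, the same check can be done in Python. This is a minimal sketch using only the standard library; it assumes the archive was saved as ssw60.tar.gz in the current working directory, and the helper name md5_of_file is ours, not part of the dataset tooling.

```python
import hashlib
from pathlib import Path

# Checksum quoted in the text above.
EXPECTED_MD5 = "af0a54ea1a897d130d91be8ffe0de81c"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for the ~31 GB archive.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = Path("ssw60.tar.gz")  # assumed download location
    if archive.exists():
        actual = md5_of_file(archive)
        print("OK" if actual == EXPECTED_MD5 else f"MISMATCH: {actual}")
    else:
        print(f"{archive} not found")
```

Hashing the full archive takes a while; a mismatch usually indicates an incomplete download.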

Species Information

Information for the 60 species can be found in the taxa.csv file. We've provided label mappings between various datasets/taxonomies and the SSW60 dataset.

Example data from taxa.csv:

|   | label | species_code | inat_taxon_id | inat2021_label | nabirds_labels | common_name | scientific_name | family | order |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | cangoo | 7089 | 3226 | 57 457 | Canada Goose | Branta canadensis | Anatidae (Ducks, Geese, and Waterfowl) | Anseriformes |
| 1 | 1 | wooduc | 7107 | 3188 | 81 314 613 | Wood Duck | Aix sponsa | Anatidae (Ducks, Geese, and Waterfowl) | Anseriformes |
| 2 | 2 | mallar3 | 6930 | 3201 | 102 317 616 | Mallard | Anas platyrhynchos | Anatidae (Ducks, Geese, and Waterfowl) | Anseriformes |
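
As an illustration of how the label mappings might be used, the sketch below builds a lookup from another dataset's label to the SSW60 label. It assumes that a cell in nabirds_labels can hold several whitespace-separated labels (as the example rows suggest), that every cell is populated, and that taxa.csv sits in the untarred dataset root; build_label_map is a hypothetical helper, not something shipped with the dataset.

```python
import csv
from pathlib import Path


def build_label_map(rows, column):
    """Map each label found in `column` to the row's SSW60 label.

    A cell may hold several whitespace-separated labels; single-valued
    cells are handled the same way.
    """
    mapping = {}
    for row in rows:
        for source_label in row[column].split():
            mapping[int(source_label)] = int(row["label"])
    return mapping


if __name__ == "__main__":
    taxa_path = Path("taxa.csv")  # assumed location
    if taxa_path.exists():
        with taxa_path.open(newline="") as handle:
            rows = list(csv.DictReader(handle))
        nabirds_map = build_label_map(rows, "nabirds_labels")
        inat_map = build_label_map(rows, "inat2021_label")
        print(len(nabirds_map), len(inat_map))
```

With such a map, predictions from a model trained on NABirds or iNat2021 labels can be translated into the 60 SSW60 labels before evaluation.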

Videos

5,400 mp4 video files can be found in the video_ml/ directory. Split and metadata information can be found in the video_ml.csv file. All video files have been converted to 25 frames per second and have the .mp4 file extension. The file path for a video is constructed as video_ml/{asset_id}.mp4.

Example data from video_ml.csv:

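|   | asset_id | label | split | fps | frame_count | duration_seconds | frame_height | frame_width | frame_channels | orginal_video_start_second | original_video_end_second | original_video_target_second | target_second | reliable_audio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 483821 | 0 | train | 25 | 327 | 9.96 | 1080 | 1920 | 3 | 63 | 73 | 68 | 5 | 1 |
| 1 | 483326 | 0 | train | 25 | 320 | 9.96 | 1080 | 1920 | 3 | 35 | 45 | 40 | 5 | 1 |
| 2 | 476723 | 0 | train | 25 | 277 | 9.96 | 1080 | 1920 | 3 | 41 | 51 | 46 | 5 | 1 |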
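
The path rule above can be applied programmatically. The following sketch reads one of the CSV files with the standard library and returns the file paths and labels for a given split; the function name media_paths and its arguments are illustrative assumptions, and the paths are relative to wherever the dataset was untarred.

```python
import csv
from pathlib import Path


def media_paths(csv_path, media_dir, extension, split):
    """Return (file path, label) pairs for one split of an SSW60 CSV file."""
    pairs = []
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            if row["split"] == split:
                # File name is the asset_id plus the media extension.
                path = Path(media_dir) / f"{row['asset_id']}{extension}"
                pairs.append((path, int(row["label"])))
    return pairs
```

For example, media_paths("video_ml.csv", "video_ml", ".mp4", "test") would list the test videos. Since the audio and image files follow the same {asset_id} naming pattern, the same helper should work for them with a different directory and extension.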

Audio

3,861 wav audio files can be found in the audio_ml/ directory. Split and metadata information can be found in the audio_ml.csv file. All audio files have 1 channel, have been converted to a sampling rate of 22050Hz, and have the .wav extension. The file path for an audio recording is constructed as audio_ml/{asset_id}.wav.

Example data from audio_ml.csv:

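|   | asset_id | label | split | samplerate | channels | samples | duration_seconds |
|---|---|---|---|---|---|---|---|
| 0 | 45516191 | 0 | train | 22050 | 1 | 220500 | 10 |
| 1 | 292755661 | 0 | train | 22050 | 1 | 220500 | 10 |
| 2 | 55046651 | 0 | train | 22050 | 1 | 220500 | 10 |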
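
A quick sanity check after extraction is to confirm that an audio file matches the stated format. The sketch below uses Python's built-in wave module and assumes the files are plain PCM WAV, which that module can read; check_audio_format is a hypothetical helper name.

```python
import wave


def check_audio_format(path, expected_rate=22050, expected_channels=1):
    """Return (matches_expected_format, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        # Duration is the number of sample frames divided by the sample rate.
        duration = wav.getnframes() / rate
    return rate == expected_rate and channels == expected_channels, duration
```

Calling check_audio_format on a file such as audio_ml/{asset_id}.wav should return True for the format check, and the duration can be compared with the duration_seconds column.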

Images

NABirds

10,221 image files from the NABirds dataset can be found in the images_nabirds/ directory. Split and metadata information can be found in the images_nabirds.csv file. All image files have a .jpg extension. The file path for an NABirds image is constructed as images_nabirds/{asset_id}.jpg. Additional annotations for these images (bounding boxes and keypoints) can be obtained by downloading the NABirds dataset.

Example data from images_nabirds.csv:

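|   | asset_id | label | split | height | width | channels | photographer |
|---|---|---|---|---|---|---|---|
| 0 | 0233251e10054daa99e6b68369e143fb | 0 | train | 1024 | 683 | 3 | David Mozzoni |
| 1 | 0a1e090d2db54cd088cb99e56bdb6cea | 0 | train | 1024 | 820 | 3 | Kelley Sampeck |
| 2 | 23cf7fde4f464920923ab00827560b75 | 0 | train | 670 | 1024 | 3 | Alex Lamoreaux |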

iNat2021

21,600 image files from the iNat2021 Competition dataset can be found in the images_inat/ directory. Split and metadata information can be found in the images_inat.csv file. Note that this set of data has a validation split in addition to the train and test splits. All image files have a .jpg extension. The file path for an iNat image is constructed as images_inat/{asset_id}.jpg.

Example data from images_inat.csv:

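|   | asset_id | label | split | height | width | channels | rights_holder | license_id |
|---|---|---|---|---|---|---|---|---|
| 0 | 73542 | 9 | train | 500 | 333 | 3 | Kent Miller | 5 |
| 1 | 740991 | 1 | train | 400 | 500 | 3 | donadwell | 1 |
| 2 | 767515 | 6 | train | 375 | 500 | 3 | Aaron Lincoln | 1 |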

Evaluation Procedure

We use top-1 classification accuracy on the video files as the primary evaluation metric for SSW60. Please use the split column in video_ml.csv to identify the video files that are marked for test. The distribution of the test videos is nearly uniform, so we use a simple form of top-1 accuracy: for each video $v$, an algorithm produces one label $l_v$, which is compared with the ground-truth label $g_v$ for that video. The per-video score is:

$$ s_v = \begin{cases} 1 & \quad \text{if } l_v = g_v \\ 0 & \quad \text{otherwise} \end{cases} $$

The overall accuracy score for an algorithm is the average accuracy over all $N$ test videos:

$$ \text{accuracy} = \frac{1}{N} \sum_{v} s_{v} $$
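
The metric is simple enough to compute in a few lines. The sketch below mirrors the two formulas above, with predicted holding the labels $l_v$ and ground_truth holding the labels $g_v$ in the same video order; the function name is ours.

```python
def top1_accuracy(predicted, ground_truth):
    """Fraction of test videos whose predicted label equals the true label."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not predicted:
        raise ValueError("at least one test video is required")
    # s_v is 1 when l_v == g_v and 0 otherwise; accuracy is the mean of s_v.
    scores = [1 if l_v == g_v else 0 for l_v, g_v in zip(predicted, ground_truth)]
    return sum(scores) / len(scores)


print(top1_accuracy([0, 1, 2, 2], [0, 1, 1, 2]))  # 3 of 4 correct -> 0.75
```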

Training Procedure

We do not enforce a specific audiovisual classification model or training procedure. This is a fast-moving research area with new ideas and datasets arriving quickly. We expect researchers to be clear and forthright in describing all data (including "pretraining data" and "pretrained backbones") they used for training their models and the steps taken to produce their final audiovisual classification network. Researchers may find it useful to pretrain their models using the accompanying audio (audio_ml.csv) and image datasets (images_nabirds.csv, images_inat.csv). If this is done, we expect the train/test splits for those datasets to be respected, i.e., we discourage using the test splits for training.

Best Results

We briefly describe the steps taken in our ECCV 2022 paper to achieve the best results on SSW60. Please see the accompanying paper for details and specifics. We train two ImageNet-pretrained ViT-B models, one to process audio and one to process images, and then combine these models through score fusion. The steps taken:

  1. Pretrain the image classifier using the images_inat dataset.
  2. Pretrain the audio classifier using the audio_ml dataset.
  3. Fine-tune the audio classifier on the training videos.
  4. Use score fusion to combine the predictions of the audio and image classifiers on the video test set.

This method achieves 80.6% top-1 accuracy on the SSW60 video test set. Note that this process did not fine-tune the image classifier on the video frames of the SSW60 video train set.
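
To make the score-fusion step concrete, here is one common form of late fusion: convert each classifier's per-class logits to probabilities and take a weighted average. This is only a sketch under our own assumptions. The averaging rule, the audio_weight parameter, and the function names are illustrative, and the exact fusion used in the paper may differ, so please consult the paper for specifics.

```python
import math


def softmax(logits):
    """Convert a list of logits to probabilities (numerically stable)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(audio_logits, image_logits, audio_weight=0.5):
    """Weighted average of the two classifiers' class probabilities."""
    if len(audio_logits) != len(image_logits):
        raise ValueError("both classifiers must score the same set of classes")
    audio_probs = softmax(audio_logits)
    image_probs = softmax(image_logits)
    return [
        audio_weight * a + (1.0 - audio_weight) * v
        for a, v in zip(audio_probs, image_probs)
    ]


def predict_label(audio_logits, image_logits, audio_weight=0.5):
    """Return the index of the class with the highest fused score."""
    fused = fuse_scores(audio_logits, image_logits, audio_weight)
    return max(range(len(fused)), key=fused.__getitem__)


# Toy 3-class example: audio favors class 0, images strongly favor class 1.
print(predict_label([2.0, 1.0, 0.0], [0.0, 3.0, 0.0]))  # prints 1
```

In practice the image classifier produces scores per frame, so some aggregation over frames (for example, averaging) is needed before fusing with the audio scores; that choice is also left open here.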

Limitations

We attempt to document some limitations of the SSW60 dataset. Our goal here is to be upfront with fellow researchers, and to provide targets for future versions of this dataset (or others) to improve upon.

Small Size

The SSW60 dataset is relatively small, and therefore may not be appropriate for training a model "from scratch." Using a pretrained model (perhaps pretrained on ImageNet for visual information or AudioSet for audio information) is an easy way to mitigate this limitation.

Visual Bias

The original intent of this dataset was to study fine-grained classification using audiovisual data ("How can we improve bird species classification if we have video + audio? What are the relative merits of each modality? For which species is a particular modality more useful for classification?" etc.). In a perfect world, each video would have relevant visual and acoustic information that can be analyzed by a model. However, the fact that we are using videos contributed by humans (i.e., someone decided to record a bird with a video camera as opposed to only a microphone) means that there is an inherent bias towards visual information in the SSW60 dataset. To put it plainly: while all videos have frames containing visuals of the bird species in question, not all videos have an audio channel, and even if they do, the audio may not be relevant for classification. We attempt to identify those videos with relevant acoustic information and use the reliable_audio column in the video_ml.csv file to track this. However, this column was machine generated and might not accurately reflect the utility of the audio channel for each video. For our best results, we train and evaluate using the audio channel for all videos regardless of the value of reliable_audio.
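
One way to probe this bias is to report test accuracy separately for each value of reliable_audio. The sketch below assumes rows are dictionaries read from video_ml.csv (for example with csv.DictReader, so all values are strings) and that predictions maps each asset_id string to a predicted integer label; both the function and the predictions structure are hypothetical.

```python
def accuracy_by_audio_reliability(rows, predictions):
    """Top-1 accuracy on test videos, grouped by the reliable_audio value."""
    totals = {}
    for row in rows:
        if row["split"] != "test":
            continue
        key = row["reliable_audio"]
        correct, count = totals.get(key, (0, 0))
        hit = predictions[row["asset_id"]] == int(row["label"])
        totals[key] = (correct + int(hit), count + 1)
    # Convert (correct, count) pairs into accuracies per group.
    return {key: correct / count for key, (correct, count) in totals.items()}
```

Because the column was machine generated, any gap between the groups should be read as a rough indication rather than a precise measurement.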

Geographically Variable Media

While all SSW60 species occur in Sapsucker Woods, not all media in SSW60 were recorded in Sapsucker Woods. This means that some training or testing media may be more geographically diverse than what is found in Sapsucker Woods. For example, some plumages might not be relevant to a bird's appearance in Sapsucker Woods, or some environments in the background of a piece of media might not resemble the woods (visually or aurally) of upstate New York.

Paper Citation

If you use the SSW60 dataset in your research, please cite:

@inproceedings{ssw602022eccv,
    author    = {Van Horn, Grant and Qian, Rui and Wilber, Kimberly and Adam, Hartwig and Mac Aodha, Oisin and Belongie, Serge},
    title     = {Exploring Fine-grained Audiovisual Categorization with the SSW60 Dataset},
    booktitle = {European Conference on Computer Vision (ECCV)},
    year      = {2022}
}

The first two authors contributed equally to this work.

Additional Bird Video Datasets

We are certainly not the first to build a video dataset focused on bird species. Please see the paper, particularly the supplementary material, for more details and comparisons.