PAV-SOD: A New Task Towards Panoramic Audiovisual Saliency Detection (TOMM 2022)

Object-level audiovisual saliency detection in 360° panoramic real-life dynamic scenes is important for exploring and modeling human perception in immersive environments, and for aiding the development of virtual, augmented and mixed reality applications in fields such as education, social networking, entertainment and training. To this end, we propose a new task, panoramic audiovisual salient object detection (PAV-SOD), which aims to segment the objects that attract the most human attention in 360° panoramic videos of real-life daily scenes. To support the task, we collect PAVS10K, the first panoramic video dataset for audiovisual salient object detection. It consists of 67 4K-resolution equirectangular videos with per-video labels, including hierarchical scene categories and associated attributes describing specific challenges for PAV-SOD, and 10,465 uniformly sampled video frames with manually annotated object-level and instance-level pixel-wise masks. The coarse-to-fine annotations enable multi-perspective analysis of PAV-SOD modeling. We further systematically benchmark 13 state-of-the-art salient object detection (SOD)/video object segmentation (VOS) methods on PAVS10K. In addition, we propose a new baseline network that exploits both the visual and audio cues of 360° video frames through a new conditional variational auto-encoder (CVAE). Our CVAE-based audiovisual network, CAV-Net, consists of a spatial-temporal visual segmentation network, a convolutional audio-encoding network and audiovisual distribution estimation modules. CAV-Net outperforms all competing models and is able to estimate the aleatoric uncertainties within PAVS10K. From extensive experimental results, we derive several findings about PAV-SOD challenges and insights into PAV-SOD model interpretability. We hope that our work can serve as a starting point for advancing SOD towards immersive media.


PAVS10K

<p align="center">
    <img src="./figures/fig_teaser.jpg"/> <br />
    <em>
    Figure 1: An example from our PAVS10K, where coarse-to-fine annotations are provided under the guidance of fixations acquired in subjective experiments with multiple (N) subjects wearing Head-Mounted Displays (HMDs) and headphones. Each of the T equirectangular (ER) video frames of the sequence “Speaking” (super-class) - “Brothers” (sub-class), e.g., fk, fl and fn with random integers {k, l, n} ∈ [1, T], is manually labeled with both object-level and instance-level pixel-wise masks. According to the features of the salient objects defined within each sequence, multiple attributes, e.g., “multiple objects” (MO), “competing sounds” (CS), “geometrical distortion” (GD), “motion blur” (MB), “occlusions” (OC) and “low resolution” (LR), are further annotated to enable detailed analysis for PAV-SOD modeling.
    </em>
</p>

<p align="center">
    <img src="./figures/fig_related_datasets.jpg"/> <br />
    <em>
    Figure 2: Summary of widely used salient object detection (SOD)/video object segmentation (VOS) datasets and PAVS10K. #Img: The number of images/video frames. #GT: The number of object-level pixel-wise masks (ground truth for SOD). Pub. = Publication. Obj.-Level = Object-Level Labels. Ins.-Level = Instance-Level Labels. Fix. GT = Fixation Maps. † denotes equirectangular images.
    </em>
</p>

<p align="center">
    <img src="./figures/fig_dataset_examples.jpg"/> <br />
    <em>
    Figure 3: Examples of challenging attributes on equirectangular images from our PAVS10K, with instance-level ground truth and fixations as annotation guidance. {𝑓𝑘, 𝑓𝑙, 𝑓𝑛} denote random frames of a given video.
    </em>
</p>

<p align="center">
    <img src="./figures/fig_dataset_statistics.jpg"/> <br />
    <em>
    Figure 4: Statistics of the proposed PAVS10K. (a) Super-/sub-category information. (b) Instance density (labeled frames per sequence) of each sub-class. (c) Sound sources of PAVS10K scenes, such as musical instruments, human instances and animals.
    </em>
</p>

Benchmark Models

| No. | Year | Pub. | Title | Links |
|:---:|:----:|:----:|:------|:-----:|
| 01 | 2019 | CVPR | Cascaded Partial Decoder for Fast and Accurate Salient Object Detection | Paper/Code |
| 02 | 2019 | CVPR | See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks | Paper/Code |
| 03 | 2019 | ICCV | Stacked Cross Refinement Network for Edge-Aware Salient Object Detection | Paper/Code |
| 04 | 2019 | ICCV | Semi-Supervised Video Salient Object Detection Using Pseudo-Labels | Paper/Code |
| 05 | 2020 | AAAI | F³Net: Fusion, Feedback and Focus for Salient Object Detection | Paper/Code |
| 06 | 2020 | AAAI | Pyramid Constrained Self-Attention Network for Fast Video Salient Object Detection | Paper/Code |
| 07 | 2020 | CVPR | Multi-scale Interactive Network for Salient Object Detection | Paper/Code |
| 08 | 2020 | CVPR | Label Decoupling Framework for Salient Object Detection | Paper/Code |
| 09 | 2020 | ECCV | Highly Efficient Salient Object Detection with 100K Parameters | Paper/Code |
| 10 | 2020 | ECCV | Suppress and Balance: A Simple Gated Network for Salient Object Detection | Paper/Code |
| 11 | 2020 | BMVC | Making a Case for 3D Convolutions for Object Segmentation in Videos | Paper/Code |
| 12 | 2020 | SPL | FANet: Features Adaptation Network for 360° Omnidirectional Salient Object Detection | Paper/Code |
| 13 | 2021 | CVPR | Reciprocal Transformations for Unsupervised Video Object Segmentation | Paper/Code |

CAV-Net

The code is available in src.

The pre-trained models can be downloaded at Google Drive.


Dataset Downloads

The complete object-/instance-level ground truth, with the default split, can be downloaded from Google Drive.

The videos (with ambisonics) with default split can be downloaded from Google Drive.

The videos (with mono sound) can be downloaded from Google Drive.

The audio files (.wav) can be downloaded from Google Drive.

The head movement and eye fixation data can be downloaded from Google Drive.

To generate video frames, please refer to video_to_frames.py.

To get access to raw videos on YouTube, please refer to video_seq_link.
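
For readers who want to see the general idea behind frame generation before opening video_to_frames.py, the following is a minimal sketch of uniform frame sampling with OpenCV. It is not the repository's script: the function names, the `step` parameter and the output file naming are assumptions made for illustration, and the sampling interval used to build PAVS10K is not stated here.

```python
from pathlib import Path
from typing import List


def uniform_frame_indices(total_frames: int, step: int) -> List[int]:
    """Return the indices 0, step, 2*step, ... that are below total_frames."""
    if total_frames < 0 or step <= 0:
        raise ValueError("total_frames must be >= 0 and step must be > 0")
    return list(range(0, total_frames, step))


def extract_frames(video_path: str, out_dir: str, step: int = 1) -> int:
    """Save every `step`-th frame of a video as a PNG and return the count saved.

    Hypothetical helper: file names and the PNG format are illustrative choices.
    """
    # Third-party dependency (opencv-python), imported lazily so that
    # uniform_frame_indices stays usable without it.
    import cv2

    if step <= 0:
        raise ValueError("step must be > 0")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"cannot open video: {video_path}")

    saved = 0
    index = 0
    while True:
        ok, frame = cap.read()  # ok becomes False once the stream ends
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(str(out / f"frame_{index:06d}.png"), frame)
            saved += 1
        index += 1

    cap.release()
    return saved
```

With placeholder paths, a call such as `extract_frames("video.mp4", "frames", step=6)` would write every sixth frame to the `frames` directory. Frames are read sequentially rather than by seeking, since seeking by index can be inexact for some codecs.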

Note: The PAVS10K dataset does not own the copyright of the videos. Access to PAVS10K is limited to researchers and educators who wish to use the videos for non-commercial research and/or educational purposes.


Citation

@article{zhang2023pav,
  title={PAV-SOD: A New Task towards Panoramic Audiovisual Saliency Detection},
  author={Zhang, Yi and Chao, Fang-Yi and Hamidouche, Wassim and Deforges, Olivier},
  journal={ACM Transactions on Multimedia Computing, Communications and Applications},
  volume={19},
  number={3},
  pages={1--26},
  year={2023},
  publisher={ACM New York, NY}
}

Contact

yi23zhang.2022@gmail.com or fangyichao428@gmail.com (for details of head movement and eye fixation data).