
Salient Object Detection in RGB-D Videos (RDVS dataset and DCTNet+ model, accepted to IEEE TIP)

This site is still under construction...

Code for paper: Salient Object Detection in RGB-D Videos [arXiv]

Table of Contents

1 Task Relationship

<p align="center"> <img src="figures/rgbdvsod.png" width="70%" /> <br /> <em> Figure 1: Due to the limitation of using a single RGB/color modality (image) for SOD (termed RGB SOD), researchers have integrated scene depth information into the SOD task, often referred to as RGB-D SOD. Meanwhile, extending still images to the temporal case yields the video SOD (VSOD) task. We target at the RGB-D VSOD task, which can be deemed as extension from the prevalent RGB-D SOD and VSOD tasks. </em> </p>

To delve into this potential task, and as one of the earliest works towards RGB-D VSOD, we contribute in two distinct aspects: 1) the dataset, and 2) the model.

2 Proposed Dataset: RDVS

We propose a new RGB-D video salient object detection dataset with realistic depth information, named RDVS for short. RDVS contains 57 sequences, totaling 4,087 frames, and its annotation process is rigorously guided by gaze data captured with a professional eye-tracker. The collected video clips encompass various challenging scenarios, e.g., complex backgrounds, low contrast, occlusion, and heterogeneous objects. We also provide training and testing splits. Download RDVS from "RDVS Dataset" below.
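
For reference, below is a minimal sketch of iterating over the downloaded dataset. The folder names ("RGB", "Depth", "GT") and file extensions are assumptions about the released layout, not guaranteed by the repository, and may need to be adapted.

```python
# A sketch of walking the RDVS directory, assuming each sequence folder holds
# "RGB", "Depth" and "GT" sub-folders with frame-aligned file names.
from pathlib import Path

def iter_rdvs_sequences(root: str):
    """Yield (sequence_name, rgb_paths, depth_paths, gt_paths) per sequence."""
    for seq_dir in sorted(Path(root).iterdir()):
        if not seq_dir.is_dir():
            continue
        rgb = sorted((seq_dir / "RGB").glob("*.jpg"))      # color frames (assumed .jpg)
        depth = sorted((seq_dir / "Depth").glob("*.png"))  # realistic depth maps (assumed .png)
        gt = sorted((seq_dir / "GT").glob("*.png"))        # binary saliency masks (assumed .png)
        assert len(rgb) == len(depth) == len(gt), f"misaligned frames in {seq_dir.name}"
        yield seq_dir.name, rgb, depth, gt

if __name__ == "__main__":
    # With the full dataset this should report 57 sequences and 4,087 frames in total.
    seqs = list(iter_rdvs_sequences("./dataset/RDVS"))
    print(len(seqs), "sequences,", sum(len(r) for _, r, _, _ in seqs), "frames")
```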

<p align="center"> <img src="figures/githubRDVS.png" width="100%" /> <br /> <em> Figure 2 shows Statistics of the proposed RDVS dataset. (a) Attribute-based analyses of RDVS with comparison to DAVIS. (b) The pairwise dependencies across different attributes. (c) Scene/object categories of RDVS. (d) Center bias of RDVS and existing VSOD datasets. <br /> </em> </p> &nbsp; <p align="center"> <img src="figures/Fig_fixation.png" width="100%" /> <br /> <em> Figure 3: Illustrative frames (with depth in the bottom-right) from RDVS with fixations (red dots, the top row) and the corresponding continuous saliency maps (overlaying on the RGB frames, the bottom row). </em> </p> &nbsp;

Watch the video

<p align="center"> <em> Click the above figure to watch saliency shift of all sequences in RDVS dataset (YouTube Link) </em> </p>

3 Proposed Model: DCTNet+

3.1 Overview

<p align="center"> <img src="figures/Overview.png" width="100%" /> <br /> <em> Figure 4. Overview of DCTNet+. (a) shows the big picture. (b) and (c) show the details of MAM and RFM, respectively. In the attention operations on the right-hand side in (c), since the coordinate attention and spatial attention processes are similar, the operations of spatial attention are represented in parentheses and are not repeated. </em> </p>

3.2 Usage

  1. Requirements

    • Python 3.9
    • PyTorch 1.12.1
    • Torchvision 0.13.1
    • CUDA 11.6
  2. Training

    • Download the pretrained ResNet34 backbone from Baidu Pan | Google Drive and save it to './model/resnet/pre_train/'.

    • Download the training set (containing DAVIS16, DAVSOD, FBMS and DUTS-TR) from "Training set and test set" and save it to './dataset/train/*'.

    • Download the pretrained RGB, depth and flow stream models from Baidu Pan | Google Drive and save them to './checkpoints/'.

      • Note: the pretrained RGB stream should be saved at './checkpoints/spatial', the pretrained depth stream at './checkpoints/depth', and the pretrained flow stream at './checkpoints/flow'.
    • Training of the entire DCTNet+ was accelerated with one NVIDIA RTX 3090 GPU.

      • Run python train.py in the terminal.
    • (PS: for pretraining the individual streams)

      • The pretraining code for the individual streams can be derived from train.py. We provide pretrain_depth.py, which can also be modified to pretrain the other two streams.
  3. Testing

    • Download the test data (containing DAVIS16, DAVSOD, FBMS, SegTrack-V2, VOS) from "Training set and test set" and save it to './dataset/test/*'.
    • Download the trained model from "DCTNet+ model" (original model ckpt) and set model_path in test.py to its saved location.
    • Run python test.py in the terminal (a small pre-flight check of the folder layout above is sketched below).
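
Before launching train.py or test.py, it can help to verify the folder layout described in the steps above. The following is a minimal pre-flight sketch, not part of the repository, assuming the relative paths listed in the Training and Testing steps:

```python
# Pre-flight sketch: check the expected folders exist, then run training and testing.
import os
import subprocess

REQUIRED_PATHS = [
    "./model/resnet/pre_train",  # pretrained ResNet34 backbone
    "./checkpoints/spatial",     # pretrained RGB stream
    "./checkpoints/depth",       # pretrained depth stream
    "./checkpoints/flow",        # pretrained flow stream
    "./dataset/train",           # DAVIS16, DAVSOD, FBMS, DUTS-TR
    "./dataset/test",            # DAVIS16, DAVSOD, FBMS, SegTrack-V2, VOS
]

missing = [p for p in REQUIRED_PATHS if not os.path.isdir(p)]
if missing:
    raise SystemExit("Missing folders:\n" + "\n".join(missing))

# Train, then test, exactly as in the steps above.
subprocess.run(["python", "train.py"], check=True)
subprocess.run(["python", "test.py"], check=True)
```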

4 Downloads

4.1 RDVS dataset

4.2 DCTNet+ model

4.3 Training set and test set

4.4 Saliency Maps on RDVS dataset

4.5 Saliency Maps on five benchmark datasets (pseudo RGB-D video datasets)

5 Results

5.1 Quantitative comparison on 5 benchmark datasets

<em> Table 1. Quantitative comparison with state-of-the-art VSOD methods on 5 benchmark datasets. The top three results are shown in red, green, and blue, respectively. ↑/↓ denotes that the larger/smaller value is better. The symbol "**" means that results are not available. </em> <p align="center"> <img src="figures/sota.png" width="100%" /> <br /> </p>
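
As an illustration of the "↓"-type scores, the snippet below computes the mean absolute error (MAE), a widely used SOD/VSOD measure, between a predicted saliency map and its ground-truth mask. It is a minimal sketch for sanity checking, not the evaluation toolkit used to produce the table, and it assumes both maps are single-channel images of the same resolution.

```python
# MAE between a predicted saliency map and a ground-truth mask (lower is better).
import numpy as np
from PIL import Image

def mae(pred_path: str, gt_path: str) -> float:
    # Load both images as grayscale and normalize to [0, 1].
    pred = np.asarray(Image.open(pred_path).convert("L"), dtype=np.float64) / 255.0
    gt = np.asarray(Image.open(gt_path).convert("L"), dtype=np.float64) / 255.0
    return float(np.abs(pred - gt).mean())
```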

5.2 Qualitative comparison

<p align="center"> <img src="figures/VisualComparison.png" width="100%" /> <br /> <em> Figure 5. Qualitative comparison of our DCTNet+ model and SOTA methods on conventional VSOD benchmarks. </em> </p>

5.3 Straightforward evaluation on the full RDVS dataset

<em> Table 2. Results of SOTA methods from different fields and the proposed method on the full RDVS dataset, where the suffix "⋇" indicates RGB-D VSOD techniques, and ↑/↓ denotes that the larger/smaller value is better. The best results are highlighted in bold. </em> <p align="center"> <img src="figures/sotaOnRDVS.png" width="70%" /> <br /> </p> &nbsp; <p align="center"> <img src="figures/RDVSVisual.png" width="100%" /> <br /> <em> Figure 6. Qualitative comparison on the proposed RDVS dataset. </em> </p>

5.4 Evaluation on RDVS test set after fine-tuning

<em> Table 3. Results of SOTA methods from different fields as well as the proposed method on the RDVS testing set. The left half shows the results of the original models applied directly to the RDVS testing set, and the right half shows the results obtained by re-training the models consistently on the RDVS training set and then evaluating them on the RDVS testing set. The best results are highlighted in bold. </em> <p align="center"> <img src="figures/RDVStestset.png" width="70%" /> <br /> </p>

5.5 Synthetic depth vs. realistic depth

<em> Table 4. Experimental results comparing synthetic and realistic depth maps by applying the original models to the full RDVS dataset. The best results are highlighted in bold. </em> <p align="center"> <img src="figures/synvsreal1.png" width="70%" /> <br /> </p> <em> Table 5. Experimental results comparing synthetic and realistic depth maps by fine-tuning the models on the RDVS training set. The best results are highlighted in bold. </em> <p align="center"> <img src="figures/synvsreal2.png" width="70%" /> <br /> </p>

6 Citation

Please cite our paper if you find the work useful:

@inproceedings{lu2022depth,
  title={Depth-cooperated trimodal network for video salient object detection},
  author={Lu, Yukang and Min, Dingyao and Fu, Keren and Zhao, Qijun},
  booktitle={2022 IEEE International Conference on Image Processing (ICIP)},
  pages={116--120},
  year={2022},
  organization={IEEE}
}

@misc{mou2023RDVS,
      title={Salient Object Detection in RGB-D Videos}, 
      author={Ao Mou and Yukang Lu and Jiahao He and Dingyao Min and Keren Fu and Qijun Zhao},
      year={2023},
      eprint={2310.15482},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

7 Reference

We sincerely thank MPI Sintel, WSVD, Stereo Ego-Motion, SBM-RGBD and TUM-RGBD for their outstanding dataset contributions!

@inproceedings{butler2012naturalistic,
  title={A naturalistic open source movie for optical flow evaluation},
  author={Butler, Daniel J and Wulff, Jonas and Stanley, Garrett B and Black, Michael J},
  booktitle={Computer Vision--ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VI 12},
  pages={611--625},
  year={2012},
  organization={Springer}
}

@inproceedings{wang2019web,
  title={Web stereo video supervision for depth prediction from dynamic scenes},
  author={Wang, Chaoyang and Lucey, Simon and Perazzi, Federico and Wang, Oliver},
  booktitle={2019 International Conference on 3D Vision (3DV)},
  pages={348--357},
  year={2019},
  organization={IEEE}
}

@misc{stereego,
  title={Stereo Ego-Motion dataset},
  howpublished={\url{https://lmb.informatik.uni-freiburg.de/resources/datasets/StereoEgomotion/}}
}

@inproceedings{camplani2017benchmarking,
  title={A benchmarking framework for background subtraction in RGBD videos},
  author={Camplani, Massimo and Maddalena, Lucia and Moy{\'a} Alcover, Gabriel and Petrosino, Alfredo and Salgado, Luis},
  booktitle={New Trends in Image Analysis and Processing--ICIAP 2017: ICIAP International Workshops, WBICV, SSPandBE, 3AS, RGBD, NIVAR, IWBAAS, and MADiMa 2017, Catania, Italy, September 11-15, 2017, Revised Selected Papers 19},
  pages={219--229},
  year={2017},
  organization={Springer}
}

@inproceedings{sturm2012benchmark,
  title={A benchmark for the evaluation of RGB-D SLAM systems},
  author={Sturm, J{\"u}rgen and Engelhard, Nikolas and Endres, Felix and Burgard, Wolfram and Cremers, Daniel},
  booktitle={2012 IEEE/RSJ international conference on intelligent robots and systems},
  pages={573--580},
  year={2012},
  organization={IEEE}
}

⬆ back to top