
Salient Object Detection in RGB-D Videos (RDVS dataset and DCTNet+ model, accepted to IEEE TIP)

This site is still under construction...

Code for paper: Salient Object Detection in RGB-D Videos [arXiv]

Table of Contents

1 Task Relationship

<p align="center"> <img src="figures/rgbdvsod.png" width="70%" /> <br /> <em> Figure 1: Due to the limitation of using a single RGB/color modality (image) for SOD (termed RGB SOD), researchers have integrated scene depth information into the SOD task, often referred to as RGB-D SOD. Meanwhile, extending still images to the temporal case yields the video SOD (VSOD) task. We target at the RGB-D VSOD task, which can be deemed as extension from the prevalent RGB-D SOD and VSOD tasks. </em> </p>

To delve into this potential task, and as one of the earliest works towards RGB-D VSOD, we contribute in two distinct aspects: 1) the dataset, and 2) the model.

2 Proposed Dataset: RDVS

We propose a new RGB-D video salient object detection dataset with realistic depth information, named RDVS for short. RDVS contains 57 sequences, totaling 4,087 frames, and its annotation process is rigorously guided by gaze data captured with a professional eye-tracker. The collected video clips encompass various challenging scenarios, e.g., complex backgrounds, low contrast, occlusion, and heterogeneous objects. We also provide training and testing splits. Download RDVS from "RDVS Dataset" below.
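
For reference, below is a minimal sketch of iterating over the downloaded dataset. The folder names ("RGB", "Depth", "GT") and file extensions are assumptions about the released layout, not guaranteed by the repository, and may need to be adapted.

```python
# A sketch of walking the RDVS directory, assuming each sequence folder holds
# "RGB", "Depth" and "GT" sub-folders with frame-aligned file names.
from pathlib import Path

def iter_rdvs_sequences(root: str):
    """Yield (sequence_name, rgb_paths, depth_paths, gt_paths) per sequence."""
    for seq_dir in sorted(Path(root).iterdir()):
        if not seq_dir.is_dir():
            continue
        rgb = sorted((seq_dir / "RGB").glob("*.jpg"))      # color frames (assumed .jpg)
        depth = sorted((seq_dir / "Depth").glob("*.png"))  # realistic depth maps (assumed .png)
        gt = sorted((seq_dir / "GT").glob("*.png"))        # binary saliency masks (assumed .png)
        assert len(rgb) == len(depth) == len(gt), f"misaligned frames in {seq_dir.name}"
        yield seq_dir.name, rgb, depth, gt

if __name__ == "__main__":
    # With the full dataset this should report 57 sequences and 4,087 frames in total.
    seqs = list(iter_rdvs_sequences("./dataset/RDVS"))
    print(len(seqs), "sequences,", sum(len(r) for _, r, _, _ in seqs), "frames")
```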

<p align="center"> <img src="figures/githubRDVS.png" width="100%" /> <br /> <em> Figure 2 shows Statistics of the proposed RDVS dataset. (a) Attribute-based analyses of RDVS with comparison to DAVIS. (b) The pairwise dependencies across different attributes. (c) Scene/object categories of RDVS. (d) Center bias of RDVS and existing VSOD datasets. <br /> </em> </p> &nbsp; <p align="center"> <img src="figures/Fig_fixation.png" width="100%" /> <br /> <em> Figure 3: Illustrative frames (with depth in the bottom-right) from RDVS with fixations (red dots, the top row) and the corresponding continuous saliency maps (overlaying on the RGB frames, the bottom row). </em> </p> &nbsp;

Watch the video

<p align="center"> <em> Click the above figure to watch saliency shift of all sequences in RDVS dataset (YouTube Link) </em> </p>

3 Proposed Model: DCTNet+

3.1 Overview

<p align="center"> <img src="figures/Overview.png" width="100%" /> <br /> <em> Figure 4. Overview of DCTNet+. (a) shows the big picture. (b) and (c) show the details of MAM and RFM, respectively. In the attention operations on the right-hand side in (c), since the coordinate attention and spatial attention processes are similar, the operations of spatial attention are represented in parentheses and are not repeated. </em> </p>

3.2 Usage

  1. Requirements

    • Python 3.9
    • PyTorch 1.12.1
    • Torchvision 0.13.1
    • CUDA 11.6
  2. Training

    • Download the pretrained ResNet34 backbone from Baidu Pan | Google Drive and save it to './model/resnet/pre_train/'.

    • Download the training set (containing DAVIS16, DAVSOD, FBMS and DUTS-TR) from "Training set and test set" and save it to './dataset/train/*'.

    • Download the pretrained RGB, depth and flow stream models from Baidu Pan | Google Drive and save them to './checkpoints/'.

      • Note: the pretrained RGB stream should be saved at './checkpoints/spatial', the pretrained depth stream at './checkpoints/depth', and the pretrained flow stream at './checkpoints/flow'.
    • Training of the entire DCTNet+ was accelerated with one NVIDIA RTX 3090 GPU.

      • Run python train.py in the terminal.
    • (PS: for pretraining the individual streams)

      • The pretraining code for the individual streams can be derived from train.py. We provide pretrain_depth.py, which can also be modified to pretrain the other two streams.
  3. Testing

    • Download the test data (containing DAVIS16, DAVSOD, FBMS, SegTrack-V2, VOS) from "Training set and test set" and save it to './dataset/test/*'.
    • Download the trained model from "DCTNet+ model" (original model ckpt) and set model_path in test.py to its saved location.
    • Run python test.py in the terminal (a small pre-flight check of the folder layout above is sketched below).
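
Before launching train.py or test.py, it can help to verify the folder layout described in the steps above. The following is a minimal pre-flight sketch, not part of the repository, assuming the relative paths listed in the Training and Testing steps:

```python
# Pre-flight sketch: check the expected folders exist, then run training and testing.
import os
import subprocess

REQUIRED_PATHS = [
    "./model/resnet/pre_train",  # pretrained ResNet34 backbone
    "./checkpoints/spatial",     # pretrained RGB stream
    "./checkpoints/depth",       # pretrained depth stream
    "./checkpoints/flow",        # pretrained flow stream
    "./dataset/train",           # DAVIS16, DAVSOD, FBMS, DUTS-TR
    "./dataset/test",            # DAVIS16, DAVSOD, FBMS, SegTrack-V2, VOS
]

missing = [p for p in REQUIRED_PATHS if not os.path.isdir(p)]
if missing:
    raise SystemExit("Missing folders:\n" + "\n".join(missing))

# Train, then test, exactly as in the steps above.
subprocess.run(["python", "train.py"], check=True)
subprocess.run(["python", "test.py"], check=True)
```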

4 Downloads

4.1 RDVS dataset

4.2 DCTNet+ model

4.3 Training set and test set

4.4 Saliency Maps on RDVS dataset

4.5 Saliency Maps on five benchmark datasets (pseudo RGB-D video datasets)

5 Results

5.1 Quantitative comparison on 5 benchmark datasets

<em> Table 1. Quantitative comparison with state-of-the-art VSOD methods on 5 benchmark datasets. The top three results are shown in red, green, and blue, respectively. ↑/↓ denotes that the larger/smaller value is better. The symbol "**" means that results are not available. </em> <p align="center"> <img src="figures/sota.png" width="100%" /> <br /> </p>
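
As an illustration of the "↓"-type scores, the snippet below computes the mean absolute error (MAE), a widely used SOD/VSOD measure, between a predicted saliency map and its ground-truth mask. It is a minimal sketch for sanity checking, not the evaluation toolkit used to produce the table, and it assumes both maps are single-channel images of the same resolution.

```python
# MAE between a predicted saliency map and a ground-truth mask (lower is better).
import numpy as np
from PIL import Image

def mae(pred_path: str, gt_path: str) -> float:
    # Load both images as grayscale and normalize to [0, 1].
    pred = np.asarray(Image.open(pred_path).convert("L"), dtype=np.float64) / 255.0
    gt = np.asarray(Image.open(gt_path).convert("L"), dtype=np.float64) / 255.0
    return float(np.abs(pred - gt).mean())
```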

5.2 Qualitative comparison

<p align="center"> <img src="figures/VisualComparison.png" width="100%" /> <br /> <em> Figure 5. Qualitative comparison of our DCTNet+ model and SOTA methods on conventional VSOD benchmarks. </em> </p>

5.3 Straightforward evaluation on the full RDVS dataset

<em> Table 2. Results of SOTA methods from different fields and the proposed method on the full RDVS dataset, where the suffix "⋇" indicates RGB-D VSOD techniques, and ↑/↓ denotes that the larger/smaller value is better. The best results are highlighted in bold. </em> <p align="center"> <img src="figures/sotaOnRDVS.png" width="70%" /> <br /> </p> &nbsp; <p align="center"> <img src="figures/RDVSVisual.png" width="100%" /> <br /> <em> Figure 6. Qualitative comparison on the proposed RDVS dataset. </em> </p>

5.4 Evaluation on RDVS test set after fine-tuning

<em> Table 3. Results of SOTA methods from different fields as well as the proposed method on the RDVS testing set. The left half shows the results of the original models applied directly to the RDVS testing set, and the right half shows the results obtained by re-training the models consistently on the RDVS training set and then evaluating them on the RDVS testing set. The best results are highlighted in bold. </em> <p align="center"> <img src="figures/RDVStestset.png" width="70%" /> <br /> </p>

5.5 Synthetic depth vs. realistic depth

<em> Table 4. Experimental results comparing synthetic and realistic depth maps by applying the original models to the full RDVS dataset. The best results are highlighted in bold. </em> <p align="center"> <img src="figures/synvsreal1.png" width="70%" /> <br /> </p> <em> Table 5. Experimental results comparing synthetic and realistic depth maps by fine-tuning the models on the RDVS training set. The best results are highlighted in bold. </em> <p align="center"> <img src="figures/synvsreal2.png" width="70%" /> <br /> </p>

6 Citation

Please cite our paper if you find the work useful:

@inproceedings{lu2022depth,
  title={Depth-cooperated trimodal network for video salient object detection},
  author={Lu, Yukang and Min, Dingyao and Fu, Keren and Zhao, Qijun},
  booktitle={2022 IEEE International Conference on Image Processing (ICIP)},
  pages={116--120},
  year={2022},
  organization={IEEE}
}

@misc{mou2023RDVS,
      title={Salient Object Detection in RGB-D Videos}, 
      author={Ao Mou and Yukang Lu and Jiahao He and Dingyao Min and Keren Fu and Qijun Zhao},
      year={2023},
      eprint={2310.15482},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

7 Reference

We sincerely thank MPI Sintel, WSVD, Stereo Ego-Motion, SBM-RGBD and TUM-RGBD for their outstanding dataset contributions!

@inproceedings{butler2012naturalistic,
  title={A naturalistic open source movie for optical flow evaluation},
  author={Butler, Daniel J and Wulff, Jonas and Stanley, Garrett B and Black, Michael J},
  booktitle={Computer Vision--ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VI 12},
  pages={611--625},
  year={2012},
  organization={Springer}
}

@inproceedings{wang2019web,
  title={Web stereo video supervision for depth prediction from dynamic scenes},
  author={Wang, Chaoyang and Lucey, Simon and Perazzi, Federico and Wang, Oliver},
  booktitle={2019 International Conference on 3D Vision (3DV)},
  pages={348--357},
  year={2019},
  organization={IEEE}
}

@misc{stereego,
  title={Stereo Ego-Motion dataset},
  howpublished={\url{https://lmb.informatik.uni-freiburg.de/resources/datasets/StereoEgomotion/}}
}

@inproceedings{camplani2017benchmarking,
  title={A benchmarking framework for background subtraction in RGBD videos},
  author={Camplani, Massimo and Maddalena, Lucia and Moy{\'a} Alcover, Gabriel and Petrosino, Alfredo and Salgado, Luis},
  booktitle={New Trends in Image Analysis and Processing--ICIAP 2017: ICIAP International Workshops, WBICV, SSPandBE, 3AS, RGBD, NIVAR, IWBAAS, and MADiMa 2017, Catania, Italy, September 11-15, 2017, Revised Selected Papers 19},
  pages={219--229},
  year={2017},
  organization={Springer}
}

@inproceedings{sturm2012benchmark,
  title={A benchmark for the evaluation of RGB-D SLAM systems},
  author={Sturm, J{\"u}rgen and Engelhard, Nikolas and Endres, Felix and Burgard, Wolfram and Cremers, Daniel},
  booktitle={2012 IEEE/RSJ international conference on intelligent robots and systems},
  pages={573--580},
  year={2012},
  organization={IEEE}
}

⬆ back to top