DCTNet: Depth-Cooperated Trimodal Network for Video Salient Object Detection

This repository provides the PyTorch implementation of DCTNet: Depth-Cooperated Trimodal Network for Video Salient Object Detection [arXiv]

News: The new RDVS dataset, which contains videos with realistic (rather than synthesized) depth, is coming!

<p align="center"> <img src="pictures/Fig_overview.png" width="100%"/> <br /> <em> Overview of DCTNet. (a) shows the big picture. (b) and (c) show the details of MAM and RFM, respectively. </em> </p> <p align="center"> <img src="pictures/Fig_visual_compare_nodepth.png" width="70%"/> <br /> <em> Effectiveness of leveraging depth to assist VSOD. OF denotes optical flow, and GT represents ground truth. <br /> Columns (e) and (f) are predictions from our full model (with depth) and its variant (without depth), respectively. </em> </p>

Requirements

Usage

Training

  1. Download the pretrained ResNet34 backbone to './model/resnet/pre_train/'.
  2. Download the training dataset (containing DAVIS16, DAVSOD, FBMS and DUTS-TR) from Baidu Drive (PSW: 7yer) and save it at './dataset/train/*'.
  3. Follow the instructions of RAFT to prepare the optical flow maps, and the instructions of DPT to prepare the synthetic depth maps. (Both are also available from our dataset link.) Hedged generation sketches are given after this list.
  4. Download the pretrained RGB, depth, and flow stream models from Baidu Drive (PSW: 8lux) to './checkpoints/'.
  5. The entire DCTNet is trained on two NVIDIA TITAN X GPUs.
    • Run `CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 train.py` in the terminal.

(PS: pretraining of the individual RGB, depth, and flow streams follows a similar procedure.)
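For reference, the sketches below show one way to generate the two auxiliary modalities for a pair of frames. They are hedged illustrations rather than the authors' exact pipeline: the flow sketch uses the RAFT model bundled with torchvision (>= 0.13) instead of the official RAFT repository, the depth sketch uses the DPT model exposed through the `intel-isl/MiDaS` torch.hub entry point (which requires `timm`), and file names such as `frame_000.jpg` are placeholders.

```python
# Sketch 1: optical flow with RAFT via torchvision (assumption: torchvision >= 0.13;
# the paper's pipeline uses the official RAFT repository instead).
import torch
import torchvision.transforms.functional as TF
from torchvision.io import read_image
from torchvision.models.optical_flow import Raft_Large_Weights, raft_large
from torchvision.utils import flow_to_image, save_image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval().to(device)

# RAFT expects batched frames whose height and width are multiples of 8.
frame1 = TF.resize(read_image("frame_000.jpg"), [480, 856]).unsqueeze(0)
frame2 = TF.resize(read_image("frame_001.jpg"), [480, 856]).unsqueeze(0)
frame1, frame2 = weights.transforms()(frame1, frame2)

with torch.no_grad():
    # The model returns one flow estimate per refinement iteration; keep the last.
    flow = model(frame1.to(device), frame2.to(device))[-1]

save_image(flow_to_image(flow).float() / 255.0, "frame_000_flow.png")
```

```python
# Sketch 2: synthetic depth with DPT via the MiDaS torch.hub entry point
# (assumption: the intel-isl/MiDaS hub models; `pip install timm` is needed).
import cv2
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.hub.load("intel-isl/MiDaS", "DPT_Large").eval().to(device)
transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform

img = cv2.cvtColor(cv2.imread("frame_000.jpg"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    pred = model(transform(img).to(device))
    # Upsample the inverse-depth prediction back to the frame resolution.
    pred = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()

depth = pred.cpu().numpy()
depth = 255 * (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)
cv2.imwrite("frame_000_depth.png", depth.astype("uint8"))
```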

Testing

  1. Download the test data (containing DAVIS16, DAVSOD, FBMS, SegTrack-V2, VOS) from Baidu Drive (PSW: 8uh3) and save it at './dataset/test/*'.

  2. Download the trained model from Baidu Drive (PSW: lze1) and set model_path in test.py to the path where the model is saved.

  3. Run `python test.py` in the terminal.

In addition to the model reported in the paper, we also provide variants trained with two other dataset combinations: DAVIS + FBMS and DAVIS + DAVSOD.

DAVIS + FBMS

Models marked with "*" are traditional (non-deep-learning) methods; MGAN and FSNet are trained and fine-tuned on DAVIS and FBMS. In the tables, maxF is the maximum F-measure (higher is better), S-measure is the structure measure (higher is better), and MAE is the mean absolute error (lower is better). The comparison results are below. Download the trained model from Baidu Drive (PSW: l3q2).

| Datasets | Metrics | MSTM* | STBP* | SFLR* | SCOM* | MGAN | FSNet | Ours |
| :-- | :-- | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| DAVIS | maxF | 0.395 | 0.485 | 0.698 | 0.746 | 0.893 | 0.907 | 0.912 |
| | S-measure | 0.566 | 0.651 | 0.771 | 0.814 | 0.913 | 0.920 | 0.924 |
| | MAE | 0.174 | 0.105 | 0.060 | 0.055 | 0.022 | 0.020 | 0.014 |
| DAVSOD | maxF | 0.347 | 0.408 | 0.482 | 0.473 | 0.662 | 0.685 | 0.691 |
| | S-measure | 0.530 | 0.563 | 0.622 | 0.603 | 0.757 | 0.773 | 0.782 |
| | MAE | 0.214 | 0.165 | 0.136 | 0.219 | 0.079 | 0.072 | 0.068 |
| FBMS | maxF | 0.500 | 0.595 | 0.660 | 0.797 | 0.909 | 0.888 | 0.909 |
| | S-measure | 0.613 | 0.627 | 0.699 | 0.794 | 0.912 | 0.890 | 0.916 |
| | MAE | 0.177 | 0.152 | 0.117 | 0.079 | 0.026 | 0.041 | 0.024 |
| SegTrack-V2 | maxF | 0.526 | 0.640 | 0.745 | 0.764 | 0.840 | 0.806 | 0.826 |
| | S-measure | 0.643 | 0.735 | 0.804 | 0.815 | 0.895 | 0.870 | 0.887 |
| | MAE | 0.114 | 0.061 | 0.037 | 0.030 | 0.024 | 0.025 | 0.034 |
| VOS | maxF | 0.567 | 0.526 | 0.546 | 0.690 | 0.743 | 0.659 | 0.764 |
| | S-measure | 0.657 | 0.576 | 0.624 | 0.712 | 0.807 | 0.703 | 0.831 |
| | MAE | 0.144 | 0.163 | 0.145 | 0.162 | 0.069 | 0.103 | 0.061 |

DAVIS + DAVSOD

SSAV, PCSA, and TENet are trained and fine-tuned on DAVIS and DAVSOD. The comparison results are below ('--' denotes results that are not available). Download the trained model from Baidu Drive (PSW: srwu).

| Datasets | Metrics | MSTM* | STBP* | SFLR* | SCOM* | SSAV | PCSA | TENet | Ours |
| :-- | :-- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| DAVIS | maxF | 0.395 | 0.485 | 0.698 | 0.746 | 0.861 | 0.880 | 0.894 | 0.904 |
| | S-measure | 0.566 | 0.651 | 0.771 | 0.814 | 0.893 | 0.902 | 0.905 | 0.917 |
| | MAE | 0.174 | 0.105 | 0.060 | 0.055 | 0.028 | 0.022 | 0.021 | 0.016 |
| DAVSOD | maxF | 0.347 | 0.408 | 0.482 | 0.473 | 0.603 | 0.656 | 0.648 | 0.695 |
| | S-measure | 0.530 | 0.563 | 0.622 | 0.603 | 0.724 | 0.741 | 0.753 | 0.778 |
| | MAE | 0.214 | 0.165 | 0.136 | 0.219 | 0.092 | 0.086 | 0.078 | 0.069 |
| FBMS | maxF | 0.500 | 0.595 | 0.660 | 0.797 | 0.865 | 0.837 | 0.887 | 0.883 |
| | S-measure | 0.613 | 0.627 | 0.699 | 0.794 | 0.879 | 0.868 | 0.910 | 0.886 |
| | MAE | 0.177 | 0.152 | 0.117 | 0.079 | 0.040 | 0.040 | 0.027 | 0.032 |
| SegTrack-V2 | maxF | 0.526 | 0.640 | 0.745 | 0.764 | 0.798 | 0.811 | -- | 0.839 |
| | S-measure | 0.643 | 0.735 | 0.804 | 0.815 | 0.851 | 0.866 | -- | 0.886 |
| | MAE | 0.114 | 0.061 | 0.037 | 0.030 | 0.023 | 0.024 | -- | 0.014 |
| VOS | maxF | 0.567 | 0.526 | 0.546 | 0.690 | 0.742 | 0.747 | -- | 0.772 |
| | S-measure | 0.657 | 0.576 | 0.624 | 0.712 | 0.819 | 0.828 | -- | 0.837 |
| | MAE | 0.144 | 0.163 | 0.145 | 0.162 | 0.074 | 0.065 | -- | 0.058 |

For evaluation:

  1. The saliency maps can be downloaded from Baidu Drive (PSW: wfqc).
  2. Evaluation toolbox: we use the standard evaluation toolbox from the DAVSOD benchmark (a minimal MAE sketch is given below).
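For orientation, MAE (reported in the tables above) is simply the mean absolute difference between a normalized saliency map and its ground-truth mask. A minimal sketch, assuming prediction and ground truth are stored as same-size 8-bit grayscale images; the DAVSOD toolbox remains the authoritative implementation of all three metrics:

```python
# Minimal MAE sketch (assumption: maps saved as 8-bit grayscale images of the
# same size; file paths are placeholders).
import numpy as np
from PIL import Image

def mae(pred_path: str, gt_path: str) -> float:
    # Load both maps as grayscale and normalize to [0, 1].
    pred = np.asarray(Image.open(pred_path).convert("L"), dtype=np.float64) / 255.0
    gt = np.asarray(Image.open(gt_path).convert("L"), dtype=np.float64) / 255.0
    # MAE is the mean absolute per-pixel difference.
    return float(np.abs(pred - gt).mean())

print(mae("pred/frame_000.png", "gt/frame_000.png"))
```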

A new RGB-D VSOD dataset (with realistic depth) is coming:

We have constructed a new RGB-D VSOD dataset whose depth is realistic rather than synthesized. See the links for the new RDVS dataset and the paper.

Citation

Please cite our paper if you find this work useful:

@inproceedings{lu2022depth,
  title={Depth-Cooperated Trimodal Network for Video Salient Object Detection},
  author={Lu, Yukang and Min, Dingyao and Fu, Keren and Zhao, Qijun},
  booktitle={2022 IEEE International Conference on Image Processing (ICIP)},
  year={2022},
  organization={IEEE}
}