NVDS (ICCV 2023) & NVDS+ (TPAMI 2024)
Welcome to the NVDS GitHub repository!
This repository is the official PyTorch implementation of the ICCV 2023 paper "Neural Video Depth Stabilizer" (NVDS).
Authors: Yiran Wang<sup>1</sup>, Min Shi<sup>1</sup>, Jiaqi Li<sup>1</sup>, Zihao Huang<sup>1</sup>, Zhiguo Cao<sup>1</sup>, Jianming Zhang<sup>2</sup>, Ke Xian<sup>3</sup>, Guosheng Lin<sup>3</sup>
Project Page | Arxiv | Video | Video (Chinese) | Poster | Supp | VDW Dataset | VDW Toolkits
TPAMI 2024 "NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation" (NVDS+)
Authors: Yiran Wang<sup>1</sup>, Min Shi<sup>1</sup>, Jiaqi Li<sup>1</sup>, Chaoyi Hong<sup>1</sup>, Zihao Huang<sup>1</sup>, Juewen Peng<sup>3</sup>, Zhiguo Cao<sup>1</sup>, Jianming Zhang<sup>2</sup>, Ke Xian<sup>1</sup>, Guosheng Lin<sup>3</sup>
Institutes: <sup>1</sup>Huazhong University of Science and Technology, <sup>2</sup>Adobe Research, <sup>3</sup>Nanyang Technological University
Paper | Arxiv | Video | Video (Chinese) | Supp
Highlights
NVDS is the first plug-and-play stabilizer that can remove flickers from any single-image depth model without extra effort. Besides, we also introduce a large-scale dataset, Video Depth in the Wild (VDW), which consists of 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset. Don't forget to star this repo if you find it interesting!
License and Releasing Policy
- VDW dataset. We release the VDW dataset under strict conditions, and we must ensure that the release does not violate any copyright requirements. To this end, we will not release any video frames or derived data publicly. Instead, we provide metadata and detailed toolkits, which can be used to reproduce VDW or to generate your own data. The metadata contains IMDb numbers, start times, end times, movie durations, resolutions, and cropping areas. All the metadata and toolkits are licensed under CC BY-NC-SA 4.0 and can only be used for academic and research purposes. Please refer to our VDW official website and VDW Toolkits for data usage.
- NVDS code and model. Following MiDaS and CVD, the NVDS model adopts the widely-used MIT License.
Updates and Todo List
- [2024.10.04] The TPAMI 2024 paper of NVDS+ is available on arXiv and IEEE Xplore.
- [2024.10.02] The extension NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation is accepted by TPAMI 2024!
- [2024.06.03] We release the official VDW toolkits to reproduce VDW and generate your own data.
- [2024.01.22] We release the supplementary video for the journal extension from NVDS to NVDS+.
- [2024.01.22] We release the metadata and evaluation code of the VDW test set.
- [2023.09.17] We upload the NVDS poster for ICCV 2023.
- [2023.09.09] Evaluation code on VDW test set is released.
- [2023.09.09] VDW official website goes online.
- [2023.08.11] Release evaluation code and checkpoint of NYUDV2-finetuned NVDS.
- [2023.08.10] Update the camera-ready version of the NVDS paper and supplementary.
- [2023.08.05] Update license of VDW dataset: CC BY-NC-SA 4.0.
- [2023.07.21] We present the NVDS checkpoint and demo (inference) code.
- [2023.07.18] Our Project Page is built and released.
- [2023.07.18] The arXiv version of our NVDS paper is released.
- [2023.07.16] Our work is accepted by ICCV 2023.
Abstract
Video depth estimation aims to infer temporally consistent depth. Some methods achieve temporal consistency by finetuning a single-image depth model during test time using geometry and re-projection constraints, which is inefficient and not robust. An alternative approach is to learn how to enforce temporal consistency from data, but this requires well-designed models and sufficient video depth data. To address these challenges, we propose a plug-and-play framework called Neural Video Depth Stabilizer (NVDS) that stabilizes inconsistent depth estimations and can be applied to different single-image depth models without extra effort. We also introduce a large-scale dataset, Video Depth in the Wild (VDW), which consists of 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset to our knowledge. We evaluate our method on the VDW dataset as well as two public benchmarks and demonstrate significant improvements in consistency, accuracy, and efficiency compared to previous approaches. Our work serves as a solid baseline and provides a data foundation for learning-based video depth models. We will release our dataset and code for future research.
<p align="center"> <img src="PDF/fig1-pipeline.PNG" width="100%"> </p>

Installation
- Basic environment. Our code is based on `python=3.8.13` and `pytorch==1.9.0`. Refer to the `requirements.txt` for installation:

  ```
  conda create -n NVDS python=3.8.13
  conda activate NVDS
  conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c conda-forge
  pip install numpy imageio opencv-python scipy tensorboard timm scikit-image tqdm glob h5py
  ```
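  As an optional sanity check (not part of the original instructions), you can confirm that the environment sees PyTorch and CUDA before proceeding:

  ```bash
  # Optional sanity check: print the PyTorch version and CUDA availability in the NVDS env.
  conda activate NVDS
  python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
  ```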
- Installation of GMFlow. We utilize the state-of-the-art optical flow model GMFlow in the temporal loss and the OPW metric. The temporal loss is used to enhance consistency during training, and the OPW metric is evaluated in our demo (inference) code to showcase quantitative improvements. Please refer to the GMFlow Official Repo for the installation.
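  As a minimal sketch (the repository URL below is an assumption; follow the GMFlow repo's own instructions if they differ), the GMFlow code can be fetched with:

  ```bash
  # Assumed location of the official GMFlow repository; see its README for pretrained weights.
  git clone https://github.com/haofeixu/gmflow.git
  ```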
- Installation of mmcv and mmseg. Cross attention in our stabilization network relies on functions from `mmcv-full==1.3.0` and `mmseg==0.11.0`. Please refer to MMSegmentation-v0.11.0 and its official documents for step-by-step installation instructions. The key is to match the versions of mmcv-full and mmsegmentation with the CUDA and PyTorch versions on your server. For instance, with `CUDA 11.1` and `PyTorch 1.9.0`, `mmcv-full 1.3.x` and `mmseg 0.11.0` (as in our installation instructions) are compatible. Since different servers use different CUDA versions, we cannot give one installation command that works for everyone: please check the version-matching tables in the official mmcv-full and mmseg documents for your own environment. You can also refer to Issue #1 for some discussions. Besides, we suggest installing `mmcv-full==1.x.x`, because some APIs and functions are removed in `mmcv-full==2.x.x` (you would need to adjust our code for mmcv-full==2.x.x).
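  One possible route for CUDA 11.1 + PyTorch 1.9.0 (an assumption for illustration, not an official instruction) is to install mmcv-full from the OpenMMLab wheel index and then mmsegmentation; if no prebuilt 1.3.x wheel matches your setup, pip falls back to building mmcv-full from source, which also works but takes longer:

  ```bash
  # Assumed example for CUDA 11.1 + PyTorch 1.9.0; check the OpenMMLab version table first.
  pip install mmcv-full==1.3.0 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html
  pip install mmsegmentation==0.11.0
  ```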
Demo & Inference
- Preparing demo videos. We put 8 demo input videos in the `demo_videos` folder, in which `bandage_1` and `market_6` are examples from the MPI Sintel dataset and `motocross-jump` is from the DAVIS dataset. The others are examples from our VDW test set. You can also prepare your own testing sequences in the same way.
- Downloading checkpoints of depth predictors. In our demo, we adopt MiDaS and DPT as different depth predictors. We use midas_v21-f6b98070.pt and dpt_large-midas-2f21e586.pt. Download these checkpoints and put them in the `dpt/checkpoints/` folder. You may need to modify the MiDaS checkpoint name (midas_v21_384.pt) or our code (midas_v21-f6b98070.pt), since the file name has been changed in the MiDaS repo.
- Preparing the checkpoint of the NVDS stabilizer. Download `NVDS_Stabilizer.pth` and put it in the `NVDS_checkpoints/` folder.
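  As a small illustrative sketch (download links are given in the steps above; the exact MiDaS file name may differ depending on the release you use), the expected checkpoint layout can be prepared with:

  ```bash
  # Illustrative layout only: place the depth-predictor and stabilizer checkpoints where the demo scripts expect them.
  mkdir -p dpt/checkpoints NVDS_checkpoints
  mv midas_v21-f6b98070.pt dpt_large-midas-2f21e586.pt dpt/checkpoints/
  mv NVDS_Stabilizer.pth NVDS_checkpoints/
  ```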
- Running the NVDS inference demo. `infer_NVDS_dpt_bi.py` and `infer_NVDS_midas_bi.py` use DPT and MiDaS as depth predictors, respectively. These scripts contain: (1) NVDS bidirectional inference; (2) OPW metric evaluation with GMFlow. The only difference between the two scripts is the depth predictor. Taking DPT as an example, the basic command is:

  ```
  CUDA_VISIBLE_DEVICES=0 python infer_NVDS_dpt_bi.py --base_dir /XXX/XXX --vnum XXX --infer_w XXX --infer_h XXX
  ```

  `--base_dir` is the folder for saving results, `--vnum` is the video number or name, and `--infer_w` and `--infer_h` are the width and height for inference. We use `--infer_h 384` by default, and `--infer_w` is set to maintain the aspect ratio of the original video. Both `--infer_w` and `--infer_h` should be integer multiples of 32 so that resolutions stay aligned in the up-sampling and down-sampling processes; a small sketch for choosing these values is given below.
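  As an illustration of this rule (the 1920x800 input resolution is only an example, not from the repo), the following picks an `--infer_w` that preserves the aspect ratio at `--infer_h 384` and rounds it to the nearest multiple of 32:

  ```bash
  # Example only: compute --infer_w for a 1920x800 clip at --infer_h 384.
  ORIG_W=1920; ORIG_H=800; INFER_H=384
  INFER_W=$(( ( ORIG_W * INFER_H / ORIG_H + 16 ) / 32 * 32 ))
  echo "--infer_w ${INFER_W} --infer_h ${INFER_H}"   # -> --infer_w 928 --infer_h 384
  ```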
  Specifically, for videos from the VDW test set (`000423` as an example):

  ```
  CUDA_VISIBLE_DEVICES=0 python infer_NVDS_dpt_bi.py --base_dir ./demo_outputs/dpt_init/000423/ --vnum 000423 --infer_w 896 --infer_h 384
  CUDA_VISIBLE_DEVICES=0 python infer_NVDS_midas_bi.py --base_dir ./demo_outputs/midas_init/000423/ --vnum 000423 --infer_w 896 --infer_h 384
  ```
  For videos from the Sintel dataset (`market_6` as an example):

  ```
  CUDA_VISIBLE_DEVICES=0 python infer_NVDS_dpt_bi.py --base_dir ./demo_outputs/dpt_init/market_6/ --vnum market_6 --infer_w 896 --infer_h 384
  CUDA_VISIBLE_DEVICES=0 python infer_NVDS_midas_bi.py --base_dir ./demo_outputs/midas_init/market_6/ --vnum market_6 --infer_w 896 --infer_h 384
  ```
  For videos from the DAVIS dataset (`motocross-jump` as an example):

  ```
  CUDA_VISIBLE_DEVICES=0 python infer_NVDS_dpt_bi.py --base_dir ./demo_outputs/dpt_init/motocross-jump/ --vnum motocross-jump --infer_w 672 --infer_h 384
  CUDA_VISIBLE_DEVICES=0 python infer_NVDS_midas_bi.py --base_dir ./demo_outputs/midas_init/motocross-jump/ --vnum motocross-jump --infer_w 672 --infer_h 384
  ```
  At a resolution of $896\times384$, the inference of DPT-Large and our stabilizer takes about 20 GB and 5 GB of GPU memory, respectively (RTX A6000). If the GPU memory or inference latency is too large for your application, you can: (1) run the DPT/MiDaS initial depth results and our NVDS separately; (2) reduce the inference resolution (e.g., $384\times384$); (3) remove the OPW evaluations if not needed, since the inference of GMFlow also brings some computational cost; (4) remove the bidirectional (backward and mixing) inference if not needed. The forward pass alone can already produce satisfactory results, while bidirectional inference further improves consistency.
  After running the inference code, the result folder `--base_dir` will be organized as follows:

  ```
  demo_outputs/dpt_init/000423/
  ├── result.txt
  ├── initial/
  │   ├── color/
  │   └── gray/
  ├── 1/
  │   ├── color/
  │   └── gray/
  ├── 2/
  │   ├── color/
  │   └── gray/
  └── mix/
      ├── color/
      └── gray/
  ```
  `result.txt` contains the OPW evaluations of the initial depth (depth predictor, `initial/`), NVDS forward predictions (`1/`), backward predictions (`2/`), and the final bidirectional results (`mix/`). `color` contains depth visualizations and `gray` contains depth results in uint16 format (0-65535).
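  If you want to reuse the saved depth maps downstream, a quick check (the frame name below is hypothetical; point it at one of your actual output files) is to load a `gray` png with the unchanged-bit-depth flag so the uint16 values are preserved:

  ```bash
  # Inspect one uint16 depth map (0-65535); adjust the path to an actual output frame.
  python -c "import cv2; d = cv2.imread('demo_outputs/dpt_init/000423/mix/gray/frame_000001.png', cv2.IMREAD_UNCHANGED); print(d.shape, d.dtype, d.min(), d.max())"
  ```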
- Video comparisons. After getting the results, video comparisons can be generated and saved in `demo_outputs_videos/`:

  ```
  python pic2v.py --vnum 000423 --infer_w 896 --infer_h 384
  python pic2v.py --vnum market_6 --infer_w 896 --infer_h 384
  python pic2v.py --vnum motocross-jump --infer_w 672 --infer_h 384
  ```
  We provide 8 video comparisons in `demo_outputs_videos/`. The first row is the RGB video, the second row is the initial depth (DPT and MiDaS), and the third row is the NVDS results with DPT and MiDaS as depth predictors. To verify the correctness of your results, you can compare them with `demo_outputs_videos` and `demo_outputs` (png results); we provide png results of the 8 videos via LINK. You are also encouraged to modify our code to stabilize your own depth predictors and discuss the results with us. We hope our work can serve as a solid baseline for future work on video depth estimation and other relevant tasks.
Evaluations on NYUDV2
- Preparing the 654 testing sequences. Download the 654 testing sequences from LINK and put them in the `./test_nyu_data` folder, which should contain only the 654 folders of all testing sequences. The folder of each sequence is organized as:

  ```
  test_nyu_data/1/
  ├── rgb/
  │   └── 000000.png 000001.png 000002.png 000003.png
  └── gt/
      └── 000003.png
  ```
  We follow the commonly-used Eigen split with 654 images for testing. In our case, we locate each image (`000003.png`) in its video and use its previous three frames (`000000.png`, `000001.png`, and `000002.png`) as reference frames.
- Preparing the NVDS checkpoint finetuned on NYUDV2. Download `NVDS_Stabilizer_NYUDV2_Finetuned.pth` and put it in the `NVDS_checkpoints/` folder.
- Evaluations with MiDaS and DPT as different depth predictors. Run `test_NYU_depth_metrics.py` with the specified depth predictor (`--initial_type dpt` or `midas`):

  ```
  CUDA_VISIBLE_DEVICES=0 python test_NYU_depth_metrics.py --initial_type dpt
  CUDA_VISIBLE_DEVICES=1 python test_NYU_depth_metrics.py --initial_type midas
  ```

  `test_NYU_depth_metrics.py` contains three parts: (1) inference of the depth predictor, producing the initial results of MiDaS or DPT; (2) inference of NVDS based on the initial results; (3) metric evaluations of the depth predictor and NVDS. All inference is conducted at a resolution of $384\times384$, as in MiDaS and DPT. For simplicity, we only adopt NVDS forward prediction in this code. By running the code, you can reproduce results similar to our paper:

  | Methods | $\delta_1$ | $Rel$ | Methods | $\delta_1$ | $Rel$ |
  | --- | --- | --- | --- | --- | --- |
  | MiDaS | 0.910 | 0.095 | DPT | 0.928 | 0.084 |
  | NVDS (MiDaS) | 0.941 | 0.076 | NVDS (DPT) | 0.950 | 0.072 |
  After running the evaluation code, `test_nyu_data` will be organized as:

  ```
  test_nyu_data/1/
  ├── rgb/
  │   └── 000000.png 000001.png 000002.png 000003.png
  ├── gt/
  │   └── 000003.png
  ├── initial_midas/
  │   └── 000000.png 000001.png 000002.png 000003.png
  ├── initial_dpt/
  │   └── 000000.png 000001.png 000002.png 000003.png
  ├── NVDS_midas/
  │   └── 000003.png
  └── NVDS_dpt/
      └── 000003.png
  ```
  We evaluate the depth metrics of all methods using only the 654 images in the Eigen split, i.e., `000003.png` of each sequence. `000000.png`, `000001.png`, and `000002.png` are produced by the depth predictors as input to the stabilization network.
Evaluations on VDW Test Set
- Applying for the VDW test set. Here we take `/xxx/vdw_test` as an example. The VDW test set contains 90 videos with 12,622 frames. For each video (e.g., `/xxx/vdw_test/000008/`), the test set is organized as follows: the `left` and `right` folders contain the RGB video frames of the left and right views, the `gt` folders contain the disparity annotations, and the `mask` folders contain the valid masks.

  ```
  /xxx/vdw_test/000008/
  ├── left/
  │   └── frame_000000.png frame_000001.png frame_000002.png ...
  ├── left_gt/
  │   └── frame_000000.png frame_000001.png frame_000002.png ...
  ├── left_mask/
  │   └── frame_000000.png frame_000001.png frame_000002.png ...
  ├── right/
  │   └── frame_000000.png frame_000001.png frame_000002.png ...
  ├── right_gt/
  │   └── frame_000000.png frame_000001.png frame_000002.png ...
  └── right_mask/
      └── frame_000000.png frame_000001.png frame_000002.png ...
  ```
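  As an optional sanity check (a hedged sketch, not an official tool), you can count the frames and ground-truth maps per test video to confirm the data is complete:

  ```bash
  # Sketch only: count RGB frames and disparity maps for every video in the VDW test set.
  for v in /xxx/vdw_test/*/; do
    echo "$(basename "$v"): $(ls "$v/left" | wc -l) left frames, $(ls "$v/left_gt" | wc -l) left GT maps"
  done
  ```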
- Inference and evaluations for each test video. For each test video, the evaluation contains two steps: (1) inference and (2) depth metric evaluation. We provide `write_sh.py` to generate the evaluation scripts (for MiDaS and DPT). You should modify some details in `write_sh.py` (e.g., GPU number, VDW test set path, and the directory for saving NVDS results with MiDaS/DPT) in order to generate `test_VDW_NVDS_Midas.sh` and `test_VDW_NVDS_DPT.sh`. We provide the two example scripts with `/xxx/` as placeholders for those directories.

  To be specific: (1) the inference step is the same as in the previous Demo & Inference part, using `infer_NVDS_dpt_bi.py` and `infer_NVDS_midas_bi.py`; in this step, the temporal metric `OPW` is automatically evaluated and saved in `result.txt`. (2) The depth metric evaluation uses `vdw_test_metric.py` to calculate $\delta_1$ and $Rel$ for each video. Taking `./vdw_test/000008/` as an example, `--gt_dir` specifies the path to `vdw_test`, `--result_dir` specifies your directory for saving results, and `--vnum` is the video number.

  ```
  python vdw_test_metric.py --gt_dir /xxx/vdw_test/ --result_dir /xxx/NVDS_VDW_Test/Midas/ --vnum 000008
  python vdw_test_metric.py --gt_dir /xxx/vdw_test/ --result_dir /xxx/NVDS_VDW_Test/DPT/ --vnum 000008
  ```
  After generating `test_VDW_NVDS_Midas.sh` and `test_VDW_NVDS_DPT.sh`, you can run inference and evaluations for all the videos by:

  ```
  bash test_VDW_NVDS_Midas.sh
  bash test_VDW_NVDS_DPT.sh
  ```
- Average metric calculation for all 90 videos. When the scripts have finished for all videos, the `NVDS_VDW_Test` folder will contain the results of the 90 test videos with MiDaS/DPT as depth predictors (`/xxx/NVDS_VDW_Test/Midas/` and `/xxx/NVDS_VDW_Test/DPT/`). For each video, there will be an `accuracy.txt` storing the depth metrics. The last step is to calculate the average temporal and depth metrics over all 90 videos. You can simply run `cal_mean_vdw_metric.py` for the final results:

  ```
  python cal_mean_vdw_metric.py --test_dir /xxx/NVDS_VDW_Test/Midas/
  python cal_mean_vdw_metric.py --test_dir /xxx/NVDS_VDW_Test/DPT/
  ```
  Finally, you should obtain the same results as in our paper. This also serves as an example of how to conduct evaluations on the VDW test set.

  | Methods | $\delta_1$ | $Rel$ | $OPW$ | Methods | $\delta_1$ | $Rel$ | $OPW$ |
  | --- | --- | --- | --- | --- | --- | --- | --- |
  | MiDaS | 0.651 | 0.288 | 0.676 | DPT | 0.730 | 0.215 | 0.470 |
  | NVDS-Forward (MiDaS) | 0.700 | 0.240 | 0.207 | NVDS-Forward (DPT) | 0.741 | 0.208 | 0.165 |
  | NVDS-Backward (MiDaS) | 0.699 | 0.240 | 0.218 | NVDS-Backward (DPT) | 0.741 | 0.208 | 0.174 |
  | NVDS-Final (MiDaS) | 0.700 | 0.240 | 0.180 | NVDS-Final (DPT) | 0.742 | 0.208 | 0.147 |
Star History
Acknowledgement
We thank the authors for releasing PyTorch, MiDaS, DPT, GMFlow, SegFormer, VSS-CFFM, Mask2Former, PySceneDetect, and FFmpeg. Thanks for their solid contributions and cheers to the community.
Citation
```
@InProceedings{NVDS,
    author    = {Wang, Yiran and Shi, Min and Li, Jiaqi and Huang, Zihao and Cao, Zhiguo and Zhang, Jianming and Xian, Ke and Lin, Guosheng},
    title     = {Neural Video Depth Stabilizer},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {9466-9476}
}

@ARTICLE{NVDSPLUS,
    author  = {Wang, Yiran and Shi, Min and Li, Jiaqi and Hong, Chaoyi and Huang, Zihao and Peng, Juewen and Cao, Zhiguo and Zhang, Jianming and Xian, Ke and Lin, Guosheng},
    journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
    title   = {NVDS$^{\mathbf{+}}$: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation},
    year    = {2024},
    pages   = {1-18},
    doi     = {10.1109/TPAMI.2024.3476387}
}
```