Awesome
DoubleTake: Geometry Guided Depth Estimation
This is the reference PyTorch implementation for training and testing MVS depth estimation models using the method described in
<p align="center"> <img src="media/teaser.png" alt="example output" width="720" /> </p>DoubleTake: Geometry Guided Depth Estimation
Mohamed Sayed, Filippo Aleotti, Jamie Watson, Zawar Qureshi, Guillermo Garcia-Hernando, Gabriel Brostow, Sara Vicente and Michael Firman.
Paper, ECCV 2024 (arXiv pdf), Supplemental Material, Project Page, Video
https://github.com/user-attachments/assets/aa2052df-79f4-43a8-ab24-d704660f228a
<p align="center"> <img src="media/mesh_teaser.png" alt="example output" width="720" /> </p>Please, refer to the the license file for terms of usage. If you use this codebase in your research, please consider citing our paper using the BibTex below and linking this repo. Thanks!
Table of Contents
- ๐บ๏ธ Overview
- โ๏ธ Setup
- ๐ฆ Trained Models and Precomputed Meshes/Scores
- ๐ Speed
- ๐ Running out of the box!
- ๐พ ScanNetv2 Dataset
- ๐พ SimpleRecon ScanNet Training Depth Renders
- ๐พ 3RScan Dataset
- ๐ Testing and Evaluation
- ๐ Mesh Metrics
- ๐๐งฎ๐ฉโ๐ป Notation for Transformation Matrices
- ๐บ๏ธ World Coordinate System
- ๐จ๐พ Training Data Preperation
- ๐ Acknowledgements
- ๐ BibTeX
- ๐ฉโโ๏ธ License
๐บ๏ธ Overview
DoubleTake takes as input posed RGB images, and outputs a depth map for a target image. Under the hood, it uses a mesh it itself builds either online (incrementally) or offline (mesh built on one pass and used for better depth on the second pass) to improve its own depth estimates.
https://github.com/user-attachments/assets/269c658a-7325-4b52-98ab-bd3505f045db
โ๏ธ Setup
We are going to create a new Mamba environment called doubletake
. If you don't have Mamba, you can install it with:
make install-mamba
Then setup the environment with:
make create-mamba-env
mamba activate doubletake
In the code directory, install the repo as a pip package:
pip install -e .
Some C++ code will compile JIT using ninja the first time you use any of the fusers. Should be quick.
In case you don't have this in your ~/.bashrc
already, you should run:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
If you get a GLIBCXX_3.4.29
not found error it's very likely this.
๐ฆ Trained Models and Precomputed Meshes/Scores
We provide three models. The standard DoubleTake model used for incremental, offline, and revisit evaluation on all datasets and figures in the paper, a slimmed down faster version of DoubleTake, and the vanilla SimpleRecon model we used for SimpleRecon scores. Use the links in the table to access the weights for each. The scores here are very slightly different (better) than those in the paper due to a slight bug fix in training data renders.
Download a pretrained model into the weights/
folder.
Scores on ScanNet:
Model | Config | Weights | Notes |
---|---|---|---|
SimpleRecon | configs/models/simplerecon_model.yaml | Link | |
DoubleTake Small | configs/models/doubletake_small_model.yaml | Link | |
DoubleTake | configs/models/doubletake_model.yaml | Link | ours in the paper |
Offline/Two Pass using test_offline_two_pass | Abs Diffโ | Sq Relโ | delta < 1.05โ | Chamferโ | F-Scoreโ | Meshes and Full Scores |
---|---|---|---|---|---|---|
SimpleRecon (Offline Tuples w/ test_no_hint ) | .0873 | .0128 | 74.12 | 5.29 | .668 | Link |
DoubleTake Small | .0631 | .0097 | 86.36 | 4.64 | .723 | Link |
DoubleTake | .0624 | .0092 | 86.64 | 4.42 | .742 | Link |
Incremental using test_incremental | Abs Diffโ | Sq Relโ | delta < 1.05โ | Chamferโ | F-Scoreโ | Meshes and Full Scores |
---|---|---|---|---|---|---|
DoubleTake Small | .0825 | .0124 | 76.75 | 5.53 | .649 | Link |
DoubleTake | .0754 | .0109 | 80.29 | 5.03 | .689 | Link |
No hint and online using test_no_hint | Abs Diffโ | Sq Relโ | delta < 1.05โ | Chamferโ | F-Scoreโ | Meshes and Full Scores |
---|---|---|---|---|---|---|
SimpleRecon (Online Tuples) | .0873 | .0128 | 74.12 | 5.29 | .668 | Link |
DoubleTake Small | .0938 | .0148 | 72.02 | 5.50 | .650 | Link |
DoubleTake | .0863 | .0127 | 74.64 | 5.22 | .672 | Link |
๐ Speed
Please see the paper and supplemental material for details on runtime. We do not include the first-pass feature caching step in this code release.
๐ Running out of the box!
We've included two scans for people to try out immediately with the code. You can download these scans from here.
Steps:
- Download weights for the
hero_model
into the weights directory. - Download the scans and unzip them into
datasets/
- If you've unzipped into a different folder, modify the value for the option
dataset_path
inconfigs/data/vdr/vdr_default_offline.yaml
to the base path of the unzipped vdr folder. - You should be able to run it! Something like this will work:
For offline depth estimation and fusion:
CUDA_VISIBLE_DEVICES=0 python -m doubletake.test_offline_two_pass --name doubletake_offline \
--output_base_path $OUTPUT_PATH \
--config_file configs/models/doubletake_model.yaml \
--load_weights_from_checkpoint weights/doubletake_model.ckpt \
--data_config configs/data/vdr/vdr_default_offline.yaml \
--num_workers 8 \
--batch_size 2 \
--fast_cost_volume \
--run_fusion \
--depth_fuser custom_open3d \
--fuse_color \
--fusion_max_depth 3.5 \
--fusion_resolution 0.02 \
--trim_tsdf_using_confience \
--extended_neg_truncation \
--dump_depth_visualization;
This will output meshes, quick depth viz, and scores when benchmarked against LiDAR depth under OUTPUT_PATH
.
This command uses vdr_default_offline.yaml
which will generate a depth map for every keyframe and fuse them into a mesh. You can also use dense_offline
tuples by instead using vdr_dense_offline.yaml
for a depth map for every frame.
See the section below on testing and evaluation. Make sure to use the correct config flags for datasets.
๐พ ScanNetv2 Dataset
We've written a quick tutorial and included modified scripts to help you with downloading and extracting ScanNetv2. You can find them at data_scripts/scannet_wrangling_scripts/
You should change the dataset_path
config argument for ScanNetv2 data configs at configs/data/
to match where your dataset is.
The codebase expects ScanNetv2 to be in the following format:
dataset_path
scans_test (test scans)
scene0707
scene0707_00_vh_clean_2.ply (gt mesh)
sensor_data
frame-000261.pose.txt
frame-000261.color.jpg
frame-000261.color.512.png (optional, image at 512x384)
frame-000261.color.640.png (optional, image at 640x480)
frame-000261.depth.png (full res depth, stored scale *1000)
frame-000261.depth.256.png (optional, depth at 256x192 also
scaled)
scene0707.txt (scan metadata and image sizes)
intrinsic
intrinsic_depth.txt
intrinsic_color.txt
...
scans (val and train scans)
scene0000_00
(see above)
scene0000_01
....
In this example scene0707.txt
should contain the scan's metadata:
colorHeight = 968
colorToDepthExtrinsics = 0.999263 -0.010031 0.037048 ........
colorWidth = 1296
depthHeight = 480
depthWidth = 640
fx_color = 1170.187988
fx_depth = 570.924255
fy_color = 1170.187988
fy_depth = 570.924316
mx_color = 647.750000
mx_depth = 319.500000
my_color = 483.750000
my_depth = 239.500000
numColorFrames = 784
numDepthFrames = 784
numIMUmeasurements = 1632
frame-000261.pose.txt
should contain pose in the form:
-0.384739 0.271466 -0.882203 4.98152
0.921157 0.0521417 -0.385682 1.46821
-0.0587002 -0.961035 -0.270124 1.51837
frame-000261.color.512.png
and frame-000261.color.640.png
are precached resized versions of the original image to save load and compute time during training and testing. frame-000261.depth.256.png
is also a
precached resized version of the depth map.
All resized precached versions of depth and images are nice to have but not required. If they don't exist, the full resolution versions will be loaded, and downsampled on the fly.
๐พ SimpleRecon ScanNet Training Depth Renders
DoubleTake is trained using depth and confidence renders of partial and full meshes of the train and validation ScanNet splits. We've provided these here if you'd like to train a DoubleTake model.
๐พ 3RScan Dataset
This section explains how to prepare 3RScan for testing:
Please download and extract the dataset by following the instructions here.
The dataset should be formatted like so:
<dataset_path>
<scanId>
|-- mesh.refined.v2.obj
Reconstructed mesh
|-- mesh.refined.mtl
Corresponding material file
|-- mesh.refined_0.png
Corresponding mesh texture
|-- sequence.zip
Calibrated RGB-D sensor stream with color and depth frames, camera poses
|-- labels.instances.annotated.v2.ply
Visualization of semantic segmentation
|-- mesh.refined.0.010000.segs.v2.json
Over-segmentation of annotation mesh
|-- semseg.v2.json
Instance segmentation of the mesh (contains the labels)
Please make sure to extract each sequence.zip
inside every scanId
folder.
We provide the frame tuple files for this dataset (see for eg. data_splits/3rscan/test_eight_view_deepvmvs.txt
) but if you need recreate them, you can do so by following the instructions here.
NOTE: we only use 3RScan dataset for testing and the data split used (data_splits/3rscan/3rscan_test.txt
) corresponds to the validation split in the original dataset repo (splits/val.txt
). We use the val split as the transformations that align the reference scan to the rescans are readily available for the train and val splits.
๐ผ๏ธ๐ผ๏ธ๐ผ๏ธ Frame Tuples
By default, we estimate a depth map for each keyframe in a scan. We use DeepVideoMVS's heuristic for keyframe separation and construct tuples to match. We use the depth maps at these keyframes for depth fusion. For each keyframe, we associate a list of source frames that will be used to build the cost volume. We also use dense tuples, where we predict a depth map for each frame in the data, and not just at specific keyframes; these are mostly used for visualization.
We generate and export a list of tuples across all scans that act as the dataset's elements. We've precomputed these lists and they are available at data_splits
under each dataset's split. For ScanNet's test scans they are at data_splits/ScanNetv2/standard_split
. Our core depth numbers are computed using data_splits/ScanNetv2/standard_split/test_eight_view_deepvmvs.txt
.
Here's a quick taxonamy of the type of tuples for test:
default
: a tuple for every keyframe following DeepVideoMVS where all source frames are in the past. Used for all depth and mesh evaluation unless stated otherwise. For ScanNet usedata_splits/ScanNetv2/standard_split/test_eight_view_deepvmvs.txt
.offline
: a tuple for every frame in the scan where source frames can be both in the past and future relative to the current frame. These are useful when a scene is captured offline, and you want the best accuracy possible. With online tuples, the cost volume will contain empty regions as the camera moves away and all source frames lag behind; however with offline tuples, the cost volume is full on both ends, leading to a better scale (and metric) estimate.dense
: an online tuple (like default) for every frame in the scan where all source frames are in the past. For ScanNet this would bedata_splits/ScanNetv2/standard_split/test_eight_view_deepvmvs_dense.txt
.dense_offline
: an offline tuple for every frame in the scan.
For the train and validation sets, we follow the same tuple augmentation strategy as in DeepVideoMVS and use the same core generation script.
If you'd like to generate these tuples yourself, you can use the scripts at data_scripts/generate_train_tuples.py
for train tuples and data_scripts/generate_test_tuples.py
for test tuples. These follow the same config format as test.py
and will use whatever dataset class you build to read pose informaiton.
Example for test:
# default tuples
python ./scripts/data_scripts/generate_test_tuples.py
--data_config configs/data/scannet/scannet_default_test.yaml
--num_workers 16
# dense tuples
python ./scripts/data_scripts/generate_test_tuples.py
--data_config configs/data/scannet/scannet_dense_test.yaml
--num_workers 16
Examples for train:
# train
python ./scripts/data_scripts/generate_train_tuples.py
--data_config configs/data/scannet/scannet_default_train.yaml
--num_workers 16
# val
python ./scripts/data_scripts/generate_val_tuples.py
--data_config configs/data/scannet/scannet_default_val.yaml
--num_workers 16
These scripts will first check each frame in the dataset to make sure it has an existing RGB frame, an existing depth frame (if appropriate for the dataset), and also an existing and valid pose file. It will save these valid_frames
in a text file in each scan's folder, but if the directory is read only, it will ignore saving a valid_frames
file and generate tuples anyway.
๐ Testing and Evaluation
Depth Evaluation
You can evaluate our model on the depth benchmark of ScanNetv2 using the following commands:
For online incremental depth estimation, use this command.
CUDA_VISIBLE_DEVICES=0 python -m doubletake.test_incremental \
--name doubletake_incremental \
--config_file configs/models/doubletake_model.yaml \
--load_weights_from_checkpoint weights/doubletake_model.ckpt \
--data_config configs/data/scannet/scannet_default_test.yaml \
--num_workers 12 \
--batch_size 1 \
--fast_cost_volume \
--output_base_path $OUTPUT_DIR \
--load_empty_hint \
--fusion_resolution 0.02 \
--extended_neg_truncation \
--fusion_max_depth 3.5 \
--depth_fuser ours;
For offline depth estimation, use this command. Note this will generate meshes.
Remove --run_fusion
if you don't want to generate meshes for the second pass.
CUDA_VISIBLE_DEVICES=0 python -m doubletake.test_offline_two_pass \
--name doubletake_offline \
--config_file configs/models/doubletake_model.yaml \
--load_weights_from_checkpoint weights/doubletake_model.ckpt \
--data_config configs/data/scannet/scannet_offline_test.yaml \
--num_workers 12 \
--batch_size 4 \
--fast_cost_volume \
--output_base_path $OUTPUT_DIR \
--load_empty_hint \
--fusion_resolution 0.02 \
--extended_neg_truncation \
--fusion_max_depth 3.5 \
--depth_fuser ours;
If you want to see the performance of a DoubleTake model without hints (no depth hint and online), use:
CUDA_VISIBLE_DEVICES=0 python -m doubletake.test_no_hint \
--name doubletake_no_hint \
--config_file configs/models/doubletake_model.yaml \
--load_weights_from_checkpoint weights/doubletake_model.ckpt \
--data_config configs/data/scannet/scannet_default_test.yaml \
--num_workers 12 \
--batch_size 4 \
--fast_cost_volume \
--output_base_path $OUTPUT_DIR \
--load_empty_hint \
--fusion_resolution 0.02 \
--extended_neg_truncation \
--fusion_max_depth 3.5 \
--depth_fuser ours;
You can use test_no_hint
for the provided SimpleRecon model as well.
TSDF Fusion
TL;DR: use ours
for ScanNet, 7Scenes, and 3RScan. Anything to do with scores. Use custom_open3d
for anything else.
ours
and custom_open3d
give almost identical scores on ScanNet given the same fusion flags.
To run TSDF fusion provide the --run_fusion
flag. This is mandatory for incremental running. You have three choices for
fusers:
--depth_fuser ours
(default) will use our fuser, whose meshes are used in most visualizations and for scores. This fuser does not support color. We've provided a custom branch of scikit-image with our custom implementation ofmeasure.matching_cubes
that allows single walled. We use single walled meshes for evaluation. If this is isn't important to you, you can set the export_single_mesh toFalse
for call toexport_mesh
intest.py
. This fuser's TSDF volume is not sparse, and ScanNet/7Scenes/3RScan meshes will fit in memory on an A100 given we have known mesh bounds.--depth_fuser custom_open3d
will use a custom version of the open3d fuser that supports our confidence mapping and confidence sampling. There is currently a memory leak in open3d core. We will post an updated version of the fuser if resolved. This fuser supports a sparse volume and our free space cleanup. This fuser supports color via the--fuse_color
flag.--depth_fuser open3d
will use the default open3d depth fuser. This fuser supports color and you can enable this by using the--fuse_color
flag. This fuser does not support confidences. This fuser cannot be used as a hint fuser.
For ours
and custom_open3d
you can pass --extended_neg_truncation
for more complete meshes.
Scores in the paper are computed with this.
For custom_open3d
you can pass --trim_tsdf_using_confience
to remove potential floaters, especially in outdoor scenes.
By default, depth maps will be clipped to 3.5m for fusion and a tsdf
resolution of 0.02m<sup>3</sup> will be used, but you can change that by changing both
--max_fusion_depth
and --fusion_resolution
.
Hint fusers are locked to 3.0m and 0.04m resolution.
Meshes will be stored under results_path/meshes/{scan name}_{mesh params}
.
Cache depths
You can optionally store depths by providing the --cache_depths
flag.
They will be stored at results_path/depths
.
Example command to compute scores and cache depths
CUDA_VISIBLE_DEVICES=0 python -m doubletake.test_offline_two_pass \
--name doubletake_offline \
--config_file configs/models/doubletake_model.yaml \
--load_weights_from_checkpoint weights/doubletake_model.ckpt \
--data_config configs/data/scannet/scannet_offline_test.yaml \
--num_workers 12 \
--batch_size 4 \
--fast_cost_volume \
--output_base_path $OUTPUT_DIR \
--load_empty_hint \
--cache_depths;
Quick viz
There are other scripts for deeper visualizations of output depths and
fusion, but for quick export of depth map visualization you can use
--dump_depth_visualization
. Visualizations will be stored at results_path/viz/quick_viz/
.
# Example command to output quick depth visualizations
CUDA_VISIBLE_DEVICES=0 python -m doubletake.test_offline_two_pass \
--name doubletake_offline \
--config_file configs/models/doubletake_model.yaml \
--load_weights_from_checkpoint weights/doubletake_model.ckpt \
--data_config configs/data/scannet/scannet_offline_test.yaml \
--num_workers 12 \
--batch_size 4 \
--fast_cost_volume \
--output_base_path $OUTPUT_DIR \
--load_empty_hint \
--dump_depth_visualization;
Revisit Evaluation
You can evaluate our model in the revist scenario (i.e using the geometry from a previous visit as โhintsโ for our current depth estimates) on the 3RScan dataset by running the following command:
CUDA_VISIBLE_DEVICES=0 python -m doubletake.test_revisit \
--config_file configs/models/doubletake_model.yaml \
--load_weights_from_checkpoint ./models/doubletake_model.ckpt \
--data_config configs/data/3rscan/3rscan_test.yaml \
--dataset_path PATH/TO/3RScan_dataset \
--num_workers 12 \
--batch_size 6 \
--output_base_path ./outputs/ \
--depth_hint_aug 0.0 \
--load_empty_hint \
--name final_model_3rscan_revisit \
--run_fusion \
--rotate_images;
๐ Mesh Metrics
We use a mesh evaluation protocol similar to TransformerFusion's, but use occlusion masks that better fit available geometry in the ground truth. The masks can be found here.
CUDA_VISIBLE_DEVICES=0 python scripts/evals/mesh_eval.py \
--groundtruth_dir SCANNET_TEST_FOLDER_PATH \
--prediction_dir ROOT_PRED_DIRECTORY/SCAN_NAME.ply \
--visibility_volume_path UNTARED_VISIBILITY_MASK_PATH \
--wait_for_scan;
Use --wait_for_scan
if the prediction is still being generated and you want the script to wait until a scan's mesh is available before proceeding.
๐๐งฎ๐ฉโ๐ป Notation for Transformation Matrices
TL;DR: world_T_cam == world_from_cam
This repo uses the notation "cam_T_world" to denote a transformation from world to camera points (extrinsics). The intention is to make it so that the coordinate frame names would match on either side of the variable when used in multiplication from right to left:
cam_points = cam_T_world @ world_points
world_T_cam
denotes camera pose (from cam to world coords). ref_T_src
denotes a transformation from a source to a reference view.
Finally this notation allows for representing both rotations and translations such as: world_R_cam
and world_t_cam
๐บ๏ธ World Coordinate System
This repo is geared towards ScanNet, so while its functionality should allow for any coordinate system (signaled via input flags), the model weights we provide assume a ScanNet coordinate system. This is important since we include ray information as part of metadata. Other datasets used with these weights should be transformed to the ScanNet system. The dataset classes we include will perform the appropriate transforms.
๐จ๐พ Training Data Preperation
To train a DoubleTake model you'll need the ScanNetv2 dataset and renders of a mesh from an SR model. We provide these renders.
To generate mesh renders, you'll first need to run a SimpleRecon model and cache those depths to disk. You should
use scannet_default_train_inference_style.yaml
and scannet_default_val_inference_style.yaml
for this. These configs run the model on test-style keyframes
on both train and val splits. Something like this:
CUDA_VISIBLE_DEVICES=0 python -m doubletake.test_no_hint
--config_file configs/models/simplerecon_model.yaml
--load_weights_from_checkpoint simplerecon_model_weights.ckpt
--data_config configs/data/scannet/scannet_default_train_inference_style.yaml
--num_workers 8
--batch_size 8
--cache_depths
--run_fusion
--output_base_path YOUR_OUTPUT_DIR
--dataset_path SCANNET_DIR;
CUDA_VISIBLE_DEVICES=0 python -m doubletake.test_no_hint
--config_file configs/models/simplerecon_model.yaml
--load_weights_from_checkpoint simplerecon_model_weights.ckpt
--data_config configs/data/scannet/scannet_default_val_inference_style.yaml
--num_workers 8
--batch_size 8
--cache_depths
--run_fusion
--output_base_path YOUR_OUTPUT_DIR
--dataset_path SCANNET_DIR;
With these cached depths, you can generate mesh renders for training:
CUDA_VISIBLE_DEVICES=0 python ./scripts/render_scripts/render_meshes.py \
--data_config configs/data/scannet/scannet_default_train.yaml \
--cached_depth_path YOUR_OUTPUT_DIR/simplerecon_model/scannet/default/depths \
--output_root renders/partial_renders \
--dataset_path SCANNET_DIR \
--batch_size 4 \
--data_to_render both \
--partial 1;
CUDA_VISIBLE_DEVICES=0 python ./scripts/render_scripts/render_meshes.py \
--data_config configs/data/scannet/scannet_default_train.yaml \
--cached_depth_path YOUR_OUTPUT_DIR/simplerecon_model/scannet/default/depths \
--output_root renders/renders \
--dataset_path /mnt/scannet/ \
--batch_size 4 \
--data_to_render both \
--partial 0;
CUDA_VISIBLE_DEVICES=0 python ./scripts/render_scripts/render_meshes.py \
--data_config configs/data/scannet/scannet_default_val.yaml \
--cached_depth_path YOUR_OUTPUT_DIR/simplerecon_model/scannet/default/depths \
--output_root renders/partial_renders \
--dataset_path SCANNET_DIR \
--batch_size 4 \
--data_to_render both \
--partial 1;
CUDA_VISIBLE_DEVICES=0 python ./scripts/render_scripts/render_meshes.py \
--data_config configs/data/scannet/scannet_default_val.yaml \
--cached_depth_path YOUR_OUTPUT_DIR/simplerecon_model/scannet/default/depths \
--output_root renders/renders \
--dataset_path /mnt/scannet/ \
--batch_size 4 \
--data_to_render both \
--partial 0;
๐ Acknowledgements
The tuple generation scripts make heavy use of a modified version of DeepVideoMVS's Keyframe buffer (thanks Arda and co!).
We'd like to thank the Niantic Raptor R&D infrastructure team - Saki Shinoda, Jakub Powierza, and Stanimir Vichev - for their valuable infrastructure support.
๐ BibTeX
If you find our work useful in your research please consider citing our paper:
@inproceedings{sayed2022simplerecon,
title={DoubleTake: Geometry Guided Depth Estimation},
author={Sayed, Mohamed and Aleotti, Filippo and Watson, Jamie and Qureshi, Zawar and Garcia-Hernando, Guillermo and Brostow, Gabriel and Vicente, Sara and Firman, Michael},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2024},
}
๐ฉโโ๏ธ License
Copyright ยฉ Niantic, Inc. 2024. Patent Pending. All rights reserved. Please see the license file for terms.