<div align="center"> <h1>PatchFusion </h1> <h3>An End-to-End Tile-Based Framework <br> for High-Resolution Monocular Metric Depth Estimation</h3>

Website Paper Hugging Face Space Hugging Face Model License: MIT

<a href="https://zhyever.github.io/">Zhenyu Li</a>, <a href="https://shariqfarooq123.github.io/">Shariq Farooq Bhat</a>, <a href="https://peterwonka.net/">Peter Wonka</a>. <br>KAUST

<center> <img src='assets/showcase_3.gif'> </center> </div>

NEWS

Repo Features

Environment setup

Install the environment using environment.yml:

Using mamba (fastest):

mamba env create -n patchfusion --file environment.yml
mamba activate patchfusion

Using conda:

conda env create -n patchfusion --file environment.yml
conda activate patchfusion

NOTE:

Before running the code, please first run:

export PYTHONPATH="${PYTHONPATH}:/path/to/the/folder/PatchFusion"
export PYTHONPATH="${PYTHONPATH}:/path/to/the/folder/PatchFusion/external"

Make sure you have also exported the external folder, which stores code from other repos (ZoeDepth, Depth-Anything, etc.).
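When exporting PYTHONPATH in the shell is inconvenient (for example, inside a notebook), roughly the same effect can be had from Python by extending sys.path before importing estimator. The following is a minimal sketch: the helper name add_patchfusion_paths is hypothetical and not part of PatchFusion, and the repo path is the same placeholder used in the export commands above. Unlike the export commands, which append, this sketch places the folders at the front of the search path.

```python
import os
import sys


def add_patchfusion_paths(repo_root):
    """Put the repo root and its external/ folder at the front of sys.path.

    Hypothetical helper, not part of PatchFusion. `repo_root` is the same
    placeholder path used in the export commands above.
    """
    repo_root = os.path.abspath(repo_root)
    paths = [repo_root, os.path.join(repo_root, "external")]
    # Insert in reverse so the repo root ends up before external/.
    for path in reversed(paths):
        if path not in sys.path:
            sys.path.insert(0, path)
    return paths


# Example call (placeholder path):
# add_patchfusion_paths("/path/to/the/folder/PatchFusion")
```

Calling the helper twice with the same path leaves sys.path unchanged the second time.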

Pre-Trained Model

We provide PatchFusion with various base depth models: ZoeDepth-N, Depth-Anything-vits, Depth-Anything-vitb, and Depth-Anything-vitl. PatchFusion's inference time scales linearly with the base model's inference time.

import torch
from estimator.models.patchfusion import PatchFusion

model_name = 'Zhyever/patchfusion_depth_anything_vitl14'

# valid model names:
# 'Zhyever/patchfusion_depth_anything_vits14',
# 'Zhyever/patchfusion_depth_anything_vitb14',
# 'Zhyever/patchfusion_depth_anything_vitl14',
# 'Zhyever/patchfusion_zoedepth'

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
model = PatchFusion.from_pretrained(model_name).to(DEVICE).eval()
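A typo in the model name only surfaces once from_pretrained tries to resolve it. A small guard can catch it earlier. This is a sketch, assuming the four names listed above are the complete set; check_model_name and VALID_MODEL_NAMES are hypothetical names, not part of PatchFusion.

```python
# The four model names listed in this README (assumed to be the full set).
VALID_MODEL_NAMES = (
    "Zhyever/patchfusion_depth_anything_vits14",
    "Zhyever/patchfusion_depth_anything_vitb14",
    "Zhyever/patchfusion_depth_anything_vitl14",
    "Zhyever/patchfusion_zoedepth",
)


def check_model_name(model_name):
    """Return model_name unchanged if it is one of the listed names."""
    if model_name not in VALID_MODEL_NAMES:
        options = ", ".join(VALID_MODEL_NAMES)
        raise ValueError(
            f"Unknown model name {model_name!r}; expected one of: {options}"
        )
    return model_name


# Possible usage with the loading code above:
# model = PatchFusion.from_pretrained(check_model_name(model_name))
```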

Solution Without a Network Connection

<details> <summary>Click here for solutions</summary>
# In the config file for your model, point pretrain_model at the local coarse and fine checkpoints:
model.config.pretrain_model=['./work_dir/depth-anything/ckps/coarse_pretrain.pth', './work_dir/depth-anything/ckps/fine_pretrain.pth']

# Note: the default paths are './work_dir/depthanything_vitl_u4k/coarse_pretrain/checkpoint_24.pth' and './work_dir/depthanything_vitl_u4k/fine_pretrain/checkpoint_24.pth'. Find this item in the config and replace it accordingly.

# Then load the model locally:
import torch
from mmengine.config import Config
from mmengine.logging import print_log
from estimator.models.builder import build_model  # import path assumed from the repo layout

cfg_path = './configs/patchfusion_depthanything/depthanything_vitl_patchfusion_u4k.py'
cfg = Config.fromfile(cfg_path) # load the config for Depth-Anything vitl
model = build_model(cfg.model) # build the model
print_log(model.load_dict(torch.load(cfg.ckp_path)['model_state_dict']), logger='current') # load the checkpoint and log the result

When the PatchFusion model is built, its init function loads the coarse and fine checkpoints. Because patchfusion.pth contains only the parameters of the fusion network, loading it produces some warnings; these are expected and harmless. The coarse model, fine model, and fusion model are deliberately saved separately.

| Model Name | Config Path |
|---|---|
| Depth-Anything-vitl | ./configs/patchfusion_depthanything/depthanything_vitl_patchfusion_u4k.py |
| Depth-Anything-vitb | ./configs/patchfusion_depthanything/depthanything_vitb_patchfusion_u4k.py |
| Depth-Anything-vits | ./configs/patchfusion_depthanything/depthanything_vits_patchfusion_u4k.py |
| ZoeDepth-N | ./configs/patchfusion_zoedepth/zoedepth_patchfusion_u4k.py |
</details>
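For scripted offline setups, the model-to-config mapping listed in the collapsed section above can be kept in a small lookup, together with a check that the local checkpoint files are actually in place before the model is built. This is only a sketch: CONFIG_PATHS, config_path_for, and missing_files are hypothetical names introduced here, not part of PatchFusion.

```python
import os

# Model name -> config path, copied from the list in this README.
CONFIG_PATHS = {
    "Depth-Anything-vitl": "./configs/patchfusion_depthanything/depthanything_vitl_patchfusion_u4k.py",
    "Depth-Anything-vitb": "./configs/patchfusion_depthanything/depthanything_vitb_patchfusion_u4k.py",
    "Depth-Anything-vits": "./configs/patchfusion_depthanything/depthanything_vits_patchfusion_u4k.py",
    "ZoeDepth-N": "./configs/patchfusion_zoedepth/zoedepth_patchfusion_u4k.py",
}


def config_path_for(model):
    """Look up the config path for a base model name from the list above."""
    if model not in CONFIG_PATHS:
        raise KeyError(f"No config listed for {model!r}")
    return CONFIG_PATHS[model]


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [path for path in paths if not os.path.isfile(path)]


# Example: check the local checkpoints before building the model offline.
# missing = missing_files(['./work_dir/depth-anything/ckps/coarse_pretrain.pth',
#                          './work_dir/depth-anything/ckps/fine_pretrain.pth'])
```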

User Inference

Running:

To execute user inference, use the following command:

python run.py ${CONFIG_FILE} --ckp-path <checkpoints> --cai-mode <m1 | m2 | rn> --cfg-option general_dataloader.dataset.rgb_image_dir='<img-directory>' [--save] --work-dir <output-path> --test-type general [--gray-scale] --image-raw-shape [h w] --patch-split-num [h w]
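The --cai-mode value takes one of three spellings: m1, m2, or r followed by an integer n. When the command is generated from a script, validating that string first gives a clearer error than a failed run. The following sketch only illustrates the accepted spellings; parse_cai_mode is a hypothetical helper, and how PatchFusion itself parses the flag is not shown in this README.

```python
import re


def parse_cai_mode(value):
    """Split a --cai-mode string into (kind, n).

    'm1' and 'm2' are returned as ('m1', None) and ('m2', None).
    'r<n>' such as 'r32' is returned as ('r', 32).
    """
    if value in ("m1", "m2"):
        return value, None
    match = re.fullmatch(r"r(\d+)", value)
    if match is None:
        raise ValueError(f"Invalid cai-mode {value!r}; expected m1, m2, or r<n>")
    return "r", int(match.group(1))
```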

Arguments Explanation (More details can be found here):

Example Usage:

Below is an example command that demonstrates how to run the inference process:

python ./tools/test.py configs/patchfusion_depthanything/depthanything_general.py --ckp-path Zhyever/patchfusion_depth_anything_vitl14 --cai-mode r32 --cfg-option general_dataloader.dataset.rgb_image_dir='./examples/' --save --work-dir ./work_dir/predictions --test-type general --image-raw-shape 1080 1920 --patch-split-num 2 2

This example performs inference using the depthanything_general.py configuration for Depth-Anything, loads the specified checkpoint patchfusion_depth_anything_vitl14, sets the PatchFusion mode to r32, specifies the input image directory ./examples/, and saves the output to ./work_dir/predictions. The original dimensions of the input image are 1080x1920, and the input image is divided into 2x2 patches.
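To make the 2x2 split concrete: a 1080x1920 image divided into a regular 2x2 grid yields four tiles of 540x960 each. The sketch below computes that grid. It covers only a plain non-overlapping grid under the assumption that the image size is divisible by the split numbers; tile_boxes is a hypothetical helper, and PatchFusion's own tiling code (including any additional patches a given mode processes) is not shown in this README.

```python
def tile_boxes(image_hw, split_hw):
    """Return (top, left, height, width) for each tile of a regular grid."""
    height, width = image_hw
    rows, cols = split_hw
    if height % rows or width % cols:
        raise ValueError("image size must be divisible by the split numbers in this sketch")
    tile_h, tile_w = height // rows, width // cols
    # Row-major order: left to right, then top to bottom.
    return [
        (r * tile_h, c * tile_w, tile_h, tile_w)
        for r in range(rows)
        for c in range(cols)
    ]


boxes = tile_boxes((1080, 1920), (2, 2))
# Four tiles; the first is (0, 0, 540, 960) and the last is (540, 960, 540, 960).
```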

Easy Way to Import PatchFusion:

<details> <summary>Code snippet</summary>

You can find this code snippet in ./tools/test_single_forward.py.

import cv2
import torch
import numpy as np
import torch.nn.functional as F
from torchvision import transforms

from estimator.models.patchfusion import PatchFusion

model_name = 'Zhyever/patchfusion_depth_anything_vitl14'

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
model = PatchFusion.from_pretrained(model_name).to(DEVICE).eval()
image_raw_shape = model.tile_cfg['image_raw_shape']
image_resizer = model.resizer

image = cv2.imread('./examples/example_1.jpeg')
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) / 255.0
image = transforms.ToTensor()(np.asarray(image)) # raw image

image_lr = image_resizer(image.unsqueeze(dim=0)).float().to(DEVICE)
image_hr = F.interpolate(image.unsqueeze(dim=0), image_raw_shape, mode='bicubic', align_corners=True).float().to(DEVICE)

mode = 'r128' # inference mode
process_num = 4 # batch process size
depth_prediction, _ = model(mode='infer', cai_mode=mode, process_num=process_num, image_lr=image_lr, image_hr=image_hr)
depth_prediction = F.interpolate(depth_prediction, image.shape[-2:])[0, 0].detach().cpu().numpy() # depth shape would be (h, w), similar to the input image
</details>
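The snippet in the collapsed section above ends with depth_prediction as an (h, w) NumPy array. For a quick visual check, the array can be min-max normalized to 8-bit grayscale. This is a minimal sketch, assuming only that the input is a 2D array of finite values; depth_to_uint8 is a hypothetical helper, not part of PatchFusion. Note that normalization discards the metric scale, so keep the raw array (for example with np.save) if the metric values matter.

```python
import numpy as np


def depth_to_uint8(depth):
    """Min-max normalize a (h, w) depth map to uint8 for quick viewing."""
    depth = np.asarray(depth, dtype=np.float32)
    lo, hi = float(depth.min()), float(depth.max())
    if hi - lo < 1e-8:
        # Constant depth map: return all zeros instead of dividing by zero.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - lo) / (hi - lo)
    return (scaled * 255.0).round().astype(np.uint8)


# Possible usage after the inference snippet:
# cv2.imwrite('depth_vis.png', depth_to_uint8(depth_prediction))
```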

More details about inference are provided here.

User Training

Please refer to user_training for more details.

Acknowledgement

We would like to thank AK (@_akhaliq) and @hysts from the HuggingFace team for their help.

Citation

If you find our work useful for your research, please consider citing the paper:

@inproceedings{li2023patchfusion,
    title={PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation}, 
    author={Zhenyu Li and Shariq Farooq Bhat and Peter Wonka},
    booktitle={CVPR},
    year={2024}
}