Home


<img width="1120" alt="predictions" src="https://user-images.githubusercontent.com/42019759/189385051-382d18db-1a98-4ee7-8295-0c73782e8260.png"> <img width="1120" alt="predictions2" src="https://user-images.githubusercontent.com/42019759/189411569-730f1980-3d63-4b4a-b446-1f6df209e4cc.png">

ChiTransformer: Towards Reliable Stereo from Cues [CVPR 2022]

Paper:

https://user-images.githubusercontent.com/42019759/187740463-c49d9625-453b-4e9e-88d2-f06342765f1b.mp4

Monocular estimators are free from most of the ill-posedness faced by matching-based multi-view depth estimators. However, monocular depth estimators cannot provide reliable depth predictions due to the lack of epipolar constraints.

ChiTransformer is the first cue-based binocular depth estimator. It leverages the strengths of both approaches by introducing the Depth-Cue-Rectification (DCR) module, which rectifies the depth cues of the two views in a cross-attention fashion under the underlying epipolar constraints learned within the DCR module.
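
The details of the DCR module are in the paper; purely as an illustration of the cross-attention idea (not the actual implementation), a minimal PyTorch sketch might look like the following, with all class, function, and parameter names hypothetical:

    import torch
    import torch.nn as nn

    class DepthCueRectification(nn.Module):
        """Illustrative cross-attention block: tokens of one view attend to the
        other view to rectify their depth cues. Hypothetical sketch only; not
        the paper's DCR module."""

        def __init__(self, dim: int = 768, num_heads: int = 8):
            super().__init__()
            self.norm_q = nn.LayerNorm(dim)
            self.norm_kv = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.proj = nn.Linear(dim, dim)

        def forward(self, ref_tokens, src_tokens):
            # Queries come from the reference view, keys/values from the other view.
            q = self.norm_q(ref_tokens)
            kv = self.norm_kv(src_tokens)
            rectified, _ = self.attn(q, kv, kv)
            # Residual connection keeps the original monocular cues.
            return ref_tokens + self.proj(rectified)

    # Example: rectify a 1 x N x 768 token sequence of the left view with the right view.
    left, right = torch.randn(1, 300, 768), torch.randn(1, 300, 768)
    out = DepthCueRectification()(left, right)   # same shape as `left`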

https://user-images.githubusercontent.com/42019759/187740535-17e2d477-b751-49fc-b356-51d6acfb4432.mp4

In the video above, ChiTransformer is compared with the state-of-the-art monocular depth estimator DPT-hybrid to show the improvement in reliability, in terms of object-depth consistency and relative-position-depth consistency. Visually significant improvements can be observed.

Changelog

Setup

  1. Download the model weights and place them in the `weights` folder:

  2. Set up dependencies:

    pip install -r requirements.txt
    
    • The code was tested with Python 3.9, PyTorch 1.11.0, OpenCV 4.5.5, and timm 0.5.4 (a small version-check sketch follows this list).

    • Later versions of timm may not work due to changes in function definitions.

    • GPU memory should be at least 24 GB.
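
As a convenience, the installed versions can be checked against the tested ones with a small script like this; the version numbers come from the notes above, everything else is just a sketch:

    # Sanity-check installed package versions against the ones this repo was tested with.
    import cv2
    import timm
    import torch

    tested = {"torch": "1.11.0", "timm": "0.5.4", "cv2": "4.5.5"}
    installed = {"torch": torch.__version__, "timm": timm.__version__, "cv2": cv2.__version__}

    for name, want in tested.items():
        got = installed[name]
        status = "OK" if got.startswith(want) else "WARN"
        print(f"[{status}] {name}: installed {got}, tested with {want}")

    # Rough check of available GPU memory (>= 24 GB recommended above).
    if torch.cuda.is_available():
        mem_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        print(f"GPU memory: {mem_gb:.1f} GB")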

Usage

  1. Place one or more input images in the folder input.

  2. Run a stereo depth estimation model:

    python run_chitransformer.py -i ./inputs -o ./outputs -m [weight path] --kitti_crop --absolute_depth 
    
  3. The results are written to the folder output.
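
If you want to inspect a result programmatically, a sketch along the following lines can help; note that the output filename and encoding (assumed here to be a 16-bit PNG depth map) depend on run_chitransformer.py and are assumptions, not something this page specifies:

    # Illustrative only: color-map one predicted depth map from ./outputs.
    # The filename and the 16-bit PNG encoding are assumptions about the script's output.
    import cv2
    import numpy as np

    pred = cv2.imread("./outputs/example.png", cv2.IMREAD_UNCHANGED).astype(np.float32)
    pred = (pred - pred.min()) / (pred.max() - pred.min() + 1e-8)   # normalize to [0, 1]
    colored = cv2.applyColorMap((pred * 255).astype(np.uint8), cv2.COLORMAP_MAGMA)
    cv2.imwrite("./outputs/example_colored.png", colored)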

To train ChiTransformer (example multi-GPU launch with torchrun):

    torchrun --nproc_per_node=8 main.py --load_weight MODEL_PATH --data_path DATA_PATH --png --stereo --edge_smoothness --split SPLIT_TYPE --img_scales 0 --dcr_mode sp --rectilinear_epipolar_geometry [--freeze_embedder] [--only_dcr] [--train_refinenet] --epochs EPOCHS --lr_drop LR_DROP_POINT --learning_rate 0.00001 [--invert] [--crop]

Optional arguments should be set accordingly to achieve better performance. The current training pipeline is configured for (352, 1216) inputs; for input images of other sizes, you need to reconfigure accordingly. For more training options, please refer to `configs.py`.
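
For example, one simple way to bring images of a different resolution to the expected (352, 1216) shape is a width resize followed by a bottom crop, as in the sketch below; the actual preprocessing expected by the pipeline is defined in `configs.py` and may differ, so treat this only as an illustration:

    # Illustrative preprocessing sketch: bring an arbitrary image to the (352, 1216)
    # resolution used by the training pipeline. The pipeline's actual preprocessing
    # is defined in configs.py and may differ from this.
    import cv2

    def to_kitti_size(img, target_h=352, target_w=1216):
        h, w = img.shape[:2]
        # Resize so the width matches, then crop the height from the bottom,
        # where KITTI-style frames carry most of the usable depth information.
        scale = target_w / w
        img = cv2.resize(img, (target_w, int(round(h * scale))), interpolation=cv2.INTER_AREA)
        if img.shape[0] < target_h:
            raise ValueError("Image too short after resizing; choose another strategy.")
        return img[img.shape[0] - target_h:, :]

    img = to_kitti_size(cv2.imread("input/example.png"))   # hypothetical file name
    print(img.shape)                                        # (352, 1216, 3)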

Training tips:

    # Loss configuration used for KITTI training (see configs.py).
    if args.dataset == "kitti":
        args.max_depth = 80.0   # maximum depth in meters
        args.min_depth = 1e-3   # minimum depth in meters

        if args.edge_smoothness:
            args.smoothness_weight = 0.1

        if args.dcr_mode in ["sp", "spectrum"]:
            # Weights for the individual loss terms.
            weight_dict = {
                "reprojection_loss": 1.5,
                "orthog_reg": 0.1,
                "hoyer_reg": 1e-3,
                "fp_loss": 5e-5,
            }
            losses = [
                "reprojection_loss",
                "orthog_reg",
                "hoyer_reg",
                "fp_loss",
            ]
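
The weighted terms above are presumably combined into a single training objective; assuming the criterion returns a `loss_dict` keyed by the names in `losses`, a minimal sketch of that combination is:

    # Sketch: combine the individual loss terms into one weighted training objective.
    # Assumes the criterion returns a dict `loss_dict` keyed by the names in `losses`.
    import torch

    weight_dict = {"reprojection_loss": 1.5, "orthog_reg": 0.1, "hoyer_reg": 1e-3, "fp_loss": 5e-5}
    losses = ["reprojection_loss", "orthog_reg", "hoyer_reg", "fp_loss"]

    loss_dict = {k: torch.rand((), requires_grad=True) for k in losses}   # placeholder values
    total_loss = sum(weight_dict[k] * loss_dict[k] for k in losses)
    total_loss.backward()   # in the real training loop, this drives the optimizer step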

More qualitative comparisons with DPT

https://user-images.githubusercontent.com/42019759/187740352-7623a76b-3882-44db-8868-a6838c09d76e.mp4

<p align="left"> <img src="https://user-images.githubusercontent.com/42019759/186490687-28468fac-4fbd-4a66-a421-cb3bcc17b5cf.png" width="740"> </p> <p align="left"> <img src="https://user-images.githubusercontent.com/42019759/186490733-446cd8e5-7f92-44ae-8009-59c26291ac8a.png" width="740"> </p> <p align="left"> Check out the prediction on the whole sequence of KITTI_09_26_93 at: [https://youtu.be/WULXAFbuRqw] </p> <p align="left"> <img src="https://user-images.githubusercontent.com/42019759/186490776-09e8e8c8-e130-4088-9280-aee5236fc763.png" width="740"> </p> <p align="left"> Check out the prediction on the whole sequence of KITTI_10_03_34 at: [https://youtu.be/evif-Z8odYQ] </p>

[More videos]

Citation

Please cite this paper if you find the paper or code useful.

@inproceedings{su2022chitransformer,
  title={ChiTransformer: Towards Reliable Stereo from Cues},
  author={Su, Qing and Ji, Shihao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={1939--1949},
  year={2022}
}

Acknowledgements

The work builds on and uses code from DPT, Monodepth2, timm and PyTorch-Encoding. We'd like to thank the authors for making these libraries available.

License

MIT License