<h1 align="center"> Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail (arXiv) </h1> <br>

:rotating_light: This repository will contain download links to the code and trained deep stereo models of our work "Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail", arXiv

by Luca Bartolomei<sup>1,2</sup>, Fabio Tosi<sup>2</sup>, Matteo Poggi<sup>1,2</sup>, and Stefano Mattoccia<sup>1,2</sup>

Advanced Research Center on Electronic Systems (ARCES)<sup>1</sup>, University of Bologna<sup>2</sup>

<div class="alert alert-info"> <h2 align="center">

Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail (arXiv)<br>

Project Page | Paper

</h2> <img src="./images/teaser.png" alt="Stereo Anywhere teaser" style="width: 800px;" title="architecture"> <p style="text-align: justify;"><strong>Stereo Anywhere: Combining Monocular and Stereo Strengths for Robust Depth Estimation.</strong> Our model achieves accurate results in standard conditions (on Middlebury), while effectively handling non-Lambertian surfaces where stereo networks fail (on Booster) and perspective illusions that deceive monocular depth foundation models (on MonoTrap, our novel dataset).</p>

Note: 🚧 This repository is currently under development. We are actively adding and refining features and documentation; we apologize for any incomplete or missing elements and appreciate your patience as we work towards completion.

:bookmark_tabs: Table of Contents

</div>

:clapper: Introduction

We introduce Stereo Anywhere, a novel stereo-matching framework that combines geometric constraints with robust priors from monocular depth Vision Foundation Models (VFMs). By elegantly coupling these complementary worlds through a dual-branch architecture, we seamlessly integrate stereo matching with learned contextual cues. Following this design, our framework introduces novel cost volume fusion mechanisms that effectively handle critical challenges such as textureless regions, occlusions, and non-Lambertian surfaces. Through our novel optical illusion dataset, MonoTrap, and extensive evaluation across multiple benchmarks, we demonstrate that our synthetic-only trained model achieves state-of-the-art results in zero-shot generalization, significantly outperforming existing solutions while showing remarkable robustness to challenging cases such as mirrors and transparencies.
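
To make the dual-branch idea concrete, below is a minimal, illustrative PyTorch sketch: a standard correlation cost volume built from stereo features, a second volume scored against a monocular depth prior from a VFM, and a confidence-weighted fusion followed by soft-argmin regression. All function names, shapes, and the fusion rule are assumptions for illustration only and do not reflect the released Stereo Anywhere architecture.

```python
# Conceptual sketch (not the authors' implementation): fusing a geometric stereo
# cost volume with a cost volume derived from a monocular depth prior.
import torch
import torch.nn.functional as F


def correlation_volume(feat_left, feat_right, max_disp):
    """Standard stereo cost volume: correlate left features with right
    features shifted by each candidate disparity."""
    b, c, h, w = feat_left.shape
    volume = feat_left.new_zeros(b, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, d] = (feat_left * feat_right).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_left[..., d:] * feat_right[..., :-d]).mean(dim=1)
    return volume


def mono_prior_volume(mono_inv_depth, max_disp):
    """Turn a normalized monocular inverse-depth map into a soft cost volume
    by scoring each disparity hypothesis against the mono prior."""
    disp_guess = mono_inv_depth * (max_disp - 1)  # coarse disparity guess
    candidates = torch.arange(max_disp, device=mono_inv_depth.device).view(1, -1, 1, 1)
    return -torch.abs(candidates - disp_guess)    # higher score = better agreement


def fuse_and_regress(stereo_vol, mono_vol, alpha):
    """Blend the two volumes with a per-pixel confidence and take a
    soft-argmin over disparity candidates."""
    fused = alpha * stereo_vol + (1 - alpha) * mono_vol
    prob = F.softmax(fused, dim=1)
    disparities = torch.arange(fused.shape[1], device=fused.device).view(1, -1, 1, 1)
    return (prob * disparities).sum(dim=1)


if __name__ == "__main__":
    b, c, h, w, max_disp = 1, 32, 64, 96, 48
    feat_l, feat_r = torch.randn(b, c, h, w), torch.randn(b, c, h, w)
    mono = torch.rand(b, 1, h, w)          # normalized inverse depth from a VFM
    alpha = torch.full((b, 1, h, w), 0.5)  # e.g. lower on mirrors / textureless areas
    disp = fuse_and_regress(correlation_volume(feat_l, feat_r, max_disp),
                            mono_prior_volume(mono, max_disp), alpha)
    print(disp.shape)  # torch.Size([1, 64, 96])
```

In this toy fusion, the per-pixel weight `alpha` plays the role of a learned confidence: where stereo matching is unreliable (e.g., mirrors, transparencies, textureless regions) the monocular prior dominates, and vice versa.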

Contributions:

:fountain_pen: If you find this code useful in your research, please cite:

@article{bartolomei2024stereo,
  title={Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail},
  author={Bartolomei, Luca and Tosi, Fabio and Poggi, Matteo and Mattoccia, Stefano},
  journal={arXiv preprint arXiv:2412.04472},
  year={2024},
}

:inbox_tray: Pretrained Models

Here, you will be able to download the weights of our proposal trained on SceneFlow.

The download link will be released soon.
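
Once the checkpoint is available, loading it in PyTorch should follow the usual pattern sketched below; the filename, dictionary keys, and model class are placeholders, not the released API.

```python
# Illustrative loading pattern only: the checkpoint filename, key names, and
# the StereoAnywhere class are placeholders, not the released API.
import torch

checkpoint = torch.load("stereoanywhere_sceneflow.pth", map_location="cpu")
# model = StereoAnywhere(**checkpoint.get("hyperparams", {}))  # hypothetical class
# model.load_state_dict(checkpoint["state_dict"])              # key name assumed
# model.eval()
```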

:memo: Code

Details about training and testing scripts will be released soon.

:floppy_disk: Datasets

We used the SceneFlow dataset for training and eight datasets for evaluation.

Specifically, we evaluate our proposal and competitors using:

Details about datasets will be released soon.

:train2: Training

We will provide further information on how to train Stereo Anywhere soon.

:rocket: Test

We will provide further information on how to test Stereo Anywhere soon.

:art: Qualitative Results

In this section, we present illustrative examples that demonstrate the effectiveness of our proposal.

<br> <p float="left"> <img src="./images/qualitative1.png" width="800" /> </p>

Qualitative Results -- Zero-Shot Generalization. Predictions by state-of-the-art models and Stereo Anywhere. In particular, the first row shows an extremely challenging case for SceneFlow-trained models, where Stereo Anywhere achieves accurate disparity maps thanks to VFM priors.

<br> <p float="left"> <img src="./images/qualitative2.png" width="800" /> </p>

Qualitative results -- Zero-Shot non-Lambertian Generalization. Predictions by state-of-the-art models and Stereo Anywhere. Our proposal is the only stereo model that correctly perceives the mirror and the transparent railing.

<br> <p float="left"> <img src="./images/qualitative3.png" width="800" /> </p>

Qualitative results -- MonoTrap. The figure shows three samples where Depth Anything v2 fails while Stereo Anywhere does not.

:envelope: Contacts

For questions, please send an email to luca.bartolomei5@unibo.it

:pray: Acknowledgements

We would like to extend our sincere appreciation to the authors of the following projects for making their code available, which we have used in our work: