Improving Zero-shot Generalization and Robustness of Multi-modal Models <br> Yunhao Ge*, Jie Ren*, Andrew Gallagher, Yuxiao Wang, Ming-Hsuan Yang, Hartwig Adam, Laurent Itti, Balaji Lakshminarayanan, Jiaping Zhao (* = equal contribution) <br> IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

Project Page | Video | Paper

<div align="center"> <img src="./docs/Fig-1.png" alt="Editor" width="1500"> </div>

Figure: Our zero-shot classification pipeline consists of two steps: confidence estimation via self-consistency (left block), and top-down and bottom-up label augmentation using the WordNet hierarchy (right block).
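
The two steps in the figure can be illustrated with a toy sketch. This is not the code in this repository: the function names (`is_self_consistent`, `augment_label`), the "which is a kind of" prompt wording, and the hand-written hierarchy are all assumptions made for illustration. The sketch treats a prediction as confident only when every prompt variant agrees on the top-1 class, and it builds augmented label texts from a label's parent (top-down) and children (bottom-up).

```python
import numpy as np


def is_self_consistent(score_sets):
    """Return True if every score vector picks the same top-1 class.

    score_sets: list of 1-D arrays of per-class scores, one array per
    prompt (or input) variant. Hypothetical helper, for illustration only.
    """
    top1_predictions = {int(np.argmax(scores)) for scores in score_sets}
    return len(top1_predictions) == 1


def augment_label(label, parent_of, children_of):
    """Build augmented label texts from a toy hierarchy.

    Top-down: attach the parent to the label, if it has one.
    Bottom-up: add one text per child of the label.
    The prompt wording here is an assumption, not the paper's exact template.
    """
    parent = parent_of.get(label)
    texts = [f"{label}, which is a kind of {parent}" if parent else label]
    for child in children_of.get(label, []):
        texts.append(f"{child}, which is a kind of {label}")
    return texts


# Toy scores for 3 classes under 3 prompt variants (made-up numbers).
scores_by_prompt = [
    np.array([0.1, 0.7, 0.2]),
    np.array([0.2, 0.5, 0.3]),
    np.array([0.6, 0.3, 0.1]),  # this variant disagrees with the others
]

# Toy stand-in for a WordNet-style hierarchy (hand-written, not real WordNet).
parent_of = {"tabby cat": "cat", "siamese cat": "cat"}
children_of = {"cat": ["tabby cat", "siamese cat"]}

if not is_self_consistent(scores_by_prompt):
    # Low-confidence case: augment the candidate labels and re-score them.
    for text in augment_label("cat", parent_of, children_of):
        print(text)
```

In this sketch, only inputs that fail the consistency check would have their candidate labels augmented and re-scored; consistent predictions are left unchanged.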

<div align="center"> <img src="./docs/Fig-2.png" alt="Editor" width="1500"> </div>

Figure: Typical failure modes in cases where the correct label appears in the top-5 predictions but the top-1 prediction is wrong.

Getting Started


git clone https://github.com/gyhandy/Hierarchy-CLIP.git
cd Hierarchy-CLIP
git clone https://github.com/google-research/scenic.git
cd scenic
pip install .

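After running the commands above, a quick way to confirm that the install succeeded is to check that the package can be found by Python. The snippet below is a minimal sketch that assumes the Scenic package is importable under the name `scenic`; the helper `is_installed` is hypothetical and not part of this repository.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "scenic" is the assumed import name of the package installed above.
if is_installed("scenic"):
    print("scenic is available")
else:
    print("scenic was not found; re-run `pip install .` inside the scenic directory")
```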
