Awesome

CVinW Readings

``Computer Vision in the Wild (CVinW)'' is an emerging research field. This writeup provides a quick introduction of CVinW and maintains a collection of papers on the topic. If you find some missing papers or resources, please open issues or pull requests (recommended).

What is Computer Vision in the Wild (CVinW)?
Papers on Task-level Transfer with Pre-trained Models
Papers on Efficient Model Adaptation
- Parameter-Efficient Methods
- Others
Papers on Out-of-domain Generalization
Acknowledgements

What is Computer Vision in the Wild?

:star: Goals of CVinW

Developing a transferable foundation model/system that can effortlessly adapt to a large range of visual tasks in the wild. It comes with two key factors: (i) The task transfer scenarios are broad, and (ii) The task transfer cost is low. The main idea is illustrated as follows, please see the detailed description in ELEVATER paper.

:one: Task Transfer Scenarios are Broad

We illustrate and compare CVinW with other settings using a 2D chart in Figure 1, where the space is constructed with two orthogonal dimensions: input image distribution and output concept set. The 2D chart is divided into four quadrants, based on how the model evaluation stage is different from model development stage. For any visual recognition problems at different granularity such as image classification, object detection and segmentation, the modeling setup cann be categorized into one of the four settings. We see an emerging trend on moving towards CVinW. Interested in the various pre-trained vision models that move towards CVinW? please check out Section :fire:``Papers on Task-level Transfer with Pre-trained Models''.

<table> <tr> <td width="50%"> <ul> <li>The Close-Set Setting. Both training and evaluation distributions are consistent in both dimensions, a typical setting in ML/CV textbooks.</li> <li>Open-Set/Vocabulary/World Setting. It allows new concepts in evaluation, while typically remains the same visual domain. Please see examples in <a href='https://arxiv.org/abs/1707.00600'>image classification</a> and <a href='https://arxiv.org/abs/2011.10678'>object detection</a>. </li> <li>Domain Generalization Setting. Domain shift allows new visual domain in evaluation, while typically remains the same concept pool. Please see examples such as <a href='https://arxiv.org/abs/2007.01434'>DomainBed</a> and <a href='http://ai.bu.edu/M3SDA/'>DomainNet</a>. </li> <li style="background-color:powderblue;">Computer Vision in the Wild Setting. CVinW allows the flexibility in both dimensions, where any new tasks/datasets in the wild essentially fall into.</li> </ul> </td> <td> <img src="images/fig_cvinw.png" style="width:100%;"> </td> </tr> <tr> <th> A brief definition with a four-quadrant chart </th> <th>Figure 1: The comparison of CVinW with other existing settings</th> </tr> </table>

:two: Task Transfer Cost is Low

One major advantage of pre-trained models is the promise that they can transfer to downstream tasks effortlessly. The model adaptation cost is considered in two orthogonal dimensions: sample-efficiency and parameter-efficiency, as illustrated in Figure 2. The bottom-left corner and top-right corner is the most inexpensive and expensive adaptation strategy, respectively. One may interpolate and make combinations in the 2D space, to get different model adaptation methods with different cost. To efficient adapt large vision models of the gradaully increaseing size, we see an emerging need on efficient model adaptation. Interested in contributing your smart efficient adaptation algorithms and see how it differs from existing papers? please check out Section :snowflake:``Papers on Efficient Model Adaptation'' .

<table> <tr> <td width="50%"> <ul> <li>Sample-efficiency: Zero-, Few-, and Full-shot. Due to the high cost of annotating data, it is often desired to provide a small number of labeled image-label pairs in downstream datasets. Transferable models should be able to reach high performance in this data-limited scenario..</li> <li>Parameter-efficiency: Frozen Model Inference, Prompting Tuning, Linear Probing vs Full Model Fine-tuning.. A smaller number of trainable parameter in model adaptation typically means a small training cost in a new task. </li> </ul> </td> <td> <img src="images/fig_adapation_cost.png" style="width:100%;"> </td> </tr> <tr> <th> A breakdown definition of efficient model adaptation</th> <th>Figure 2: The 2D chart of model adaptation cost.</th> </tr> </table>

:cinema: Benchmarks

ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models. Chunyuan Li*, Haotian Liu*, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, Jianfeng Gao. NeurIPS 2022 (Datasets and Benchmarks Track). <a href='https://arxiv.org/abs/2204.08790'>[paper]</a> <a href='https://computer-vision-in-the-wild.github.io/ELEVATER/'>[benchmark]</a>

:loudspeaker: News

[09/2023] 🔥 Discover the fascinating journey of "Multimodal Foundation Models: From Specialists to General-Purpose Assistants" 🌐 Dive into the evolution of large models in #ComputerVision & #VisionLanguage! This is based on our CVPR 2023 Tutorial, where you could find videos and slides of the core chapters. For its preceding paper, please check out Vision-Language Pre-training: Basics, Recent Advances, and Future Trends

[02/2023] Organizing the 2nd Workshop @ CVPR2023 on Computer Vision in the Wild (CVinW), where two new challenges are hosted to evaluate the zero-shot, few-shot and full-shot performance of pre-trained vision models in downstream tasks:
- ``Segmentation in the Wild (SGinW)'' Challenge evaluates on 25 image segmentation tasks.
- ``Roboflow 100 for Object Detection in the Wild'' Challenge evaluates on 100 object detection tasks.

$\qquad$ <img src="images/cvpr-2023-logo.jpeg" width=10%/> [Workshop] $\qquad$ <img src="images/sginw.jpg" width=10%/> [SGinW Challenge] $\qquad$ <img src="images/rf100.png" width=10%/> [RF100 Challenge]

[09/2022] Organizing ECCV Workshop Computer Vision in the Wild (CVinW), where two challenges are hosted to evaluate the zero-shot, few-shot and full-shot performance of pre-trained vision models in downstream tasks:
- ``Image Classification in the Wild (ICinW)'' Challenge evaluates on 20 image classification tasks.
- ``Object Detection in the Wild (ODinW)'' Challenge evaluates on 35 object detection tasks.

$\qquad$ <img src="images/eccv2022-logo.png" width=10%/> [Workshop] $\qquad$ <img src="images/icinw100.jpg" width=10%/> [ICinW Challenge] $\qquad$ <img src="images/odinw.jpg" width=10%/> [ODinW Challenge]

:fire: Papers on Task-level Transfer with Pre-trained Models

:orange_book: Image Classification in the Wild

[CLIP] Learning Transferable Visual Models From Natural Language Supervision. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. ICML 2021. <a href='https://arxiv.org/abs/2103.00020'>[paper]</a> <a href='https://github.com/OpenAI/CLIP'>[code]</a> [ALIGN] Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. ICML 2021. <a href='https://arxiv.org/abs/2102.05918'>[paper]</a> OpenCLIP. Gabriel Ilharco*, Mitchell Wortsman*, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, John Miller, Hongseok Namkoong, Hannaneh Hajishirzi, Ali Farhadi, Ludwig Schmidt. 10.5281/zenodo.5143773, 2021. <a href='https://github.com/mlfoundations/open_clip'>[code]</a> Florence: A New Foundation Model for Computer Vision. Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, Pengchuan Zhang. arXiv:2111.11432, 2022. <a href='https://arxiv.org/abs/2111.11432'>[paper]</a> [UniCL] Unified Contrastive Learning in Image-Text-Label Space. Jianwei Yang*, Chunyuan Li*, Pengchuan Zhang*, Bin Xiao*, Ce Liu, Lu Yuan, Jianfeng Gao. CVPR 2022. <a href='https://arxiv.org/abs/2204.03610'>[paper]</a> <a href='https://github.com/microsoft/UniCL'>[code]</a> LiT: Zero-Shot Transfer with Locked-image text Tuning. Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, Lucas Beyer. CVPR 2022. <a href='https://arxiv.org/abs/2111.07991'>[paper]</a> [DeCLIP] Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm. Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, Junjie Yan. ICLR 2022. <a href='https://arxiv.org/abs/2110.05208'>[paper]</a> <a href='https://github.com/Sense-GVT/DeCLIP'>[code]</a> FILIP: Fine-grained Interactive Language-Image Pre-Training. Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, Chunjing Xu. ICLR 2022. <a href='https://arxiv.org/abs/2111.07783'>[paper]</a> SLIP: Self-supervision meets Language-Image Pre-training. Norman Mu, Alexander Kirillov, David Wagner, Saining Xie. ECCV 2022. <a href='https://arxiv.org/abs/2112.12750'>[paper]</a> <a href='https://github.com/facebookresearch/SLIP'>[code]</a> [MS-CLIP]: Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training. Haoxuan You*, Luowei Zhou*, Bin Xiao*, Noel Codella*, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan. ECCV 2022. <a href='https://arxiv.org/abs/2207.12661'>[paper]</a> <a href='https://github.com/Hxyou/MSCLIP'>[code]</a> MultiMAE: Multi-modal Multi-task Masked Autoencoders. Roman Bachmann, David Mizrahi, Andrei Atanov, Amir Zamir. ECCV 2022. <a href='https://arxiv.org/abs/2204.01678'>[paper]</a> <a href='https://github.com/EPFL-VILAB/MultiMAE'>[code]</a> <a href='https://multimae.epfl.ch/'>[project]</a> [M3AE] Multimodal Masked Autoencoders Learn Transferable Representations. Xinyang Geng, Hao Liu, Lisa Lee, Dale Schuurmans, Sergey Levine, Pieter Abbeel. arXiv:2205.14204, 2022. <a href='https://arxiv.org/abs/2205.14204'>[paper]</a> <a href='https://github.com/young-geng/m3ae_public'>[code]</a> CyCLIP: Cyclic Contrastive Language-Image Pretraining. Shashank Goel, Hritik Bansal, Sumit Bhatia, Ryan A. Rossi, Vishwa Vinay, Aditya Grover. NeurIPS 2022. <a href='https://arxiv.org/abs/2205.14459'>[paper]</a> <a href='https://github.com/goel-shashank/CyCLIP'>[code]</a> PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining. Yuting Gao, Jinfeng Liu, Zihan Xu, Jun Zhang, Ke Li, Rongrong Ji, Chunhua Shen. NeurIPS 2022. <a href='https://arxiv.org/abs/2204.14095'>[paper]</a> K-LITE: Learning Transferable Visual Models with External Knowledge. Sheng Shen*, Chunyuan Li*, Xiaowei Hu*, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Anna Rohrbach, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Jianfeng Gao. NeurIPS 2022. <a href='https://arxiv.org/abs/2204.09222'>[paper]</a> UniCLIP: Unified Framework for Contrastive Language-Image Pre-training. Janghyeon Lee, Jongsuk Kim, Hyounguk Shon, Bumsoo Kim, Seung Hwan Kim, Honglak Lee, Junmo Kim. NeurIPS 2022. <a href='https://arxiv.org/abs/2209.13430'>[paper]</a> CoCa: Contrastive Captioners are Image-Text Foundation Models. Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu. arXiv:2205.01917, 2022. <a href='https://arxiv.org/abs/2205.01917'>[paper]</a> Prefix Conditioning Unifies Language and Label Supervision. Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee, Kate Saenko, Tomas Pfister. arXiv:2206.01125, 2022. <a href='https://arxiv.org/abs/2206.01125'>[paper]</a> MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining. Xiaoyi Dong, Yinglin Zheng, Jianmin Bao, Ting Zhang, Dongdong Chen, Hao Yang, Ming Zeng, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu. arXiv:2208.12262, 2021 <a href='https://arxiv.org/abs/2208.12262'>[paper]</a> [BASIC] Combined Scaling for Open-Vocabulary Image Classification. Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, Mingxing Tan, Quoc V. Le. arXiv:2111.10050, 2022. <a href='https://arxiv.org/abs/2111.10050'>[paper]</a> [LIMoE] Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts. Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, Neil Houlsby. arXiv:2206.02770, 2022. <a href='https://arxiv.org/abs/2206.02770'>[paper]</a> MURAL: Multimodal, Multitask Retrieval Across Languages. Aashi Jain, Mandy Guo, Krishna Srinivasan, Ting Chen, Sneha Kudugunta, Chao Jia, Yinfei Yang, Jason Baldridge. arXiv:2109.05125, 2022. <a href='https://arxiv.org/abs/2109.05125'>[paper]</a> PaLI: A Jointly-Scaled Multilingual Language-Image Model. Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut. arXiv:2209.06794, 2022. <a href='https://arxiv.org/abs/2209.06794'>[paper]</a> Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese. An Yang, Junshu Pan, Junyang Lin, Rui Men, Yichang Zhang, Jingren Zhou, Chang Zhou. arXiv:2211.01335, 2022. <a href='https://arxiv.org/abs/2211.01335'>[paper]</a> <a href='https://github.com/OFA-Sys/Chinese-CLIP'>[project]</a> [FLIP] Scaling Language-Image Pre-training via Masking. Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He. arXiv:2212.00794, 2022. <a href='https://arxiv.org/abs/2212.00794'>[paper]</a> Self Supervision Does Not Help Natural Language Supervision at Scale. Floris Weers, Vaishaal Shankar, Angelos Katharopoulos, Yinfei Yang, Tom Gunter. arXiv:2301.07836, 2023. <a href='https://arxiv.org/abs/2301.07836'>[paper]</a> STAIR: Learning Sparse Text and Image Representation in Grounded Tokens. Chen Chen, Bowen Zhang, Liangliang Cao, Jiguang Shen, Tom Gunter, Albin Madappally Jose, Alexander Toshev, Jonathon Shlens, Ruoming Pang, Yinfei Yang. arXiv:2301.13081, 2023. <a href='https://arxiv.org/abs/2301.13081'>[paper]</a> [REACT] Learning Customized Visual Models with Retrieval-Augmented Knowledge. Haotian Liu, Kilho Son, Jianwei Yang, Ce Liu, Jianfeng Gao, Yong Jae Lee, Chunyuan Li. CVPR, 2023. <a href='https://arxiv.org/abs/2301.07094'>[paper]</a> <a href='https://react-vl.github.io/'>[project]</a> <a href='https://github.com/microsoft/react'>[code]</a>

:orange_book: Object Detection in the Wild

Open-Vocabulary Object Detection Using Captions. Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, Shih-Fu Chang. CVPR 2021. <a href='https://arxiv.org/abs/2011.10678'>[paper]</a> MDETR: Modulated Detection for End-to-End Multi-Modal Understanding. Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion. ICCV 2021. <a href='https://arxiv.org/abs/2104.12763'>[paper]</a> <a href='https://github.com/ashkamath/mdetr'>[code]</a> [VILD] Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, Yin Cui. ICLR 2022. <a href='https://arxiv.org/abs/2104.13921'>[paper]</a> [GLIP] Grounded Language-Image Pre-training. Liunian Harold Li*, Pengchuan Zhang*, Haotian Zhang*, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, Jianfeng Gao CVPR 2022. <a href='https://arxiv.org/abs/2112.03857'>[paper]</a> <a href='https://github.com/microsoft/GLIP'>[code]</a> RegionCLIP: Region-based Language-Image Pretraining. Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, Jianfeng Gao. CVPR 2022. <a href='https://arxiv.org/abs/2112.09106'>[paper]</a> <a href='https://github.com/microsoft/RegionCLIP'>[code]</a> [OV-DETR] Open-Vocabulary DETR with Conditional Matching Yuhang. Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, Chen Change Loy. ECCV 2022. <a href='https://arxiv.org/abs/2203.11876'>[paper]</a> <a href='https://github.com/yuhangzang/ov-detr'>[code]</a> [OWL-ViT] Simple Open-Vocabulary Object Detection with Vision Transformers. Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby. ECCV 2022. <a href='https://arxiv.org/abs/2205.06230'>[paper]</a> [Detic] Detecting Twenty-thousand Classes using Image-level Supervision. Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra. ECCV 2022. <a href='https://arxiv.org/abs/2201.02605'>[paper]</a> <a href='https://github.com/facebookresearch/Detic/'>[code]</a> X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks. Zhaowei Cai, Gukyeong Kwon, Avinash Ravichandran, Erhan Bas, Zhuowen Tu, Rahul Bhotika, Stefano Soatto. ECCV 2022. <a href='https://arxiv.org/abs/2204.05626'>[paper]</a> FindIt: Generalized Localization with Natural Language Queries. Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova. ECCV 2022. <a href='https://arxiv.org/abs/2203.17273'>[paper]</a> Open Vocabulary Object Detection with Pseudo Bounding-Box Labels. Mingfei Gao, Chen Xing, Juan Carlos Niebles, Junnan Li, Ran Xu, Wenhao Liu, Caiming Xiong. ECCV 2022. <a href='https://arxiv.org/abs/2111.09452'>[paper]</a> <a href='https://github.com/salesforce/PB-OVD'>[code]</a> <a href='https://blog.salesforceairesearch.com/pb-ovd-open-vocabulary-object-detector/'>[blog]</a> PromptDet: Towards Open-vocabulary Detection using Uncurated Images. Chengjian Feng, Yujie Zhong, Zequn Jie, Xiangxiang Chu, Haibing Ren, Xiaolin Wei, Weidi Xie, Lin Ma. ECCV 2022. <a href='https://arxiv.org/abs/2203.16513'>[paper]</a> <a href='https://fcjian.github.io/promptdet/'>[code]</a> Class-agnostic Object Detection with Multi-modal Transformer. Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer and Ming-Hsuan Yang. ECCV 2022. <a href='https://arxiv.org/abs/2111.11430'>[paper]</a> <a href='https://github.com/mmaaz60/mvits_for_class_agnostic_od'>[code]</a> [Object Centric OVD] -- Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection. Hanoona Rasheed, Muhammad Maaz, Muhammad Uzair Khattak, Salman Khan, Fahad Shahbaz Khan. NeurIPS 2022 <a href='https://arxiv.org/abs/2207.03482'>[paper]</a> <a href='https://github.com/hanoonaR/object-centric-ovd'>[code]</a> [GLIPv2] Unifying Localization and Vision-Language Understanding. Haotian Zhang*, Pengchuan Zhang*, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao. NeurIPS 2022 <a href='https://arxiv.org/abs/2206.05836'>[paper]</a> <a href='https://github.com/microsoft/GLIP'>[code]</a> [FIBER] Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone. Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, Jianfeng Gao, Lijuan Wang. NeurIPS 2022 <a href='https://arxiv.org/abs/2206.07643'>[paper]</a> <a href='https://github.com/microsoft/FIBER'>[code]</a> [DetCLIP] Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection. Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei zhang, Zhenguo Li, Chunjing Xu, Hang Xu. NeurIPS 2022 <a href='https://arxiv.org/abs/2209.09407v1'>[paper]</a> OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training. Tiancheng Zhao, Peng Liu, Xiaopeng Lu, Kyusong Lee arxiv:2209.0594 <a href='https://arxiv.org/abs/2209.05946'>[paper]</a> F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models . Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova. ICLR 2023. <a href='https://arxiv.org/abs/2209.15639'>[paper]</a> <a href='https://sites.google.com/view/f-vlm/home'>[project]</a> [OVAD] Open-vocabulary Attribute Detection . María A. Bravo, Sudhanshu Mittal, Simon Ging, Thomas Brox. arxiv:2211.12914 <a href='https://arxiv.org/abs/2211.12914'>[paper]</a> <a href='https://ovad-benchmark.github.io/'>[project]</a> Learning Object-Language Alignments for Open-Vocabulary Object Detection. Chuang Lin, Peize Sun, Yi Jiang, Ping Luo, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan, Jianfei Cai ICLR 2023. <a href='https://openreview.net/forum?id=mjHlitXvReu'>[paper]</a> <a href='https://github.com/clin1223/VLDet'>[code]</a> Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang Arxiv 2023. <a href='https://arxiv.org/abs/2303.05499v4'>[paper]</a> <a href='https://github.com/IDEA-Research/GroundingDINO'>[code]</a>

:orange_book: 3D Object Detection in the Wild

Open-Set 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning. Anonymous. <a href='https://openreview.net/forum?id=1yclzf1DWsf'>[OpenReview]</a>

:orange_book: Segmentation in the Wild

PhraseCut: Language-based Image Segmentation in the Wild Chenyun Wu, Zhe Lin, Scott Cohen, Trung Bui, Subhransu Maji. CVPR 2020. <a href='https://arxiv.org/abs/2008.01187'>[paper]</a> <a href='https://people.cs.umass.edu/~chenyun/publication/phrasecut/'>[project]</a> GroupViT: Semantic Segmentation Emerges from Text Supervision Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang. CVPR 2022. <a href='https://arxiv.org/abs/2202.11094'>[paper]</a> <a href='https://github.com/NVlabs/GroupViT/'>[code]</a> <a href='https://jerryxu.net/GroupViT/'>[project]</a> [CLIPSeg] Image Segmentation Using Text and Image Prompts. Timo Lüddecke, Alexander S. Ecker. CVPR 2022. <a href='https://arxiv.org/abs/2112.10003'>[paper]</a> <a href='https://github.com/timojl/clipseg'>[code]</a> DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting. Yongming Rao*, Wenliang Zhao*, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, Jiwen Lu. CVPR, 2022 <a href='https://arxiv.org/abs/2112.01518'>[paper]</a> <a href='https://github.com/raoyongming/DenseCLIP'>[code]</a> Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling. Dat Huynh, Jason Kuen, Zhe Lin, Jiuxiang Gu, Ehsan Elhamifar. CVPR 2022. <a href='https://arxiv.org/abs/2111.12698'>[paper]</a> [ZegFormer] Decoupling Zero-Shot Semantic Segmentation. Jian Ding, Nan Xue, Gui-Song Xia, Dengxin Dai. CVPR 2022. <a href='https://openaccess.thecvf.com/content/CVPR2022/papers/Ding_Decoupling_Zero-Shot_Semantic_Segmentation_CVPR_2022_paper.pdf'>[paper]</a> <a href='https://github.com/dingjiansw101/ZegFormer'>[code]</a> [OpenSeg] Scaling Open-Vocabulary Image Segmentation with Image-Level Labels. Golnaz Ghiasi, Xiuye Gu, Yin Cui, Tsung-Yi Lin. ECCV 2022. <a href='https://arxiv.org/abs/2112.12143'>[paper]</a> A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model. Mengde Xu, Zheng Zhang, Fangyun Wei, Yutong Lin, Yue Cao, Han Hu, Xiang Bai. ECCV 2022 <a href='https://arxiv.org/abs/2112.14757'>[paper]</a> <a href='https://github.com/MendelXu/zsseg.baseline'>[code]</a> [MaskCLIP] Extract Free Dense Labels from CLIP. Chong Zhou, Chen Change Loy, Bo Dai. ECCV 2022 <a href='https://arxiv.org/abs/2112.01071'>[paper]</a> <a href='https://github.com/chongzhou96/MaskCLIP'>[code]</a> Open-Vocabulary Panoptic Segmentation with MaskCLIP. Zheng Ding, Jieke Wang, Zhuowen Tu. arXiv:2208.08984, 2022 <a href='https://arxiv.org/abs/2208.08984'>[paper]</a> [LSeg] Language-driven Semantic Segmentation. Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, René Ranftl. ICLR 2022. <a href='https://arxiv.org/pdf/2201.03546.pdf'>[paper]</a> <a href='https://github.com/isl-org/lang-seg'>[code]</a> Perceptual Grouping in Vision-Language Models. Kanchana Ranasinghe, Brandon McKinzie, Sachin Ravi, Yinfei Yang, Alexander Toshev, Jonathon Shlens. arXiv 2210.09996, 2022. <a href='https://arxiv.org/abs/2210.09996'>[paper]</a> [TCL] Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs. Junbum Cha, Jonghwan Mun, Byungseok Rho. CVPR 2023. <a href='https://arxiv.org/abs/2212.00785'>[paper]</a> <a href='https://github.com/kakaobrain/tcl'>[code]</a> <a href='https://huggingface.co/spaces/khanrc/tcl'>[demo]</a> [OVSeg] Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP. Feng Liang, Bichen Wu, Xiaoliang Dai, Kunpeng Li, Yinan Zhao, Hang Zhang, Peizhao Zhang, Peter Vajda, Diana Marculescu. CVPR 2023. <a href='https://arxiv.org/abs/2210.04150'>[paper]</a> <a href='https://jeff-liangf.github.io/projects/ovseg/'>[project]</a> <a href='https://huggingface.co/spaces/facebook/ov-seg'>[demo]</a> [X-Decoder] Generalized Decoding for Pixel, Image, and Language. Xueyan Zou*, Zi-Yi Dou*, Jianwei Yang*, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee^, Jianfeng Gao^. CVPR 2023. <a href='https://arxiv.org/abs/2212.11270'>[paper]</a> <a href='https://x-decoder-vl.github.io/'>[project]</a> <a href='https://github.com/microsoft/X-Decoder/'>[code]</a> <a href='https://huggingface.co/spaces/xdecoder/Demo'>[demo]</a> [SAN] Side Adapter Network for Open-Vocabulary Semantic Segmentation. Mengde Xu, Zheng Zhang, Fangyun Wei, Han Hu, Xiang Bai. CVPR 2023. <a href='https://arxiv.org/abs/2302.12242'>[paper]</a> <a href='https://github.com/MendelXu/SAN'>[code]</a> [ODISE] Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models. Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiaolong Wang, Shalini De Mello. CVPR 2023. <a href='https://arxiv.org/abs/2303.04803'>[paper]</a> <a href='https://jerryxu.net/ODISE/'>[project]</a> <a href='https://github.com/NVlabs/ODISE'>[code]</a> <a href='https://huggingface.co/spaces/xvjiarui/ODISE'>[demo]</a> [OpenSeeD] A Simple Framework for Open-Vocabulary Segmentation and Detection. Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng Gao, Jianwei Yang, Lei Zhang. arxiv 2303.08131, 2023. <a href='https://arxiv.org/abs/2303.08131'>[paper]</a> <a href='https://github.com/IDEA-Research/OpenSeeD'>[code]</a> [SAM] Segment Anything. Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. 2023. <a href='https://ai.facebook.com/research/publications/segment-anything/'>[paper]</a> <a href='https://github.com/facebookresearch/segment-anything'>[code]</a> <a href='https://segment-anything.com/'>[project]</a> <a href='https://segment-anything.com/demo/'>[demo]</a> [SEEM] Segment Everything Everywhere All at Once. Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, Yong Jae Lee. arxiv 2303.06718, 2023. <a href='https://arxiv.org/abs/2304.06718'>[paper]</a> <a href='https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once'>[code]</a>

:orange_book: 3D Segmentation in the Wild

Language-Grounded Indoor 3D Semantic Segmentation in the Wild. David Rozenberszki, Or Litany, Angela Dai. ECCV 2022. <a href='https://arxiv.org/abs/2204.07761'>[paper]</a> <a href='https://github.com/RozDavid/LanguageGroundedSemseg'>[code]</a> <a href='https://rozdavid.github.io/scannet200'>[project]</a> Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models. Huy Ha, Shuran Song. CoRL 2022. <a href='https://arxiv.org/abs/2207.11514'>[paper]</a> <a href='https://semantic-abstraction.cs.columbia.edu/'>[project]</a> OpenScene: 3D Scene Understanding with Open Vocabularies. Songyou Peng, Kyle Genova, Chiyu "Max" Jiang, Andrea Tagliasacchi, Marc Pollefeys, Thomas Funkhouser. CVPR 2023. <a href='https://arxiv.org/abs/2211.15654'>[paper]</a> <a href='https://pengsongyou.github.io/openscene'>[project]</a> Language-driven Open-Vocabulary 3D Scene Understanding Runyu Ding, Jihan Yang, Chuhui Xue, Wenqing Zhang, Song Bai, Xiaojuan Qi. arXiv:2211.16312 , 2022. <a href='https://arxiv.org/abs/2211.16312'>[paper]</a> <a href='https://github.com/CVMI-Lab/PLA'>[code]</a>

CLIP2Scene: Towards Label-Efficient 3D Scene Understanding by CLIP Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, Wenping Wang CVPR 2023. [paper] [code]

Towards Label-Free Scene Understanding by Vision Foundation Models Runnan Chen, Youquan Liu, Lingdong Kong, Nenglun Chen, Xinge Zhu, Yuexin Ma, Tongliang Liu, Wenping Wang NeurIPS 2023. [paper] [code]

Segment Any Point Cloud Sequences by Distilling Vision Foundation Models Youquan Liu, Lingdong Kong, Jun Cen, Runnan Chen, Wenwei Zhang, Liang Pan, Kai Chen, Ziwei Liu NeurIPS 2023 (Spotlight). [paper] [code] [project]

:orange_book: Video Classification in the Wild

Expanding Language-Image Pretrained Models for General Video Recognition. Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling. ECCV 2022 <a href='https://arxiv.org/abs/2208.02816'>[paper]</a> <a href='https://github.com/microsoft/VideoX'>[code]</a> Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models. Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui. arXiv:2207.07646 <a href='https://arxiv.org/abs/2207.07646v1'>[paper]</a>

:orange_book: Other Visual Recognition in the Wild

RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection. Hangjie Yuan, Jianwen Jiang, Samuel Albanie, Tao Feng, Ziyuan Huang, Dong Ni, Mingqian Tang. NeurIPS 2022 <a href='https://arxiv.org/abs/2209.01814'>[paper]</a> <a href='https://github.com/JacobYuan7/RLIP'>[code]</a>

:orange_book: Grounded Image Generation in the Wild

:new: This is a new research topic: grounded image generation based on any open-set concept, include text and visual prompt. All the text-to-image pre-trained generation models allow open-set prompting at the image-level, and thus belong to ``Grounded Image Generation in the Wild'' by default. This paper collection focuses on more fine-grained controlability in the image generation, such as specifying new concept at the the level of bounding box, masks, edge/depth maps etc.

GLIGEN: Open-Set Grounded Text-to-Image Generation. Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, Yong Jae Lee. CVPR 2023 <a href='https://arxiv.org/abs/2301.07093'>[paper]</a> <a href='https://gligen.github.io/'>[project]</a> <a href='https://github.com/gligen/GLIGEN'>[code]</a> <a href='https://aka.ms/gligen'>[demo]</a> <a href='https://youtu.be/-MCkU7IAGKs'>[YouTube video]</a> ReCo: Region-Controlled Text-to-Image Generation. Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, Lijuan Wang. CVPR 2023 <a href='https://arxiv.org/pdf/2211.15518.pdf'>[paper]</a> [ControlNet] Adding Conditional Control to Text-to-Image Diffusion Models. Lvmin Zhang, Maneesh Agrawala. arxiv 2302.05543, 2023 <a href='https://arxiv.org/abs/2302.05543'>[paper]</a> <a href='https://github.com/lllyasviel/ControlNet'>[code]</a> T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models. Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie. arxiv 2302.08453, 2023 <a href='https://arxiv.org/abs/2302.08453'>[paper]</a> <a href='https://github.com/TencentARC/T2I-Adapter'>[code]</a> Composer: Creative and Controllable Image Synthesis with Composable Conditions. Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, Jingren Zhou. arxiv 2302.09778 2023 <a href='https://arxiv.org/abs/2302.09778'>[paper]</a> <a href='https://damo-vilab.github.io/composer-page/'>[project]</a> <a href='https://github.com/damo-vilab/composer'>[code]</a> Universal Guidance for Diffusion Models. Arpit Bansal, Hong-Min Chu, Avi Schwarzschild, Soumyadip Sengupta, Micah Goldblum, Jonas Geiping, Tom Goldstein. arxiv2302.07121, 2023 <a href='https://arxiv.org/abs/2302.07121'>[paper]</a> <a href='https://github.com/arpitbansal297/Universal-Guided-Diffusion'>[code]</a> Modulating Pretrained Diffusion Models for Multimodal Image Synthesis. Cusuh Ham, James Hays, Jingwan Lu, Krishna Kumar Singh, Zhifei Zhang, Tobias Hinz. arxiv2302.12764, 2023 <a href='https://arxiv.org/abs/2302.12764'>[paper]</a>

:orange_book: Large Multimodal Models

:new: This is a new research topic: build general-purpose multimodal assistants based on large language models (LLM). One prominent example is OpenAI Multimodal GPT-4. A comphrensive list paper list is compiled at Awesome-Multimodal-Large-Language-Models. Our collection maintains the a brief list for the completenes of CVinW.

GPT-4 Technical Report. OpenAI arxiv2303.08774, 2023 <a href='https://arxiv.org/abs/2303.08774'>[paper]</a> Visual Instruction Tuning. Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee arxiv/2304.08485, 2023 <a href='https://arxiv.org/abs/2304.08485'>[paper]</a> <a href='https://llava-vl.github.io/'>[website]</a> MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models Deyao Zhu* , Jun Chen*, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny arxiv2304.10592, 2023 <a href='https://arxiv.org/abs/2304.10592'>[paper]</a> <a href='https://minigpt-4.github.io/'>[website]</a> Large Multimodal Models: Notes on CVPR 2023 Tutorial. Chunyuan Li. CVPR Tutorial, 2023 <a href='https://datarelease.blob.core.windows.net/tutorial/vision_foundation_models_2023/slides/Chunyuan_cvpr2023_tutorial_lmm.pdf'>[📊 Slides]</a> <a href='https://arxiv.org/abs/2306.14895'>[📝 Notes]</a> <a href='https://youtu.be/mkI7EPD1vp8'>[🎥 YouTube]</a> <a href='https://www.bilibili.com/video/BV1Ng4y1T7v3/'>[🇨🇳 Bilibli]</a> <a href='https://vlp-tutorial.github.io/'>[🌐 Tutorial Webiste]</a>

:snowflake: Papers on Efficient Model Adaptation

:blue_book: Parameter-Efficient Methods

$\colorbox{powderblue}{Prompt}$ $\colorbox{tomato}{Adapter}$

[CoOp] Learning to Prompt for Vision-Language Models. Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu. IJCV 2022 <a href='https://link.springer.com/article/10.1007/s11263-022-01653-1'>[paper]</a> <a href='https://github.com/KaiyangZhou/CoOp'>[code]</a> [CoCoOp] Conditional Prompt Learning for Vision-Language Models.. Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu. CVPR 2022 <a href='https://arxiv.org/abs/2203.05557'>[paper]</a> <a href='https://github.com/KaiyangZhou/CoOp'>[code]</a> Prompt Distribution Learning. Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, Xinmei Tian. CVPR 2022 <a href='https://arxiv.org/abs/2205.03340'>[paper]</a> VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks. Yi-Lin Sung, Jaemin Cho, Mohit Bansal. CVPR, 2022 <a href='https://arxiv.org/abs/2112.06825'>[paper]</a> <a href='https://github.com/ylsung/VL_adapter'>[code]</a> [VPT] Visual Prompt Tuning. Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, Ser-Nam Lim. ECCV 2022 <a href='https://arxiv.org/abs/2203.12119'>[paper]</a> Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling. Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li. ECCV 2022 <a href='https://arxiv.org/abs/2111.03930'>[paper]</a> <a href='https://github.com/gaopengcuhk/Tip-Adapter'>[code]</a> CLIP-Adapter: Better Vision-Language Models with Feature Adapters. Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, Yu Qiao. arXiv:2110.04544 2022 <a href='https://arxiv.org/abs/2110.04544'>[paper]</a> <a href='https://github.com/gaopengcuhk/CLIP-Adapter'>[code]</a> Exploring Visual Prompts for Adapting Large-Scale Models. Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, Phillip Isola. arXiv:2203.17274 2022 <a href='https://arxiv.org/abs/2203.17274'>[paper]</a> <a href='https://github.com/hjbahng/visual_prompting'>[code]</a> <a href='https://hjbahng.github.io/visual_prompting/'>[project]</a> [NOAH] Neural Prompt Search. Yuanhan Zhang, Kaiyang Zhou, Ziwei Liu. arXiv:2206.04673, 2022 <a href='https://arxiv.org/abs/2206.04673'>[paper]</a> <a href='https://github.com/ZhangYuanhan-AI/NOAH'>[code]</a> Parameter-efficient Fine-tuning for Vision Transformers. Xuehai He, Chunyuan Li, Pengchuan Zhang, Jianwei Yang, Xin Eric Wang. arXiv:2203.16329, 2022 <a href='https://arxiv.org/abs/2203.16329'>[paper]</a> [CuPL] What does a platypus look like? Generating customized prompts for zero-shot image classification. Sarah Pratt, Rosanne Liu, Ali Farhadi. arXiv:2209.03320, 2022 <a href='https://arxiv.org/abs/2209.03320'>[paper]</a> <a href='https://github.com/sarahpratt/CuPL'>[code]</a> AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition. Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, Ping Luo. arXiv:2205.13535, 2022 <a href='https://arxiv.org/abs/2205.13535'>[paper]</a> <a href='https://github.com/ShoufaChen/AdaptFormer'>[code]</a> Vision Transformer Adapter for Dense Predictions. Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao. arXiv:2205.08534, 2022 <a href='https://arxiv.org/abs/2205.08534'>[paper]</a> <a href='https://github.com/czczup/ViT-Adapter'>[code]</a> Convolutional Bypasses Are Better Vision Transformer Adapters. Shibo Jie, Zhi-Hong Deng. arXiv:2207.07039, 2022 <a href='https://arxiv.org/abs/2207.07039'>[paper]</a> <a href='https://github.com/JieShibo/PETL-ViT'>[code]</a> Conv-Adapter: Exploring Parameter Efficient Transfer Learning for ConvNets. Hao Chen, Ran Tao, Han Zhang, Yidong Wang, Wei Ye, Jindong Wang, Guosheng Hu, Marios Savvides. arXiv:2208.07463, 2022 <a href='https://arxiv.org/abs/2208.07463'>[paper]</a> VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control. Zi-Yuan Hu, Yanyang Li, Michael R. Lyu, Liwei Wang. ICCV, 2023 <a href='https://arxiv.org/abs/2308.09804'>[paper]</a> <a href='https://github.com/HenryHZY/VL-PET'>[code]</a> <a href='https://henryhzy.github.io/VL-PET'>[project]</a> Contrastive Prompt Tuning Improves Generalization in Vision-Language Models. Anonymous. <a href='https://openreview.net/forum?id=g4JB0ksCrKe'>[OpenReview]</a> Variational Prompt Tuning Improves Generalization of Vision-Language Models. Anonymous. <a href='https://openreview.net/forum?id=t2qu5Hotedi'>[OpenReview]</a> MaPLe: Multi-modal Prompt Learning. Anonymous. <a href='https://openreview.net/forum?id=8mWlBArp1qx'>[OpenReview]</a> Prompt Tuning with Prompt-aligned Gradient for Vision-Language Models. Anonymous. <a href='https://openreview.net/forum?id=TSqKS0lQQA6'>[OpenReview]</a> Prompt Learning with Optimal Transport for Vision-Language Models. Anonymous. <a href='https://openreview.net/forum?id=zqwryBoXYnh'>[OpenReview]</a> Unified Vision and Language Prompt Learning. Anonymous. <a href='https://openreview.net/forum?id=1QQnYd02etI'>[OpenReview]</a> Rethinking the Value of Prompt Learning for Vision-Language Models. Anonymous. <a href='https://openreview.net/forum?id=1FsdIfRngtw'>[OpenReview]</a> Language-Aware Soft Prompting for Vision & Language Foundation Models. Anonymous. <a href='https://openreview.net/forum?id=W4HBwaybWedX'>[OpenReview]</a>

:blue_book: Other Efficient Model Adaptation Methods

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning for Action Recognition. Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, Hongsheng Li. NeurIPS 2022 <a href='https://arxiv.org/abs/2206.13559'>[paper]</a> [DetPro] Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model. Yu Du, Fangyun Wei, Zihe Zhang, Miaojing Shi, Yue Gao, Guoqi Li. CVPR 2022 <a href='https://arxiv.org/abs/2203.14940'>[paper]</a> <a href='https://github.com/dyabel/detpro'>[code]</a> Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting. Guangxing Han, Jiawei Ma, Shiyuan Huang, Long Chen, Rama Chellappa, and Shih-Fu Chang. arXiv:2204.07841 2022 <a href='https://arxiv.org/pdf/2204.07841.pdf'>[paper]</a> Learning to Compose Soft Prompts for Compositional Zero-Shot Learning. Nihal V. Nayak, Peilin Yu, Stephen H. Bach. arXiv:2204.03574 2022 <a href='https://arxiv.org/abs/2204.03574'>[paper]</a> <a href='https://github.com/BatsResearch/csp'>[code]</a> [WiSE-FT] Robust Fine-Tuning of Zero-Shot Models. Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, Ludwig Schmidt. CVPR 2022 <a href='https://arxiv.org/abs/2109.01903'>[paper]</a> <a href='https://github.com/mlfoundations/wise-ft'>[code]</a> [PAINT] Patching open-vocabulary models by interpolating weights. Gabriel Ilharco, Mitchell Wortsman, Samir Yitzhak Gadre, Shuran Song, Hannaneh Hajishirzi, Simon Kornblith, Ali Farhadi, Ludwig Schmidt. arXiv:2208.05592 2022 <a href='https://arxiv.org/abs/2208.05592'>[paper]</a> Learning to Prompt for Continual Learning. Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, Tomas Pfister. CVPR, 2022 <a href='https://arxiv.org/abs/2112.08654'>[paper]</a> <a href='https://github.com/google-research/l2p'>[code]</a> Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos. Muheng Li, Lei Chen, Yueqi Duan, Zhilan Hu, Jianjiang Feng, Jie Zhou, Jiwen Lu. CVPR, 2022 <a href='https://arxiv.org/abs/2203.14104'>[paper]</a> <a href='https://github.com/ttlmh/Bridge-Prompt'>[code]</a> Prompting Visual-Language Models for Efficient Video Understanding. Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, Weidi Xie. ECCV, 2022 <a href='https://arxiv.org/pdf/2112.04478v2.pdf'>[paper]</a> <a href='https://github.com/hjbahng/visual_prompting'>[code]</a> Polyhistor: Parameter-Efficient Multi-Task Adaptation for Dense Vision Tasks. Yen-Cheng Liu, Chih-Yao Ma, Junjiao Tian, Zijian He, Zsolt Kira. NeurIPS, 2022 <a href='https://arxiv.org/abs/2210.03265'>[paper]</a> <a href='https://ycliu93.github.io/projects/polyhistor.html'>[project]</a> Prompt Vision Transformer for Domain Generalization. Zangwei Zheng, Xiangyu Yue, Kai Wang, Yang You. arXiv:2208.08914, 2022 <a href='https://arxiv.org/abs/2208.08914'>[paper]</a> Domain Adaptation via Prompt Learning. Chunjiang Ge, Rui Huang, Mixue Xie, Zihang Lai, Shiji Song, Shuang Li, Gao Huang. arXiv:2202.06687, 2022 <a href='https://arxiv.org/abs/2202.06687'>[paper]</a> Visual Prompting via Image Inpainting. Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Globerson, Alexei A. Efros. arXiv:2209.00647, 2022 <a href='https://arxiv.org/abs/2209.00647'>[paper]</a> <a href='https://yossigandelsman.github.io/visual_prompt/'>[code]</a> Visual Prompting: Modifying Pixel Space to Adapt Pre-trained Models. Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, Phillip Isola. arXiv:2203.17274, 2022 <a href='https://arxiv.org/abs/2203.17274'>[paper]</a> <a href='https://hjbahng.github.io/visual_prompting/'>[code]</a>

:eye: Papers on Out-of-domain Generalization

:green_book: Surveys

Towards Out-Of-Distribution Generalization: A Survey. Zheyan Shen, Jiashuo Liu, Yue He, Xingxuan Zhang, Renzhe Xu, Han Yu, Peng Cui. arXiv:2108.13624, 2021 <a href='https://arxiv.org/abs/2108.13624'>[paper]</a> Generalizing to Unseen Domains: A Survey on Domain Generalization. Wang, Jindong, Cuiling Lan, Chang Liu, Yidong Ouyang, Wenjun Zeng, and Tao Qin. IJCAI 2021 <a href='https://arxiv.org/abs/2103.03097'>[paper]</a> Domain Generalization: A Survey. Zhou, Kaiyang, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. TPAMI 2022 <a href='https://arxiv.org/abs/2103.02503'>[paper]</a>

:green_book: Out-of-domain Generalization

Episodic Training for Domain Generalization. Da Li, Jianshu Zhang, Yongxin Yang, Cong Liu, Yi-Zhe Song, Timothy M. Hospedales. ICCV 2019 <a href='https://arxiv.org/abs/1902.00113'>[paper]</a><a href='https://github.com/HAHA-DL/Episodic-DG'>[code]</a> Domain Generalization by Solving Jigsaw Puzzles. Fabio Maria Carlucci, Antonio D'Innocente, Silvia Bucci, Barbara Caputo, Tatiana Tommasi. CVPR 2019 <a href='https://arxiv.org/abs/1903.06864'>[paper]</a><a href='https://github.com/fmcarlucci/JigenDG'>[code]</a> Invariant Risk Minimization. Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, David Lopez-Paz. arXiv:1907.02893, 2019 <a href='https://arxiv.org/abs/1907.02893'>[paper]</a> <a href='https://github.com/facebookresearch/InvariantRiskMinimization'>[code]</a> Deep Domain-Adversarial Image Generation for Domain Generalisation. Kaiyang Zhou, Yongxin Yang, Timothy Hospedales, Tao Xiang. AAAI 2020 <a href='https://arxiv.org/abs/2003.06054'>[paper] Learning to Generate Novel Domains for Domain Generalization. Kaiyang Zhou, Yongxin Yang, Timothy Hospedales, Tao Xiang. ECCV 2020 <a href='https://arxiv.org/abs/2007.03304'>[paper] Improved OOD Generalization via Adversarial Training and Pre-training. Mingyang Yi, Lu Hou, Jiacheng Sun, Lifeng Shang, Xin Jiang, Qun Liu, Zhi-Ming Ma. ICML 2021 <a href='https://arxiv.org/abs/2105.11144'>[paper]</a> Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization. Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, Percy Liang. ICLR 2020 <a href='https://arxiv.org/abs/1911.08731'>[paper]</a> <a href='https://github.com/facebookresearch/DomainBed/blob/51810e60c01fbfcf8f2db918b882e4445b8b6527/domainbed/algorithms.py#L445'>[code]</a> Domain Generalization with MixStyle. Kaiyang Zhou, Yongxin Yang, Yu Qiao, Tao Xiang. ICLR 2021 <a href='Domain Generalization with MixStyle'>[paper]</a><a href='https://github.com/KaiyangZhou/mixstyle-release'>[code]</a> In Search of Lost Domain Generalization. Ishaan Gulrajani, David Lopez-Paz. ICLR 2021 <a href='https://arxiv.org/abs/2007.01434'>[paper]</a> <a href='https://github.com/facebookresearch/DomainBed'>[code]</a> Out-of-Distribution Generalization via Risk Extrapolation (REx). David Krueger, Ethan Caballero, Joern-Henrik Jacobsen, Amy Zhang, Jonathan Binas, Dinghuai Zhang, Remi Le Priol, Aaron Courville. ICLR 2021 <a href='https://arxiv.org/abs/2003.00688'>[paper]</a> <a href='https://github.com/facebookresearch/DomainBed/blob/51810e60c01fbfcf8f2db918b882e4445b8b6527/domainbed/algorithms.py#L367'>[code]</a> SWAD: Domain Generalization by Seeking Flat Minima. Junbum Cha, Sanghyuk Chun, Kyungjae Lee, Han-Cheol Cho, Seunghyun Park, Yunsung Lee, Sungrae Park. NeurIPS 2021 <a href='https://arxiv.org/abs/2102.08604'>[paper]</a> <a href='https://github.com/khanrc/swad'>[code]</a> Domain Invariant Representation Learning with Domain Density Transformations. Marco Federici, Ryota Tomioka, Patrick Forré. NeurIPS 2021 <a href='https://arxiv.org/abs/2106.03783'>[paper]</a> <a href='https://github.com/atuannguyen/DIRT'>[code]</a> An Information-theoretic Approach to Distribution Shifts. A. Tuan Nguyen, Toan Tran, Yarin Gal, Atılım Güneş Baydin. NeurIPS 2021 <a href='https://arxiv.org/abs/2102.05082'>[paper]</a> <a href='https://github.com/mfederici/dsit'>[code]</a> Gradient Matching for Domain Generalization. Yuge Shi, Jeffrey Seely, Philip H.S. Torr, N. Siddharth, Awni Hannun, Nicolas Usunier, Gabriel Synnaeve. ICLR 2022 <a href='https://arxiv.org/abs/2104.09937'>[paper]</a> <a href='https://github.com/YugeTen/fish'>[code]</a> Visual Representation Learning Does Not Generalize Strongly Within the Same Domain. Lukas Schott, Julius von Kügelgen, Frederik Träuble, Peter Gehler, Chris Russell, Matthias Bethge, Bernhard Schölkopf, Francesco Locatello, Wieland Brendel. ICLR 2022 <a href='https://arxiv.org/abs/2107.08221'>[paper]</a> <a href='https://github.com/bethgelab/InDomainGeneralizationBenchmark'>[code]</a> MetaShift: A Dataset of Datasets for Evaluating Contextual Distribution Shifts and Training Conflicts. Weixin Liang and James Zou. ICLR 2022 <a href='https://arxiv.org/abs/2202.06523'>[paper]</a> <a href='https://github.com/Weixin-Liang/MetaShift'>[code]</a> Fishr: Invariant Gradient Variances for Out-of-Distribution Generalization. Alexandre Rame, Corentin Dancette, Matthieu Cord. ICML 2022 <a href='https://arxiv.org/abs/2109.02934'>[paper]</a> <a href='https://github.com/alexrame/fishr'>[code]</a> [MIRO] Domain Generalization by Mutual-Information Regularization with Pre-trained Models. Junbum Cha, Kyungjae Lee, Sungrae Park, Sanghyuk Chun. ECCV 2022 <a href='https://arxiv.org/abs/2203.10789'>[paper]</a> <a href='https://github.com/kakaobrain/miro'>[code]</a> Diverse Weight Averaging for Out-of-Distribution Generalization. Alexandre Rame, Matthieu Kirchmeyer, Thibaud Rahier, Alain Rakotomamonjy, Patrick Gallinari, Matthieu Cord. NeurIPS 2022 <a href='https://arxiv.org/abs/2205.09739'>[paper]</a> <a href='https://github.com/alexrame/diwa'>[code]</a> Wild-Time: A Benchmark of in-the-Wild Distribution Shift over Time. Huaxiu Yao, Caroline Choi, Bochuan Cao, Yoonho Lee, Pang Wei Koh, Chelsea Finn. NeurIPS 2022 <a href='https://arxiv.org/abs/2211.14238'>[paper]</a> <a href='https://github.com/huaxiuyao/Wild-Time'>[code]</a> WILDS: A Benchmark of in-the-Wild Distribution Shifts. Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, Percy Liang. PMLR 2022 <a href='https://arxiv.org/abs/2012.07421'>[paper]</a> <a href='https://wilds.stanford.edu/'>[code]</a> SIMPLE: Specialized Model-Sample Matching for Domain Generalization. Ziyue Li, Kan Ren, Xinyang Jiang, Yifei Shen, Haipeng Zhang, Dongsheng Li ICLR 2023 <a href='https://openreview.net/forum?id=BqrPeZ_e5P'>[paper]</a> <a href='https://seqml.github.io/simple/'>[code]</a> Sparse Mixture-of-Experts are Domain Generalizable Learners. Bo Li, Yifei Shen, Jingkang Yang, Yezhen Wang, Jiawei Ren, Tong Che, Jun Zhang, Ziwei Liu ICLR 2023 <a href='https://openreview.net/forum?id=q08xeIw1HA1'>[paper]</a> <a href='https://github.com/Luodian/Generalizable-Mixture-of-Experts'>[code]</a>

:green_book: Robust Models

Evaluating Model Robustness and Stability to Dataset Shift. Adarsh Subbaswamy, Roy Adams, Suchi Saria. AISTATS 2021 <a href='https://arxiv.org/abs/2010.15100'>[paper]</a> <a href='https://github.com/asubbaswamy/stability-analysis'>[code]</a> Plex: Towards Reliability using Pretrained Large Model Extensions. Dustin Tran, Jeremiah Liu, Michael W. Dusenberry, Du Phan, Mark Collier, Jie Ren, Kehang Han, Zi Wang, Zelda Mariet, Huiyi Hu, Neil Band, Tim G. J. Rudner, Karan Singhal, Zachary Nado, Joost van Amersfoort, Andreas Kirsch, Rodolphe Jenatton, Nithum Thain, Honglin Yuan, Kelly Buchanan, Kevin Murphy, D. Sculley, Yarin Gal, Zoubin Ghahramani, Jasper Snoek, Balaji Lakshminarayanan. First Workshop of Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML 2022 <a href='https://arxiv.org/abs/2207.07411'>[paper]</a> <a href='https://goo.gle/plex-code'>[code]</a>

:beers: Acknowledgements

We thank all the authors above for their great works! Related Reading List

If you find this repository useful, please consider giving a star :star: and cite the related papers :beer::

@article{li2022elevater,
    title={ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models},
    author={Li, Chunyuan and Liu, Haotian and Li, Liunian Harold and Zhang, Pengchuan and Aneja, Jyoti and Yang, Jianwei and Jin, Ping and Hu, Houdong and Liu, Zicheng and Lee, Yong Jae and Gao, Jianfeng},
    journal={Neural Information Processing Systems},
    year={2022}
}

@article{li2023multimodal,
    title={Multimodal Foundation Models: From Specialists to General-Purpose Assistants},
    author={Li, Chunyuan and Gan, Zhe and Yang, Zhengyuan and Yang, Jianwei and Li, Linjie and Wang, Lijuan and Gao, Jianfeng},
    journal={arXiv:2309.10020},
    year={2023}
}

@article{gan2022vision,
  title={Vision-language pre-training: Basics, recent advances, and future trends},
  author={Gan, Zhe and Li, Linjie and Li, Chunyuan and Wang, Lijuan and Liu, Zicheng and Gao, Jianfeng},
  journal={Foundations and Trends{\textregistered} in Computer Graphics and Vision},
  volume={14},
  number={3--4},
  pages={163--352},
  year={2022},
  publisher={Now Publishers, Inc.}
}