Awesome Vision-Language Models
<img src="./images/overview.png" width="96%" height="96%">

This is the repository of Vision-Language Models for Vision Tasks: A Survey, a systematic survey of VLM studies across visual recognition tasks including image classification, object detection, and semantic segmentation. For details, please refer to:
Vision-Language Models for Vision Tasks: A Survey [Paper]
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
🤩 Our paper has been selected for the TPAMI Top 50 Popular Paper List!
Feel free to open a pull request or contact us if you find any related papers that are not included here.
The process to submit a pull request is as follows:
- a. Fork the project into your own repository.
- b. Add the title, paper link, conference, and project/code link to README.md using the following format (a filled-in example row is shown after this list):
|[Title](Paper Link)|Conference|[Code/Project](Code/Project link)|
- c. Submit the pull request to this branch.
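For example, a hypothetical entry (the title, links, and venue below are placeholders, not a real paper) would look like:

|[Example VLM Paper](https://example.com/paper)|CVPR 2024|[Code](https://github.com/example/repo)|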
🔥 News
Last update on 2024/11/03
VLM Pre-training Methods
- [arXiv 2024] RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [Paper][Code]
- [CVPR 2024] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback [Paper][Code]
- [CVPR 2024] Do Vision and Language Encoders Represent the World Similarly? [Paper][Code]
- [CVPR 2024] Efficient Vision-Language Pre-training by Cluster Masking [Paper][Code]
- [CVPR 2024] Non-autoregressive Sequence-to-Sequence Vision-Language Models [Paper]
- [CVPR 2024] ViTamin: Designing Scalable Vision Models in the Vision-Language Era [Paper][Code]
- [CVPR 2024] Iterated Learning Improves Compositionality in Large Vision-Language Models [Paper]
- [CVPR 2024] FairCLIP: Harnessing Fairness in Vision-Language Learning [Paper][Code]
- [CVPR 2024] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [Paper][Code]
- [CVPR 2024] VILA: On Pre-training for Visual Language Models [Paper]
- [CVPR 2024] Generative Region-Language Pretraining for Open-Ended Object Detection [Paper][Code]
- [CVPR 2024] Enhancing Vision-Language Pre-training with Rich Supervisions [Paper]
- [ICLR 2024] Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [Paper][Code]
- [ICLR 2024] MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning [Paper][Code]
- [ICLR 2024] Retrieval-Enhanced Contrastive Vision-Text Models [Paper]
VLM Transfer Learning Methods
- [NeurIPS 2024] Historical Test-time Prompt Tuning for Vision Foundation Models [Paper]
- [NeurIPS 2024] AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation [Paper][Code]
- [IJCV 2024] Progressive Visual Prompt Learning with Contrastive Feature Re-formation [Paper][Code]
- [ECCV 2024] CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts [Paper][Code]
- [ECCV 2024] FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance [Paper][Code]
- [ECCV 2024] GalLoP: Learning Global and Local Prompts for Vision-Language Models [Paper]
- [ECCV 2024] Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models [Paper][Code]
- [CVPR 2024] Towards Better Vision-Inspired Vision-Language Models [Paper]
- [CVPR 2024] One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models [Paper][Code]
- [CVPR 2024] Any-Shift Prompting for Generalization over Distributions [Paper]
- [CVPR 2024] A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models [Paper][Code]
- [CVPR 2024] Anchor-based Robust Finetuning of Vision-Language Models [Paper]
- [CVPR 2024] Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners [Paper][Code]
- [CVPR 2024] Visual In-Context Prompting [Paper][Code]
- [CVPR 2024] TCP:Textual-based Class-aware Prompt tuning for Visual-Language Model [Paper][Code]
- [CVPR 2024] Efficient Test-Time Adaptation of Vision-Language Models [Paper][Code]
- [CVPR 2024] Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models [Paper][Code]
- [ICLR 2024] DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning [Paper][Code]
- [ICLR 2024] Nemesis: Normalizing the soft-prompt vectors of vision-language models [Paper]
- [ICLR 2024] Prompt Gradient Projection for Continual Learning [Paper]
- [ICLR 2024] An Image Is Worth 1000 Lies: Transferability of Adversarial Images across Prompts on Vision-Language Models [Paper]
- [ICLR 2024] Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching [Paper][Code]
- [ICLR 2024] Text-driven Prompt Generation for Vision-Language Models in Federated Learning [Paper]
- [ICLR 2024] Consistency-guided Prompt Learning for Vision-Language Models [Paper]
- [ICLR 2024] C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion [Paper]
- [arXiv 2024] Learning to Prompt Segment Anything Models [Paper]
VLM Knowledge Distillation for Detection
- [NeurIPS 2024] Open-Vocabulary Object Detection via Language Hierarchy [Paper]
- [CVPR 2024] RegionGPT: Towards Region Understanding Vision Language Model [Paper][Code]
- [ICLR 2024] LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors [Paper]
- [ICLR 2024] Ins-DetCLIP: Aligning Detection Model to Follow Human-Language Instruction [Paper]
VLM Knowledge Distillation for Segmentation
- [ICLR 2024] CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [Paper]
VLM Knowledge Distillation for Other Vision Tasks
- [ICLR 2024] FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition [Paper][Project]
- [ICLR 2024] AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection [Paper][Code]
Abstract
Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, vision-language models (VLMs) have been intensively investigated recently: they learn rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet, and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs that summarize the widely-adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
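As a concrete illustration of the zero-shot prediction described above, below is a minimal sketch of CLIP-style zero-shot image classification using the Hugging Face transformers library. The checkpoint name, image path, and class prompts are example placeholders, and this is not the exact setup used in the survey's benchmarks.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP model and its processor (checkpoint name is an example).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Class names are turned into text prompts; no task-specific training is needed.
class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in class_names]

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores; the highest-probability prompt is the prediction.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(class_names, probs[0].tolist())))
```

The same pattern extends to other recognition tasks by changing the text prompts, which is the zero-shot transfer property the survey reviews.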
Citation
If you find our work useful in your research, please consider citing:
@article{zhang2024vision,
title={Vision-language models for vision tasks: A survey},
author={Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2024},
publisher={IEEE}
}
Menu
- Datasets
- Vision-Language Pre-training Methods
- Vision-Language Model Transfer Learning Methods
- Vision-Language Model Knowledge Distillation Methods
Datasets
Datasets for VLM Pre-training
Dataset | Year | Number of Image-Text Pairs | Language | Project |
---|---|---|---|---|
SBU Caption | 2011 | 1M | English | Project |
COCO Caption | 2016 | 1.5M | English | Project |
Yahoo Flickr Creative Commons 100 Million | 2016 | 100M | English | Project |
Visual Genome | 2017 | 5.4M | English | Project |
Conceptual Captions 3M | 2018 | 3.3M | English | Project |
Localized Narratives | 2020 | 0.87M | English | Project |
Conceptual 12M | 2021 | 12M | English | Project |
Wikipedia-based Image Text | 2021 | 37.6M | 108 Languages | Project |
Red Caps | 2021 | 12M | English | Project |
LAION400M | 2021 | 400M | English | Project |
LAION5B | 2022 | 5B | Over 100 Languages | Project |
WuKong | 2022 | 100M | Chinese | Project |
CLIP | 2021 | 400M | English | - |
ALIGN | 2021 | 1.8B | English | - |
FILIP | 2021 | 300M | English | - |
WebLI | 2022 | 12B | English | - |
Datasets for VLM Evaluation
Image Classification
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
MNIST | 1998 | 10 | 60,000 | 10,000 | Accuracy | Project |
Caltech-101 | 2004 | 102 | 3,060 | 6,085 | Mean Per Class | Project |
PASCAL VOC 2007 | 2007 | 20 | 5,011 | 4,952 | 11-point mAP | Project |
Oxford 102 Flowers | 2008 | 102 | 2,040 | 6,149 | Mean Per Class | Project |
CIFAR-10 | 2009 | 10 | 50,000 | 10,000 | Accuracy | Project |
CIFAR-100 | 2009 | 100 | 50,000 | 10,000 | Accuracy | Project |
ImageNet-1k | 2009 | 1000 | 1,281,167 | 50,000 | Accuracy | Project |
SUN397 | 2010 | 397 | 19,850 | 19,850 | Accuracy | Project |
SVHN | 2011 | 10 | 73,257 | 26,032 | Accuracy | Project |
STL-10 | 2011 | 10 | 1,000 | 8,000 | Accuracy | Project |
GTSRB | 2011 | 43 | 26,640 | 12,630 | Accuracy | Project |
KITTI Distance | 2012 | 4 | 6,770 | 711 | Accuracy | Project |
IIIT5k | 2012 | 36 | 2,000 | 3,000 | Accuracy | Project |
Oxford-IIIT PETS | 2012 | 37 | 3,680 | 3,669 | Mean Per Class | Project |
Stanford Cars | 2013 | 196 | 8,144 | 8,041 | Accuracy | Project |
FGVC Aircraft | 2013 | 100 | 6,667 | 3,333 | Mean Per Class | Project |
Facial Emotion | 2013 | 8 | 32,140 | 3,574 | Accuracy | Project |
Rendered SST2 | 2013 | 2 | 7,792 | 1,821 | Accuracy | Project |
Describable Textures | 2014 | 47 | 3,760 | 1,880 | Accuracy | Project |
Food-101 | 2014 | 101 | 75,750 | 25,250 | Accuracy | Project |
Birdsnap | 2014 | 500 | 42,283 | 2,149 | Accuracy | Project |
RESISC45 | 2017 | 45 | 3,150 | 25,200 | Accuracy | Project |
CLEVR Counts | 2017 | 8 | 2,000 | 500 | Accuracy | Project |
PatchCamelyon | 2018 | 2 | 294,912 | 32,768 | Accuracy | Project |
EuroSAT | 2019 | 10 | 10,000 | 5,000 | Accuracy | Project |
Hateful Memes | 2020 | 2 | 8,500 | 500 | ROC AUC | Project |
Country211 | 2021 | 211 | 43,200 | 21,100 | Accuracy | Project |
Image-Text Retrieval
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
Flickr30k | 2014 | - | 31,783 | - | Recall | Project |
COCO Caption | 2015 | - | 82,783 | 5,000 | Recall | Project |
Action Recognition
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
UCF101 | 2012 | 101 | 9,537 | 1,794 | Accuracy | Project |
Kinetics700 | 2019 | 700 | 494,801 | 31,669 | Mean (top1, top5) | Project |
RareAct | 2020 | 122 | 7,607 | - | mWAP, mSAP | Project |
Object Detection
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
COCO 2014 Detection | 2014 | 80 | 83,000 | 41,000 | Box mAP | Project |
COCO 2017 Detection | 2017 | 80 | 118,000 | 5,000 | Box mAP | Project |
LVIS | 2019 | 1203 | 118,000 | 5,000 | Box mAP | Project |
ODinW | 2022 | 314 | 132,413 | 20,070 | Box mAP | Project |
Semantic Segmentation
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
PASCAL VOC 2012 | 2012 | 20 | 1,464 | 1,449 | mIoU | Project |
PASCAL Context | 2014 | 459 | 4,998 | 5,105 | mIoU | Project |
Cityscapes | 2016 | 19 | 2,975 | 500 | mIoU | Project |
ADE20k | 2017 | 150 | 25,574 | 2,000 | mIoU | Project |