Awesome Vision-Language Models

<img src="./images/overview.png" width="96%" height="96%">

This is the repository for Vision-Language Models for Vision Tasks: A Survey, a systematic survey of VLM studies across various visual recognition tasks, including image classification, object detection, and semantic segmentation. For details, please refer to:

Vision-Language Models for Vision Tasks: A Survey [Paper]

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

🤩 Our paper has been selected for the TPAMI Top 50 Popular Papers list!



Feel free to open a pull request or contact us if you find any related papers that are not included here.

To submit a pull request, add an entry in the following format:

  |[Title](Paper Link)|Conference|[Code/Project](Code/Project link)|

🔥 News

Last updated: 2024/11/03

VLM Pre-training Methods

VLM Transfer Learning Methods

VLM Knowledge Distillation for Detection

VLM Knowledge Distillation for Segmentation

VLM Knowledge Distillation for Other Vision Tasks

Abstract

Most visual recognition studies rely heavily on crowd-labelled data to train deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently; they learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and they enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, covering: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs, summarizing the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies on visual recognition.
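
The zero-shot prediction mentioned above works by matching an image embedding against text embeddings of candidate class prompts. Below is a minimal sketch of such zero-shot classification, assuming the Hugging Face transformers CLIP API; the checkpoint name, image path, and label set are illustrative placeholders, not part of the survey.

```python
# Minimal zero-shot classification sketch with a contrastively pre-trained VLM (CLIP-style).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")            # hypothetical input image
class_names = ["cat", "dog", "car"]          # any label set, supplied purely as text
prompts = [f"a photo of a {c}" for c in class_names]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-text similarity -> class probabilities
print(dict(zip(class_names, probs[0].tolist())))
```

Because the label set is given purely as text, it can be swapped for any other class names without re-training, which is what allows a single VLM to serve many recognition tasks.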

Citation

If you find our work useful in your research, please consider citing:

@article{zhang2024vision,
  title={Vision-language models for vision tasks: A survey},
  author={Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
  publisher={IEEE}
}

Menu

Datasets

Datasets for VLM Pre-training

| Dataset | Year | Num of Image-Text Pairs | Language | Project |
|---|---|---|---|---|
| SBU Caption | 2011 | 1M | English | Project |
| COCO Caption | 2016 | 1.5M | English | Project |
| Yahoo Flickr Creative Commons 100 Million | 2016 | 100M | English | Project |
| Visual Genome | 2017 | 5.4M | English | Project |
| Conceptual Captions 3M | 2018 | 3.3M | English | Project |
| Localized Narratives | 2020 | 0.87M | English | Project |
| Conceptual 12M | 2021 | 12M | English | Project |
| Wikipedia-based Image Text | 2021 | 37.6M | 108 Languages | Project |
| Red Caps | 2021 | 12M | English | Project |
| LAION400M | 2021 | 400M | English | Project |
| LAION5B | 2022 | 5B | Over 100 Languages | Project |
| WuKong | 2022 | 100M | Chinese | Project |
| CLIP | 2021 | 400M | English | - |
| ALIGN | 2021 | 1.8B | English | - |
| FILIP | 2021 | 300M | English | - |
| WebLI | 2022 | 12B | English | - |

Datasets for VLM Evaluation

Image Classification

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|---|---|---|---|---|---|---|
| MNIST | 1998 | 10 | 60,000 | 10,000 | Accuracy | Project |
| Caltech-101 | 2004 | 102 | 3,060 | 6,085 | Mean Per Class | Project |
| PASCAL VOC 2007 | 2007 | 20 | 5,011 | 4,952 | 11-point mAP | Project |
| Oxford 102 Flowers | 2008 | 102 | 2,040 | 6,149 | Mean Per Class | Project |
| CIFAR-10 | 2009 | 10 | 50,000 | 10,000 | Accuracy | Project |
| CIFAR-100 | 2009 | 100 | 50,000 | 10,000 | Accuracy | Project |
| ImageNet-1k | 2009 | 1000 | 1,281,167 | 50,000 | Accuracy | Project |
| SUN397 | 2010 | 397 | 19,850 | 19,850 | Accuracy | Project |
| SVHN | 2011 | 10 | 73,257 | 26,032 | Accuracy | Project |
| STL-10 | 2011 | 10 | 1,000 | 8,000 | Accuracy | Project |
| GTSRB | 2011 | 43 | 26,640 | 12,630 | Accuracy | Project |
| KITTI Distance | 2012 | 4 | 6,770 | 711 | Accuracy | Project |
| IIIT5k | 2012 | 36 | 2,000 | 3,000 | Accuracy | Project |
| Oxford-IIIT PETS | 2012 | 37 | 3,680 | 3,669 | Mean Per Class | Project |
| Stanford Cars | 2013 | 196 | 8,144 | 8,041 | Accuracy | Project |
| FGVC Aircraft | 2013 | 100 | 6,667 | 3,333 | Mean Per Class | Project |
| Facial Emotion | 2013 | 8 | 32,140 | 3,574 | Accuracy | Project |
| Rendered SST2 | 2013 | 2 | 7,792 | 1,821 | Accuracy | Project |
| Describable Textures | 2014 | 47 | 3,760 | 1,880 | Accuracy | Project |
| Food-101 | 2014 | 101 | 75,750 | 25,250 | Accuracy | Project |
| Birdsnap | 2014 | 500 | 42,283 | 2,149 | Accuracy | Project |
| RESISC45 | 2017 | 45 | 3,150 | 25,200 | Accuracy | Project |
| CLEVR Counts | 2017 | 8 | 2,000 | 500 | Accuracy | Project |
| PatchCamelyon | 2018 | 2 | 294,912 | 32,768 | Accuracy | Project |
| EuroSAT | 2019 | 10 | 10,000 | 5,000 | Accuracy | Project |
| Hateful Memes | 2020 | 2 | 8,500 | 500 | ROC AUC | Project |
| Country211 | 2021 | 211 | 43,200 | 21,100 | Accuracy | Project |

Image-Text Retrieval

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|---|---|---|---|---|---|---|
| Flickr30k | 2014 | - | 31,783 | - | Recall | Project |
| COCO Caption | 2015 | - | 82,783 | 5,000 | Recall | Project |

Action Recognition

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|---|---|---|---|---|---|---|
| UCF101 | 2012 | 101 | 9,537 | 1,794 | Accuracy | Project |
| Kinetics700 | 2019 | 700 | 494,801 | 31,669 | Mean (top1, top5) | Project |
| RareAct | 2020 | 122 | 7,607 | - | mWAP, mSAP | Project |

Object Detection

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|---|---|---|---|---|---|---|
| COCO 2014 Detection | 2014 | 80 | 83,000 | 41,000 | Box mAP | Project |
| COCO 2017 Detection | 2017 | 80 | 118,000 | 5,000 | Box mAP | Project |
| LVIS | 2019 | 1203 | 118,000 | 5,000 | Box mAP | Project |
| ODinW | 2022 | 314 | 132,413 | 20,070 | Box mAP | Project |

Semantic Segmentation

| Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
|---|---|---|---|---|---|---|
| PASCAL VOC 2012 | 2012 | 20 | 1,464 | 1,449 | mIoU | Project |
| PASCAL Context | 2014 | 459 | 4,998 | 5,105 | mIoU | Project |
| Cityscapes | 2016 | 19 | 2,975 | 500 | mIoU | Project |
| ADE20k | 2017 | 150 | 25,574 | 2,000 | mIoU | Project |

Vision-Language Pre-training Methods

Pre-training with Contrastive Objective

| Paper | Published in | Code/Project |
|---|---|---|
| CLIP: Learning Transferable Visual Models From Natural Language Supervision | ICML 2021 | Code |
| ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | ICML 2021 | - |
| OTTER: Data Efficient Language-Supervised Zero-Shot Recognition with Optimal Transport Distillation | arXiv 2021 | Code |
| Florence: A New Foundation Model for Computer Vision | arXiv 2021 | - |
| RegionClip: Region-based Language-Image Pretraining | arXiv 2021 | Code |
| DeCLIP: Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm | ICLR 2022 | Code |
| FILIP: Fine-grained Interactive Language-Image Pre-Training | ICLR 2022 | - |
| KELIP: Large-scale Bilingual Language-Image Contrastive Learning | ICLRW 2022 | Code |
| ZeroVL: Contrastive Vision-Language Pre-training with Limited Resources | ECCV 2022 | Code |
| SLIP: Self-supervision meets Language-Image Pre-training | ECCV 2022 | Code |
| UniCL: Unified Contrastive Learning in Image-Text-Label Space | CVPR 2022 | Code |
| LiT: Zero-Shot Transfer with Locked-image text Tuning | CVPR 2022 | Code |
| GroupViT: Semantic Segmentation Emerges from Text Supervision | CVPR 2022 | Code |
| PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining | NeurIPS 2022 | - |
| UniCLIP: Unified Framework for Contrastive Language-Image Pre-training | NeurIPS 2022 | - |
| K-LITE: Learning Transferable Visual Models with External Knowledge | NeurIPS 2022 | Code |
| FIBER: Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone | NeurIPS 2022 | Code |
| Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese | arXiv 2022 | Code |
| AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities | arXiv 2022 | Code |
| SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation | arXiv 2022 | Code |
| NLIP: Noise-robust Language-Image Pre-training | AAAI 2023 | - |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | ICLR 2023 | Project |
| HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention | ICLR 2023 | Code |
| CLIPPO: Image-and-Language Understanding from Pixels Only | CVPR 2023 | Code |
| RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-training | CVPR 2023 | - |
| DeAR: Debiasing Vision-Language Models with Additive Residuals | CVPR 2023 | - |
| Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training | CVPR 2023 | Code |
| LaCLIP: Improving CLIP Training with Language Rewrites | NeurIPS 2023 | Code |
| ALIP: Adaptive Language-Image Pre-training with Synthetic Caption | ICCV 2023 | Code |
| GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training | ICCV 2023 | - |
| CLIPpy: Perceptual Grouping in Contrastive Vision-Language Models | ICCV 2023 | - |
| Efficient Vision-Language Pre-training by Cluster Masking | CVPR 2024 | Code |
| ViTamin: Designing Scalable Vision Models in the Vision-Language Era | CVPR 2024 | Code |
| Iterated Learning Improves Compositionality in Large Vision-Language Models | CVPR 2024 | - |
| FairCLIP: Harnessing Fairness in Vision-Language Learning | CVPR 2024 | Code |
| Retrieval-Enhanced Contrastive Vision-Text Models | ICLR 2024 | - |
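
The methods in this table share a common core: image and text encoders are trained so that embeddings of matched image-text pairs are pulled together while mismatched pairs within a batch are pushed apart. Below is a minimal sketch of the symmetric image-text InfoNCE loss popularized by CLIP; the function name, tensor shapes, and temperature value are illustrative assumptions rather than code from any listed paper.

```python
# Symmetric image-text contrastive (InfoNCE) loss sketch for CLIP-style pre-training.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    # image_feats, text_feats: (batch, dim) embeddings from the two encoders
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)   # diagonal = matched pairs
    loss_i2t = F.cross_entropy(logits, targets)                    # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)                # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```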

Pre-training with Generative Objective

| Paper | Published in | Code/Project |
|---|---|---|
| FLAVA: A Foundational Language And Vision Alignment Model | CVPR 2022 | Code |
| CoCa: Contrastive Captioners are Image-Text Foundation Models | arXiv 2022 | Code |
| Too Large; Data Reduction for Vision-Language Pre-Training | arXiv 2023 | Code |
| SAM: Segment Anything | arXiv 2023 | Code |
| SEEM: Segment Everything Everywhere All at Once | arXiv 2023 | Code |
| Semantic-SAM: Segment and Recognize Anything at Any Granularity | arXiv 2023 | Code |
| Generative Region-Language Pretraining for Open-Ended Object Detection | CVPR 2024 | Code |
| InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | CVPR 2024 | Code |
| VILA: On Pre-training for Visual Language Models | CVPR 2024 | - |
| Enhancing Vision-Language Pre-training with Rich Supervisions | CVPR 2024 | - |
| Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization | ICLR 2024 | Code |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | ICLR 2024 | Code |

Pre-training with Alignment Objective

| Paper | Published in | Code/Project |
|---|---|---|
| GLIP: Grounded Language-Image Pre-training | CVPR 2022 | Code |
| DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection | NeurIPS 2022 | - |
| nCLIP: Non-Contrastive Learning Meets Language-Image Pre-Training | CVPR 2023 | Code |
| Do Vision and Language Encoders Represent the World Similarly? | CVPR 2024 | Code |
| Non-autoregressive Sequence-to-Sequence Vision-Language Models | CVPR 2024 | - |
| RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | CVPR 2024 | Code |
| RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness | arXiv 2024 | Code |

Vision-Language Model Transfer Learning Methods

Transfer with Prompt Tuning

Transfer with Text Prompt Tuning

| Paper | Published in | Code/Project |
|---|---|---|
| CoOp: Learning to Prompt for Vision-Language Models | IJCV 2022 | Code |
| CoCoOp: Conditional Prompt Learning for Vision-Language Models | CVPR 2022 | Code |
| ProDA: Prompt Distribution Learning | CVPR 2022 | - |
| DenseClip: Language-Guided Dense Prediction with Context-Aware Prompting | CVPR 2022 | Code |
| TPT: Test-time prompt tuning for zero-shot generalization in vision-language models | NeurIPS 2022 | Code |
| DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations | NeurIPS 2022 | Code |
| CPL: Counterfactual Prompt Learning for Vision and Language Models | EMNLP 2022 | Code |
| Bayesian Prompt Learning for Image-Language Model Generalization | arXiv 2022 | - |
| UPL: Unsupervised Prompt Learning for Vision-Language Models | arXiv 2022 | Code |
| ProGrad: Prompt-aligned Gradient for Prompt Tuning | arXiv 2022 | Code |
| SoftCPT: Prompt Tuning with Soft Context Sharing for Vision-Language Models | arXiv 2022 | Code |
| SubPT: Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models | TCSVT 2023 | Code |
| LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models | CVPR 2023 | Code |
| LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition | arXiv 2023 | Code |
| Texts as Images in Prompt Tuning for Multi-Label Image Recognition | CVPR 2023 | Code |
| Visual-Language Prompt Tuning with Knowledge-guided Context Optimization | CVPR 2023 | Code |
| Learning to Name Classes for Vision and Language Models | CVPR 2023 | - |
| PLOT: Prompt Learning with Optimal Transport for Vision-Language Models | ICLR 2023 | Code |
| CuPL: What does a platypus look like? Generating customized prompts for zero-shot image classification | ICCV 2023 | Code |
| ProTeCt: Prompt Tuning for Hierarchical Consistency | arXiv 2023 | - |
| Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning | arXiv 2023 | Code |
| Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? | ICCV 2023 | Code |
| Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models | ICCV 2023 | - |
| Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models | ICCV 2023 | - |
| Read-only Prompt Optimization for Vision-Language Few-shot Learning | ICCV 2023 | Code |
| Bayesian Prompt Learning for Image-Language Model Generalization | ICCV 2023 | Code |
| Distribution-Aware Prompt Tuning for Vision-Language Models | ICCV 2023 | Code |
| LPT: Long-Tailed Prompt Tuning For Image Classification | ICCV 2023 | Code |
| Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning | ICCV 2023 | Code |
| Efficient Test-Time Prompt Tuning for Vision-Language Models | arXiv 2024 | - |
| Text-driven Prompt Generation for Vision-Language Models in Federated Learning | ICLR 2024 | - |
| C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion | ICLR 2024 | - |
| Prompt Gradient Projection for Continual Learning | ICLR 2024 | - |
| Nemesis: Normalizing the soft-prompt vectors of vision-language models | ICLR 2024 | Code |
| DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning | ICLR 2024 | Code |
| TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model | CVPR 2024 | Code |
| One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models | CVPR 2024 | Code |
| Any-Shift Prompting for Generalization over Distributions | CVPR 2024 | - |
| Towards Better Vision-Inspired Vision-Language Models | CVPR 2024 | - |
| Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models | ECCV 2024 | Code |
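
Most text prompt tuning methods in this table follow the CoOp recipe: a few learnable context vectors are prepended to the embedded class names, and only these vectors are optimized while the pre-trained VLM stays frozen. Below is a schematic sketch under that assumption; the module names, dimensions, and the simplified text-encoding interface are illustrative and not taken from any particular codebase.

```python
# Schematic CoOp-style text prompt tuning: learn shared context vectors, keep the VLM frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    def __init__(self, class_name_embeds, n_ctx=16, dim=512):
        super().__init__()
        # (num_classes, name_len, dim) frozen token embeddings of the class names
        self.register_buffer("class_name_embeds", class_name_embeds)
        # learnable context vectors shared across classes, randomly initialized
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self):
        n_cls = self.class_name_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)        # (n_cls, n_ctx, dim)
        return torch.cat([ctx, self.class_name_embeds], dim=1)   # prompts fed to the frozen text encoder

def classification_loss(image_feats, text_feats, labels, temperature=0.01):
    # image_feats: (batch, dim) from the frozen image encoder
    # text_feats:  (n_cls, dim) from the frozen text encoder run on the learned prompts
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature
    return F.cross_entropy(logits, labels)   # gradients flow only into PromptLearner.ctx
```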

Transfer with Visual Prompt Tuning

| Paper | Published in | Code/Project |
|---|---|---|
| Exploring Visual Prompts for Adapting Large-Scale Models | arXiv 2022 | Code |
| Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification | arXiv 2023 | - |
| Fine-Grained Visual Prompting | arXiv 2023 | - |
| LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models | ICCV 2023 | Code |
| Progressive Visual Prompt Learning with Contrastive Feature Re-formation | IJCV 2024 | Code |
| Visual In-Context Prompting | CVPR 2024 | Code |
| FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance | ECCV 2024 | Code |

Transfer with Text and Visual Prompt Tuning

| Paper | Published in | Code/Project |
|---|---|---|
| UPT: Unified Vision and Language Prompt Learning | arXiv 2022 | Code |
| MVLPT: Multitask Vision-Language Prompt Tuning | arXiv 2022 | Code |
| CAVPT: Dual Modality Prompt Tuning for Vision-Language Pre-Trained Model | arXiv 2022 | Code |
| MaPLe: Multi-modal Prompt Learning | CVPR 2023 | Code |
| Learning to Prompt Segment Anything Models | arXiv 2024 | - |
| CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts | ECCV 2024 | Code |
| An Image Is Worth 1000 Lies: Transferability of Adversarial Images across Prompts on Vision-Language Models | ICLR 2024 | - |
| GalLoP: Learning Global and Local Prompts for Vision-Language Models | ECCV 2024 | - |

Transfer with Feature Adapter

| Paper | Published in | Code/Project |
|---|---|---|
| Clip-Adapter: Better Vision-Language Models with Feature Adapters | arXiv 2021 | Code |
| Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification | ECCV 2022 | Code |
| SVL-Adapter: Self-Supervised Adapter for Vision-Language Pretrained Models | BMVC 2022 | Code |
| CLIPPR: Improving Zero-Shot Models with Label Distribution Priors | arXiv 2022 | Code |
| SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification | arXiv 2022 | - |
| SuS-X: Training-Free Name-Only Transfer of Vision-Language Models | ICCV 2023 | Code |
| VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control | ICCV 2023 | Code |
| SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More | arXiv 2023 | Code |
| Segment Anything in High Quality | arXiv 2023 | Code |
| HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding | arXiv 2023 | Code |
| CLAP: Contrastive Learning with Augmented Prompts for Robustness on Pretrained Vision-Language Models | arXiv 2023 | - |
| AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation | NeurIPS 2024 | Code |
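
Feature adapter methods such as CLIP-Adapter take a different route from prompt tuning: they attach a small bottleneck network to the frozen VLM features and blend its output back with a residual ratio, training only the adapter. Below is a schematic sketch under that reading; dimensions and the residual ratio are illustrative defaults, not values from any listed paper's code.

```python
# Schematic CLIP-Adapter-style residual feature adapter over frozen VLM features.
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    def __init__(self, dim=512, reduction=4, residual_ratio=0.2):
        super().__init__()
        self.residual_ratio = residual_ratio
        self.bottleneck = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )

    def forward(self, frozen_feats):
        adapted = self.bottleneck(frozen_feats)
        # blend adapted features with the original frozen features
        mixed = self.residual_ratio * adapted + (1 - self.residual_ratio) * frozen_feats
        return F.normalize(mixed, dim=-1)
```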

Transfer with Other Methods

| Paper | Published in | Code/Project |
|---|---|---|
| VT-Clip: Enhancing Vision-Language Models with Visual-guided Texts | arXiv 2021 | - |
| Wise-FT: Robust fine-tuning of zero-shot models | CVPR 2022 | Code |
| MaskCLIP: Extract Free Dense Labels from CLIP | ECCV 2022 | Code |
| MUST: Masked Unsupervised Self-training for Label-free Image Classification | ICLR 2023 | Code |
| CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention | AAAI 2023 | Code |
| Semantic Prompt for Few-Shot Image Recognition | CVPR 2023 | - |
| Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners | CVPR 2023 | Code |
| Task Residual for Tuning Vision-Language Models | CVPR 2023 | Code |
| Deeply Coupled Cross-Modal Prompt Learning | ACL 2023 | Code |
| Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation | arXiv 2023 | - |
| Personalize Segment Anything Model with One Shot | arXiv 2023 | Code |
| Chils: Zero-shot image classification with hierarchical label sets | ICML 2023 | Code |
| Improving Zero-shot Generalization and Robustness of Multi-modal Models | CVPR 2023 | Code |
| Exploiting Category Names for Few-Shot Classification with Vision-Language Models | ICLR W 2023 | - |
| Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models | arXiv 2023 | Code |
| Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models | ICCV 2023 | Code |
| PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization | ICCV 2023 | Code |
| PADCLIP: Pseudo-labeling with Adaptive Debiasing in CLIP for Unsupervised Domain Adaptation | ICCV 2023 | - |
| Black Box Few-Shot Adaptation for Vision-Language models | ICCV 2023 | Code |
| AD-CLIP: Adapting Domains in Prompt Space Using CLIP | ICCVW 2023 | - |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | arXiv 2023 | Code |
| Language Models as Black-Box Optimizers for Vision-Language Models | arXiv 2023 | - |
| Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching | ICLR 2024 | Code |
| Consistency-guided Prompt Learning for Vision-Language Models | ICLR 2024 | - |
| Efficient Test-Time Adaptation of Vision-Language Models | CVPR 2024 | Code |
| Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models | CVPR 2024 | Code |
| A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models | CVPR 2024 | Code |
| Anchor-based Robust Finetuning of Vision-Language Models | CVPR 2024 | - |
| Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners | CVPR 2024 | Code |

Vision-Language Model Knowledge Distillation Methods

Knowledge Distillation for Object Detection

| Paper | Published in | Code/Project |
|---|---|---|
| ViLD: Open-vocabulary Object Detection via Vision and Language Knowledge Distillation | ICLR 2022 | Code |
| DetPro: Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model | CVPR 2022 | Code |
| XPM: Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling | CVPR 2022 | Code |
| Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection | NeurIPS 2022 | Code |
| PromptDet: Towards Open-vocabulary Detection using Uncurated Images | ECCV 2022 | Code |
| PB-OVD: Open Vocabulary Object Detection with Pseudo Bounding-Box Labels | ECCV 2022 | Code |
| OV-DETR: Open-Vocabulary DETR with Conditional Matching | ECCV 2022 | Code |
| Detic: Detecting Twenty-thousand Classes using Image-level Supervision | ECCV 2022 | Code |
| OWL-ViT: Simple Open-Vocabulary Object Detection with Vision Transformers | ECCV 2022 | Code |
| VL-PLM: Exploiting Unlabeled Data with Vision and Language Models for Object Detection | ECCV 2022 | Code |
| ZSD-YOLO: Zero-shot Object Detection Through Vision-Language Embedding Alignment | arXiv 2022 | Code |
| HierKD: Open-Vocabulary One-Stage Detection with Hierarchical Visual-Language Knowledge Distillation | arXiv 2022 | Code |
| VLDet: Learning Object-Language Alignments for Open-Vocabulary Object Detection | ICLR 2023 | Code |
| F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models | ICLR 2023 | Code |
| CondHead: Learning to Detect and Segment for Open Vocabulary Object Detection | CVPR 2023 | - |
| Aligning Bag of Regions for Open-Vocabulary Object Detection | CVPR 2023 | Code |
| Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers | CVPR 2023 | Code |
| Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection | CVPR 2023 | Code |
| CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching | CVPR 2023 | Code |
| DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment | CVPR 2023 | - |
| Detecting Everything in the Open World: Towards Universal Object Detection | CVPR 2023 | Code |
| CapDet: Unifying Dense Captioning and Open-World Detection Pretraining | CVPR 2023 | - |
| Contextual Object Detection with Multimodal Large Language Models | arXiv 2023 | Code |
| Building One-class Detector for Anything: Open-vocabulary Zero-shot OOD Detection Using Text-image Models | arXiv 2023 | Code |
| EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment | ICCV 2023 | Code |
| Improving Pseudo Labels for Open-Vocabulary Object Detection | arXiv 2023 | - |
| RegionGPT: Towards Region Understanding Vision Language Model | CVPR 2024 | Code |
| LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors | ICLR 2024 | - |
| Ins-DetCLIP: Aligning Detection Model to Follow Human-Language Instruction | ICLR 2024 | - |

Knowledge Distillation for Semantic Segmentation

| Paper | Published in | Code/Project |
|---|---|---|
| SSIW: Semantic Segmentation In-the-Wild Without Seeing Any Segmentation Examples | arXiv 2021 | - |
| ReCo: Retrieve and Co-segment for Zero-shot Transfer | NeurIPS 2022 | Code |
| CLIMS: Cross Language Image Matching for Weakly Supervised Semantic Segmentation | CVPR 2022 | Code |
| CLIPSeg: Image Segmentation Using Text and Image Prompts | CVPR 2022 | Code |
| ZegFormer: Decoupling Zero-Shot Semantic Segmentation | CVPR 2022 | Code |
| LSeg: Language-driven Semantic Segmentation | ICLR 2022 | Code |
| ZSSeg: A Simple Baseline for Open-Vocabulary Semantic Segmentation with Pre-trained Vision-language Model | ECCV 2022 | Code |
| OpenSeg: Scaling Open-Vocabulary Image Segmentation with Image-Level Labels | ECCV 2022 | Code |
| Fusioner: Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models | BMVC 2022 | Code |
| OVSeg: Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP | CVPR 2023 | Code |
| ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation | CVPR 2023 | Code |
| CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation | CVPR 2023 | Code |
| FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation | CVPR 2023 | Code |
| Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations | CVPR 2023 | Code |
| Exploring Open-Vocabulary Semantic Segmentation without Human Labels | arXiv 2023 | - |
| OpenVIS: Open-vocabulary Video Instance Segmentation | arXiv 2023 | - |
| Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation | arXiv 2023 | - |
| Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation | arXiv 2023 | Code |
| Plug-and-Play, Dense-Label-Free Extraction of Open-Vocabulary Semantic Segmentation from Vision-Language Models | arXiv 2023 | - |
| SegPrompt: Boosting Open-World Segmentation via Category-level Prompt Learning | ICCV 2023 | Code |
| ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation | arXiv 2023 | - |
| Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP | arXiv 2023 | Code |
| CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction | ICLR 2024 | - |

Knowledge Distillation for Other Tasks

| Paper | Published in | Code/Project |
|---|---|---|
| Controlling Vision-Language Models for Universal Image Restoration | arXiv 2023 | Code |
| FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition | ICLR 2024 | Project |
| AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection | ICLR 2024 | Code |