Awesome Vision-Language Models
<img src="./images/overview.png" width="96%" height="96%">

This is the repository of Vision-Language Models for Vision Tasks: A Survey, a systematic survey of VLM studies across visual recognition tasks including image classification, object detection, and semantic segmentation. For details, please refer to:
Vision-Language Models for Vision Tasks: A Survey [Paper]
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
🤩 Our paper has been selected for the TPAMI Top 50 Popular Paper List!
Feel free to open a pull request or contact us if you find any related papers that are not included here.
The process to submit a pull request is as follows:
- a. Fork the project into your own repository.
- b. Add the title, paper link, conference, and project/code link to README.md using the following format (a filled-in example row is shown after this list):
|[Title](Paper Link)|Conference|[Code/Project](Code/Project link)|
- c. Submit the pull request to this branch.
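For example, a hypothetical entry (the title, links, and venue below are placeholders, not a real paper) would look like:

|[Example VLM Paper](https://example.com/paper)|CVPR 2024|[Code](https://github.com/example/repo)|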
🔥 News
Last update on 2024/11/03
VLM Pre-training Methods
- [arXiv 2024] RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [Paper][Code]
- [CVPR 2024] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback [Paper][Code]
- [CVPR 2024] Do Vision and Language Encoders Represent the World Similarly? [Paper][Code]
- [CVPR 2024] Efficient Vision-Language Pre-training by Cluster Masking [Paper][Code]
- [CVPR 2024] Non-autoregressive Sequence-to-Sequence Vision-Language Models [Paper]
- [CVPR 2024] ViTamin: Designing Scalable Vision Models in the Vision-Language Era [Paper][Code]
- [CVPR 2024] Iterated Learning Improves Compositionality in Large Vision-Language Models [Paper]
- [CVPR 2024] FairCLIP: Harnessing Fairness in Vision-Language Learning [Paper][Code]
- [CVPR 2024] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [Paper][Code]
- [CVPR 2024] VILA: On Pre-training for Visual Language Models [Paper]
- [CVPR 2024] Generative Region-Language Pretraining for Open-Ended Object Detection [Paper][Code]
- [CVPR 2024] Enhancing Vision-Language Pre-training with Rich Supervisions [Paper]
- [ICLR 2024] Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [Paper][Code]
- [ICLR 2024] MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning [Paper][Code]
- [ICLR 2024] Retrieval-Enhanced Contrastive Vision-Text Models [Paper]
VLM Transfer Learning Methods
- [NeurIPS 2024] Historical Test-time Prompt Tuning for Vision Foundation Models [Paper]
- [NeurIPS 2024] AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation [Paper][Code]
- [IJCV 2024] Progressive Visual Prompt Learning with Contrastive Feature Re-formation [Paper][Code]
- [ECCV 2024] CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts [Paper][Code]
- [ECCV 2024] FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance [Paper][Code]
- [ECCV 2024] GalLoP: Learning Global and Local Prompts for Vision-Language Models [Paper]
- [ECCV 2024] Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models [Paper][Code]
- [CVPR 2024] Towards Better Vision-Inspired Vision-Language Models [Paper]
- [CVPR 2024] One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models [Paper][Code]
- [CVPR 2024] Any-Shift Prompting for Generalization over Distributions [Paper]
- [CVPR 2024] A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models [Paper][Code]
- [CVPR 2024] Anchor-based Robust Finetuning of Vision-Language Models [Paper]
- [CVPR 2024] Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners [Paper][Code]
- [CVPR 2024] Visual In-Context Prompting [Paper][Code]
- [CVPR 2024] TCP:Textual-based Class-aware Prompt tuning for Visual-Language Model [Paper][Code]
- [CVPR 2024] Efficient Test-Time Adaptation of Vision-Language Models [Paper][Code]
- [CVPR 2024] Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models [Paper][Code]
- [ICLR 2024] DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning [Paper][Code]
- [ICLR 2024] Nemesis: Normalizing the soft-prompt vectors of vision-language models [Paper]
- [ICLR 2024] Prompt Gradient Projection for Continual Learning [Paper]
- [ICLR 2024] An Image Is Worth 1000 Lies: Transferability of Adversarial Images across Prompts on Vision-Language Models [Paper]
- [ICLR 2024] Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching [Paper][Code]
- [ICLR 2024] Text-driven Prompt Generation for Vision-Language Models in Federated Learning [Paper]
- [ICLR 2024] Consistency-guided Prompt Learning for Vision-Language Models [Paper]
- [ICLR 2024] C-TPT: Calibrated Test-Time Prompt Tuning for Vision-Language Models via Text Feature Dispersion [Paper]
- [arXiv 2024] Learning to Prompt Segment Anything Models [Paper]
VLM Knowledge Distillation for Detection
- [NeurIPS 2024] Open-Vocabulary Object Detection via Language Hierarchy [Paper]
- [CVPR 2024] RegionGPT: Towards Region Understanding Vision Language Model [Paper][Code]
- [ICLR 2024] LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors [Paper]
- [ICLR 2024] Ins-DetCLIP: Aligning Detection Model to Follow Human-Language Instruction [Paper]
VLM Knowledge Distillation for Segmentation
- [ICLR 2024] CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction [Paper]
VLM Knowledge Distillation for Other Vision Tasks
- [ICLR 2024] FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition [Paper][Project]
- [ICLR 2024] AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection [Paper][Code]
Abstract
Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, vision-language models (VLMs) have been intensively investigated recently: they learn rich vision-language correlation from web-scale image-text pairs that are almost infinitely available on the Internet, and enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs that summarize the widely-adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets in VLM pre-training and evaluation; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.
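As a concrete illustration of the zero-shot prediction described above, below is a minimal sketch of CLIP-style zero-shot image classification using the Hugging Face transformers library. The checkpoint name, image path, and class prompts are example placeholders, and this is not the exact setup used in the survey's benchmarks.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP model and its processor (checkpoint name is an example).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Class names are turned into text prompts; no task-specific training is needed.
class_names = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in class_names]

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores; the highest-probability prompt is the prediction.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(class_names, probs[0].tolist())))
```

The same pattern extends to other recognition tasks by changing the text prompts, which is the zero-shot transfer property the survey reviews.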
Citation
If you find our work useful in your research, please consider citing:
@article{zhang2024vision,
title={Vision-language models for vision tasks: A survey},
author={Zhang, Jingyi and Huang, Jiaxing and Jin, Sheng and Lu, Shijian},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2024},
publisher={IEEE}
}
Menu
- Datasets
- Vision-Language Pre-training Methods
- Vision-Language Model Transfer Learning Methods
- Vision-Language Model Knowledge Distillation Methods
Datasets
Datasets for VLM Pre-training
Dataset | Year | Number of Image-Text Pairs | Language | Project |
---|---|---|---|---|
SBU Caption | 2011 | 1M | English | Project |
COCO Caption | 2016 | 1.5M | English | Project |
Yahoo Flickr Creative Commons 100 Million | 2016 | 100M | English | Project |
Visual Genome | 2017 | 5.4M | English | Project |
Conceptual Captions 3M | 2018 | 3.3M | English | Project |
Localized Narratives | 2020 | 0.87M | English | Project |
Conceptual 12M | 2021 | 12M | English | Project |
Wikipedia-based Image Text | 2021 | 37.6M | 108 Languages | Project |
Red Caps | 2021 | 12M | English | Project |
LAION400M | 2021 | 400M | English | Project |
LAION5B | 2022 | 5B | Over 100 Languages | Project |
WuKong | 2022 | 100M | Chinese | Project |
CLIP | 2021 | 400M | English | - |
ALIGN | 2021 | 1.8B | English | - |
FILIP | 2021 | 300M | English | - |
WebLI | 2022 | 12B | English | - |
Datasets for VLM Evaluation
Image Classification
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
MNIST | 1998 | 10 | 60,000 | 10,000 | Accuracy | Project |
Caltech-101 | 2004 | 102 | 3,060 | 6,085 | Mean Per Class | Project |
PASCAL VOC 2007 | 2007 | 20 | 5,011 | 4,952 | 11-point mAP | Project |
Oxford 102 Flowers | 2008 | 102 | 2,040 | 6,149 | Mean Per Class | Project |
CIFAR-10 | 2009 | 10 | 50,000 | 10,000 | Accuracy | Project |
CIFAR-100 | 2009 | 100 | 50,000 | 10,000 | Accuracy | Project |
ImageNet-1k | 2009 | 1000 | 1,281,167 | 50,000 | Accuracy | Project |
SUN397 | 2010 | 397 | 19,850 | 19,850 | Accuracy | Project |
SVHN | 2011 | 10 | 73,257 | 26,032 | Accuracy | Project |
STL-10 | 2011 | 10 | 1,000 | 8,000 | Accuracy | Project |
GTSRB | 2011 | 43 | 26,640 | 12,630 | Accuracy | Project |
KITTI Distance | 2012 | 4 | 6,770 | 711 | Accuracy | Project |
IIIT5k | 2012 | 36 | 2,000 | 3,000 | Accuracy | Project |
Oxford-IIIT PETS | 2012 | 37 | 3,680 | 3,669 | Mean Per Class | Project |
Stanford Cars | 2013 | 196 | 8,144 | 8,041 | Accuracy | Project |
FGVC Aircraft | 2013 | 100 | 6,667 | 3,333 | Mean Per Class | Project |
Facial Emotion | 2013 | 8 | 32,140 | 3,574 | Accuracy | Project |
Rendered SST2 | 2013 | 2 | 7,792 | 1,821 | Accuracy | Project |
Describable Textures | 2014 | 47 | 3,760 | 1,880 | Accuracy | Project |
Food-101 | 2014 | 101 | 75,750 | 25,250 | Accuracy | Project |
Birdsnap | 2014 | 500 | 42,283 | 2,149 | Accuracy | Project |
RESISC45 | 2017 | 45 | 3,150 | 25,200 | Accuracy | Project |
CLEVR Counts | 2017 | 8 | 2,000 | 500 | Accuracy | Project |
PatchCamelyon | 2018 | 2 | 294,912 | 32,768 | Accuracy | Project |
EuroSAT | 2019 | 10 | 10,000 | 5,000 | Accuracy | Project |
Hateful Memes | 2020 | 2 | 8,500 | 500 | ROC AUC | Project |
Country211 | 2021 | 211 | 43,200 | 21,100 | Accuracy | Project |
Image-Text Retrieval
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
Flickr30k | 2014 | - | 31,783 | - | Recall | Project |
COCO Caption | 2015 | - | 82,783 | 5,000 | Recall | Project |
Action Recognition
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
UCF101 | 2012 | 101 | 9,537 | 1,794 | Accuracy | Project |
Kinetics700 | 2019 | 700 | 494,801 | 31,669 | Mean (top1, top5) | Project |
RareAct | 2020 | 122 | 7,607 | - | mWAP, mSAP | Project |
Object Detection
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
COCO 2014 Detection | 2014 | 80 | 83,000 | 41,000 | Box mAP | Project |
COCO 2017 Detection | 2017 | 80 | 118,000 | 5,000 | Box mAP | Project |
LVIS | 2019 | 1203 | 118,000 | 5,000 | Box mAP | Project |
ODinW | 2022 | 314 | 132,413 | 20,070 | Box mAP | Project |
Semantic Segmentation
Dataset | Year | Classes | Training | Testing | Evaluation Metric | Project |
---|---|---|---|---|---|---|
PASCAL VOC 2012 | 2012 | 20 | 1,464 | 1,449 | mIoU | Project |
PASCAL Context | 2014 | 459 | 4,998 | 5,105 | mIoU | Project |
Cityscapes | 2016 | 19 | 2,975 | 500 | mIoU | Project |
ADE20k | 2017 | 150 | 25,574 | 2,000 | mIoU | Project |