# Awesome Prompting on Vision-Language Models

<img src="./assets/pvlm-mindmap.png" width="100%" height="100%">

# :nerd_face: What is Prompting on Vision-Language Models?

Prompt engineering is a technique that involves augmenting a large pre-trained model with task-specific hints, known as prompts, to adapt the model to new tasks. This repo aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models (VLMs): multimodal-to-text generation models (e.g., Flamingo), image-text matching models (e.g., CLIP), and text-to-image generation models (e.g., Stable Diffusion) (Fig. 1).

<img src="./assets/3-models.png"> <p align="center"> <i>Fig. 1: This work focuses on three main types of vision-language models.</i> </p>

# :page_with_curl: Reference

This repo lists relevant papers summarized in our survey:

A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models. Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, Philip Torr. Preprint 2023. [pdf]

If you find our paper and repo helpful to your research, please cite the following paper:

@article{gu2023survey,
  title={A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models},
  author={Gu, Jindong and Han, Zhen and Chen, Shuo and Beirami, Ahmad and He, Bailan and Zhang, Gengyuan and Liao, Ruotong and Qin, Yao and Tresp, Volker and Torr, Philip},
  journal={arXiv preprint arXiv:2307.12980},
  year={2023}
}

# :paperclips: Awesome Papers

## Prompting Models in Multimodal-to-Text Generation (e.g., on Flamingo)

Based on how the visual and textual modalities are fused, these models follow two main architectures: an encoder-decoder as the multi-modal fusion module, or a decoder-only model as the multi-modal fusion module. Prompting methods can be divided into two main categories (Fig. 2) based on the readability of the template: hard prompts and soft prompts. Hard prompts encompass four subcategories: task instruction, in-context learning, retrieval-based prompting, and chain-of-thought prompting. Soft prompts fall into two strategies, prompt tuning and prefix token tuning, depending on whether the learnable tokens are simply appended to the input or injected into the model's internal layers. This survey primarily concentrates on prompting methods that avoid altering the base model; a minimal sketch of the soft-prompt idea follows Fig. 2.

<img src="./assets/chapt3_prompting_method.png"> <p align="center"> <i>Fig. 2 : Classification of prompting methods.</i> </p>
| Title | Venue | Year | Code (if available) | Comment |
|---|---|---|---|---|
| Unifying Vision-and-Language Tasks via Text Generation | ICML | 2021 | Github | Encoder-decoder fusion; Text prefixes as prompt |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | ICLR | 2022 | Github | Encoder-decoder fusion; Text prefixes as prompt |
| OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | ICML | 2022 | Github | Encoder-decoder fusion; Text prefixes as prompt |
| PaLI: A Jointly-Scaled Multilingual Language-Image Model | ICLR | 2023 | --- | Encoder-decoder fusion; Instruction prompt |
| Multimodal Few-Shot Learning with Frozen Language Models | NeurIPS | 2021 | Page | Decoder-only fusion; Image-conditional prefix tuning |
| Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS | 2022 | Github | Decoder-only fusion; Text prompts |
| MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning | EMNLP | 2022 | Github | Decoder-only fusion; Image-conditional prefix tuning |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ICML | 2023 | Github | Decoder-only fusion; Image-conditional prefix tuning |
| Language Models are Unsupervised Multitask Learners | OpenAI Blog | 2019 | Github | Task instruction prompt |
| The Turking Test: Can Language Models Understand Instructions? | arXiv | 2020 | --- | Task instruction prompt |
| Language Models are Few-Shot Learners | NeurIPS | 2020 | --- | In-context learning |
| Learning To Retrieve Prompts for In-Context Learning | NAACL-HLT | 2022 | Github | Retrieval-based prompting |
| Unified Demonstration Retriever for In-Context Learning | ACL | 2023 | Github | Retrieval-based prompting |
| Compositional Exemplars for In-context Learning | ICML | 2023 | Github | Retrieval-based prompting |
| Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | NeurIPS | 2022 | --- | Chain-of-thought prompting |
| Automatic Chain of Thought Prompting in Large Language Models | ICLR | 2023 | Github | Chain-of-thought prompting |
| The Power of Scale for Parameter-Efficient Prompt Tuning | EMNLP | 2021 | --- | Prompt tuning |
| Learning How to Ask: Querying LMs with Mixtures of Soft Prompts | NAACL-HLT | 2021 | Github | Prompt tuning |
| Prefix-Tuning: Optimizing Continuous Prompts for Generation | ACL | 2021 | Github | Prefix tuning |
| Prompt Tuning for Generative Multimodal Pretrained Models | ACL | 2023 | Github | Prompt tuning on OFA |
| Language Is Not All You Need: Aligning Perception with Language Models | NeurIPS | 2023 | Github | Textual instruction prompts |
| Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models | NeurIPS | 2024 | Page | Robustness of prompt tuning on VLMs |
| Towards Robust Prompts on Vision-Language Models | NextGenAISafety@ICLR | 2024 | --- | Robustness of prompt tuning on VLMs |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | NeurIPS | 2023 | Github | Prompt tuning |
| Visual Instruction Tuning | NeurIPS | 2023 | Github |  |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | arXiv | 2023 | Github | Prompt tuning |
| Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic | arXiv | 2023 | Github |  |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | ICLR | 2023 | Github | Prompt tuning |

## Prompting Models in Image-Text Matching (e.g., on CLIP)

Depending on the target of prompting, existing methods can be classified into three categories: prompting the text encoder, prompting the visual encoder, or jointly prompting both branches, as shown in Fig. 3. These approaches aim to enhance the flexibility and task-specific performance of VLMs; a minimal sketch of the two single-branch options follows Fig. 3.

<img src="./assets/chapt4_prompting_method.png"> <p align="center"> <i>Fig. 2: Classification of prompting methods on Image-Text Matching VLMs. </i> </p>
| Title | Venue | Year | Code (if available) | Comment |
|---|---|---|---|---|
| Learning Transferable Visual Models From Natural Language Supervision | ICML | 2021 | Github | Hard text prompts; Prompts for image classification |
| Delving into the Openness of CLIP | ACL | 2023 | Github | Hard text prompts for understanding |
| Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models | NeurIPS | 2022 | Github | Soft text prompts |
| Learning to Prompt for Vision-Language Models | IJCV | 2022 | Github | Soft text prompts |
| Prompting Visual-Language Models for Efficient Video Understanding | ECCV | 2022 | Github | Soft text prompts |
| Multitask Vision-Language Prompt Tuning | WACV | 2024 | Github | Soft text prompts |
| Conditional Prompt Learning for Vision-Language Models | CVPR | 2022 | Github | Soft text prompts |
| Visual Prompt Tuning | ECCV | 2022 | Github | Visual patch-wise prompts |
| Exploring Visual Prompts for Adapting Large-Scale Models | arXiv | 2022 | Github | Visual patch-wise prompts |
| Multitask Vision-Language Prompt Tuning | WACV | 2024 | Github | Visual patch-wise prompts |
| Unleashing the Power of Visual Prompting At the Pixel Level | TMLR | 2024 | Github | Visual patch-wise prompts |
| Diversity-Aware Meta Visual Prompting | CVPR | 2023 | Github | Visual patch-wise prompts |
| CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models | AI Open | 2024 | Github | Visual annotation prompts |
| What does CLIP know about a red circle? Visual prompt engineering for VLMs | ICCV | 2023 | --- | Visual annotation prompts |
| Visual Prompting via Image Inpainting | NeurIPS | 2022 | Github | Visual annotation prompts |
| Unified Vision and Language Prompt Learning | arXiv | 2023 | Github | Coupled unified prompting |
| Multitask Vision-Language Prompt Tuning | WACV | 2024 | Github | Decoupled unified prompting |
| MaPLe: Multi-modal Prompt Learning | CVPR | 2023 | Github | Decoupled unified prompting |
| Understanding Zero-shot Adversarial Robustness for Large-Scale Models | ICLR | 2023 | Code | Adversarial robustness of prompts |
| Visual Prompting for Adversarial Robustness | ICASSP | 2023 | Github | Adversarial robustness of prompts |
| Align before Fuse: Vision and Language Representation Learning with Momentum Distillation | NeurIPS | 2021 | Github | Image-text matching model |
| Unsupervised Prompt Learning for Vision-Language Models | arXiv | 2022 | Github | Unsupervised learnable prompts |
| Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models | NeurIPS | 2022 | Github | Learnable prompts |
| Prompt Pre-Training with Over Twenty-Thousand Classes for Open-Vocabulary Visual Recognition | NeurIPS | 2023 | Github | Prompt pre-training |
| Consistency-guided Prompt Learning for Vision-Language Models | ICLR | 2024 | --- | Decoupled unified prompting |
| Improving Adaptability and Generalizability of Efficient Transfer Learning for Vision-Language Models | ACL ARR | 2024 | --- | Learnable prompts |

### Applications & Responsible AI

| Title | Venue | Year | Code (if available) | Comment |
|---|---|---|---|---|
| LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition | ALVR | 2024 | Github | Prompts for long-tailed multi-label image classification |
| Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models | NeurIPS | 2022 | Github | Learnable prompts; Prompts for image classification |
| LPT: Long-tailed Prompt Tuning for Image Classification | ICLR | 2023 | Github | Prompts for long-tailed image classification |
| Texts as Images in Prompt Tuning for Multi-Label Image Recognition | CVPR | 2023 | Github | Prompts for multi-label image classification and detection |
| DualCoOp: Fast Adaptation to Multi-Label Recognition with Limited Annotations | NeurIPS | 2022 | Github | Prompts for multi-label image classification and recognition |
| Visual Prompt Tuning for Few-Shot Text Classification | ICCL | 2022 | --- | Visual prompts for text classification |
| Open-vocabulary Object Detection via Vision and Language Knowledge Distillation | ICLR | 2021 | Github | Prompts for object detection |
| Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model | CVPR | 2022 | Github | Prompts for object detection |
| PromptDet: Towards Open-vocabulary Detection using Uncurated Images | ECCV | 2022 | Github | Prompts for object detection |
| Optimizing Continuous Prompts for Visual Relationship Detection by Affix-Tuning | IEEE Access | 2022 | --- | Soft prompts for visual relation detection |
| Towards Open-vocabulary Scene Graph Generation with Prompt-based Finetuning | ECCV | 2022 | --- | Soft prompts for visual relation detection |
| Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection | ICLR | 2023 | Github | Relation prompts for open-vocabulary video relation detection |
| DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting | CVPR | 2022 | Github | Class-conditioned text prompts for semantic segmentation |
| Segment Anything | ICCV | 2023 | Github | Promptable queries for semantic segmentation |
| Domain Adaptation via Prompt Learning | IEEE | 2023 | Github | Domain-specific textual prompts for domain adaptation |
| Visual Prompt Tuning for Test-time Domain Adaptation | arXiv | 2022 | --- | Prompts for domain adaptation |
| Learning to Prompt for Continual Learning | CVPR | 2022 | Github | Prompts for continual learning |
| DualPrompt: Complementary Prompting for Rehearsal-free Continual Learning | ECCV | 2022 | Github | Prompts for continual learning |
| Prompt Vision Transformer for Domain Generalization | arXiv | 2022 | Github | Prompts for domain generalization |
| Understanding Zero-Shot Adversarial Robustness for Large-Scale Models | ICLR | 2023 | Github | Visual prompt tuning under adversarial attack |
| Visual Prompting for Adversarial Robustness | ICASSP | 2023 | Github | Visual prompting to improve adversarial robustness |
| Exploring the Universal Vulnerability of Prompt-based Learning Paradigm | NAACL | 2022 | Github | Visual prompting vulnerability |
| Poisoning and Backdooring Contrastive Learning | ICLR | 2022 | --- | Backdoor and poisoning attacks on CLIP |
| BadEncoder: Backdoor Attacks to Pre-trained Encoders in Self-Supervised Learning | IEEE S&P | 2022 | Github | Backdoor attack on CLIP |
| CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning | ICLR Workshop | 2023 | --- | Defense against backdoor attacks on CLIP |
| Debiasing Vision-Language Models via Biased Prompts | arXiv | 2023 | Github | Prompts to alleviate bias |

## Prompting Models in Text-to-Image Generation (e.g., on Stable Diffusion)

| Title | Venue | Year | Code (if available) | Comment |
|---|---|---|---|---|
| Diffusion Models Beat GANs on Image Synthesis | NeurIPS | 2021 | Github | Diffusion models on image generation |
| Denoising Diffusion Probabilistic Models | NeurIPS | 2020 | Github | Diffusion models on image generation |
| SuS-X: Training-Free Name-Only Transfer of Vision-Language Models | ICCV | 2023 | Github | Diffusion models on image generation |
| Investigating Prompt Engineering in Diffusion Models | NeurIPS Workshop | 2022 | --- | Semantic prompt design |
| DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models | ICCV | 2023 | Github | Diversify generation with prompts; Prompts for synthetic data generation |
| Is synthetic data from generative models ready for image recognition? | ICLR | 2023 | Github | Diversify generation with prompts |
| An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion | ICLR | 2023 | Github | Complex control of synthesis results via prompts |
| DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation | CVPR | 2023 | Github | Complex control of synthesis results via prompts |
| Multi-Concept Customization of Text-to-Image Diffusion | CVPR | 2023 | Github | Complex control of synthesis results via prompts |
| Prompt-to-Prompt Image Editing with Cross Attention Control | ICLR | 2023 | Github | Complex control of synthesis results via prompts |
| Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis | ICLR | 2023 | Github | Controllable text-to-image generation |
| Diffusion Self-Guidance for Controllable Image Generation | NeurIPS | 2023 | Page | Controllable text-to-image generation |
| Imagic: Text-Based Real Image Editing with Diffusion Models | CVPR | 2023 | Github | Controllable text-to-image generation |
| Adding Conditional Control to Text-to-Image Diffusion Models | ICCV | 2023 | Github | Controllable text-to-image generation |
| ImaginaryNet: Learning Object Detectors without Real Images and Annotations | ICLR | 2023 | Github | Prompts for synthetic data generation |
| Is synthetic data from generative models ready for image recognition? | ICLR | 2023 | Github | Prompts for synthetic data generation |
| Make-A-Video: Text-to-Video Generation without Text-Video Data | ICLR | 2023 | Page | Prompts for text-to-video generation |
| Imagen Video: High Definition Video Generation with Diffusion Models | arXiv | 2022 | Page | Prompts for text-to-video generation |
| FateZero: Fusing Attentions for Zero-shot Text-based Video Editing | ICCV | 2023 | Github | Prompts for text-to-video generation |
| Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation | ICCV | 2023 | Github | Prompts for text-to-video generation |
| DiffRF: Rendering-Guided 3D Radiance Field Diffusion | CVPR | 2023 | Page | Prompts for text-to-3D generation |
| DreamFusion: Text-to-3D using 2D Diffusion | ICLR (notable top 5%) | 2023 | Page | Prompts for text-to-3D generation |
| Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models | CVPR | 2023 | Page | Prompts for text-to-3D generation |
| MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model | IEEE | 2024 | Page | Prompts for text-to-motion generation |
| FLAME: Free-form Language-based Motion Synthesis & Editing | AAAI | 2023 | Github | Prompts for text-to-motion generation |
| MDM: Human Motion Diffusion Model | ICLR | 2023 | Github | Prompts for text-to-motion generation |
| Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models | arXiv | 2023 | --- | Prompts for complex tasks |
| Multimodal Procedural Planning via Dual Text-Image Prompting | ICLR | 2024 | Github | Prompts for complex tasks |
| Prompt Stealing Attacks Against Text-to-Image Generation Models | USENIX Security Symposium | 2023 | --- | Prompt stealing attacks on text-to-image models |
| Membership Inference Attacks Against Text-to-image Generation Models | ICLR | 2023 | --- | Membership attacks against text-to-image models |
| Are Diffusion Models Vulnerable to Membership Inference Attacks? | ICML | 2023 | Github | Membership attacks against text-to-image models |
| A Reproducible Extraction of Training Images from Diffusion Models | arXiv | 2023 | Github | Membership attacks against text-to-image models |
| Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness | arXiv | 2023 | Github | Prompts on text-to-image models considering fairness |
| Social Biases through the Text-to-Image Generation Lens | AAAI/ACM | 2023 | --- | Prompts on text-to-image models considering biases |
| T2IAT: Measuring Valence and Stereotypical Biases in Text-to-Image Generation | ACL | 2023 | --- | Prompts on text-to-image models considering biases |
| Stable Bias: Analyzing Societal Representations in Diffusion Models | NeurIPS | 2023 | --- | Prompts on text-to-image models considering biases |
| A Pilot Study of Query-Free Adversarial Attack Against Stable Diffusion | CVPR | 2023 | --- | Adversarial robustness of text-to-image models |
| Diffusion Models for Imperceptible and Transferable Adversarial Attack | ICLR | 2024 | Github | Adversarial robustness of text-to-image models |
| Diffusion Models for Adversarial Purification | ICML | 2022 | Github | Adversarial robustness of text-to-image models |
| Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis | ICCV | 2023 | --- | Backdoor attack on text-to-image models |
| Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning | ACM MM | 2023 | --- | Backdoor attack on text-to-image models |
| Personalization as a Shortcut for Few-Shot Backdoor Attack against Text-to-Image Diffusion Models | AAAI | 2024 | --- | Backdoor attack on text-to-image models |

# :mailbox_with_mail: Contact

Please contact us (jindong.gu@outlook.com, chenshuo.cs@outlook.com) if you have any questions or suggestions.