

[CVPR2024] Prompt Highlighter: Interactive Control for Multi-Modal LLMs


<p align="center"> <a href='https://julianjuaner.github.io/projects/PromptHighlighter/'><img src='https://img.shields.io/badge/project_page-aa55dd'></a> <a href='https://arxiv.org/abs/2312.04302'><img src='https://img.shields.io/badge/arXiv_paper-ee7744'></a> </p>

This is the official implementation of the CVPR2024 paper Prompt Highlighter: Interactive Control for Multi-Modal LLMs.

Control text generation by highlighting your prompt! Prompt Highlighter is a training-free inference pipeline that supports token-level user interaction for customized generation. Our method is compatible with both LLMs and VLMs.




Quick Start

Basic environment setup:

conda create -n highlighter python=3.10 -y
conda activate highlighter
pip install -r requirements.txt


Install the latest LLaVA model (2023-11-30) in base_models. If you already have LLaVA installed, you can use the one in your own environment.

# You may also use an existing LLaVA installation if you have one.
cd base_models
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA

pip install --upgrade pip  # enable PEP 660 support
pip install -e .

Model Download: Please refer to LLaVAv1.5 Model Zoo to get the base pretrained model.

Partial Highlighting task: We provide examples in assets/test_data/questions_descriptions.json; you may add your own cases to test our method.

python examples/llava_test.py

Descriptive task (highlighting all input contexts): We provide examples in assets/test_data/questions_descriptions.json; you may add your own cases to test our method.

python examples/llava_descriptions.py

We will also provide a script for descriptive COCO caption generation (TODO here).

To add your own data, provide a square image in which the highlighted area is marked as a darker region (uint8 pixel value < 128), then add your case to the JSON file.
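As a rough illustration of the mask convention described above, the sketch below thresholds a grayscale pixel grid at 128 to recover the highlighted region. The helper name `highlight_region` and the nested-list input are assumptions made for illustration; the repository's own image loading code may work differently (for example, on RGB images).

```python
from typing import List


def highlight_region(pixels: List[List[int]], threshold: int = 128) -> List[List[bool]]:
    """Mark pixels darker than `threshold` as highlighted.

    Hypothetical helper: `pixels` is a square grid of grayscale values in 0..255.
    """
    size = len(pixels)
    if size == 0 or any(len(row) != size for row in pixels):
        raise ValueError("expected a non-empty square image")
    return [[value < threshold for value in row] for row in pixels]


# Synthetic 4x4 example: white background with a dark 2x2 marked region.
image = [[255] * 4 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2):
        image[r][c] = 0

region = highlight_region(image)
highlighted = sum(cell for row in region for cell in row)
print(highlighted)  # count of highlighted pixels
```

A non-square input raises `ValueError`, which mirrors the square-image requirement stated above.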

Benchmark Test: Please refer to evaluation data to get your benchmark dataset (MMBench & MME). Benchmark result:

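| Method | MME | MMBench (dev) | MMBench (test) |
|---|---|---|---|
| Baseline (LLaVAv1.5-13B) | 1531.3 | 67.7 | 67.0 |
| Ours (Official Reported) | 1552.5 | 69.7 | 69.5 |
| Ours (This Repo) | 1552.5 | 70.1 | 70.7 |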

For MMBench, you may change your hyper-params in the following script and run:

bash examples/eval_scripts/mmbench_dev_hl.sh
bash examples/eval_scripts/mmbench_test_hl.sh

For MME:

bash examples/eval_scripts/mme_hl.sh

You can find the evaluated metrics at base_models/LLaVA/playground/data/eval/MME/eval_tool/answers/llava-v1.5-13b-hl-1.3-2.0-0.01/eval.log

Vicuna (LLaMA-based LLMs)

We provide a script to test partial highlighting on pure language input. Download the Vicuna model; we use Vicuna-13B-v1.1. You may switch to any LLaMA-based LLM, in which case you will also need to change the conversation prompt template. Please follow the instructions above to install LLaVA in base_models. If you have already installed LLaVA, you can test directly with the script:

python examples/llama_test.py \
    --txt "Please write a summary of A Mid-Summer Nights' Dream, make it compact." \
    --hl "make it compact."

Here you can change the input prompt and the highlighted segments by passing --txt and --hl, respectively. To pass multiple highlighted segments, separate them with <s>. For example, pass --hl "write a summary<s>make it compact." to highlight both requirements.
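To make the `<s>` convention concrete, the sketch below splits a highlight string on `<s>` and locates each segment inside the prompt by exact substring search. The helper `find_highlight_spans` is hypothetical and operates on characters; how examples/llama_test.py actually maps segments to tokens is not shown here.

```python
from typing import List, Tuple


def find_highlight_spans(text: str, hl: str, sep: str = "<s>") -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of each highlighted segment in `text`.

    Hypothetical helper: segments are matched by exact substring search,
    and only the first occurrence of each segment is used.
    """
    spans = []
    for segment in hl.split(sep):
        segment = segment.strip()
        if not segment:
            continue  # ignore empty pieces such as a trailing separator
        start = text.find(segment)
        if start == -1:
            raise ValueError(f"highlight segment not found in text: {segment!r}")
        spans.append((start, start + len(segment)))
    return spans


text = "Please write a summary of A Mid-Summer Nights' Dream, make it compact."
spans = find_highlight_spans(text, "write a summary<s>make it compact.")
for start, end in spans:
    print(text[start:end])
```

Raising an error for a segment that does not occur in the prompt is a design choice of this sketch; it surfaces typos in `--hl` early instead of silently highlighting nothing.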


InstructBLIP

Install the latest LAVIS (2023-11-30) in base_models. If you already have LAVIS installed, you can use the one in your own environment.

To run the InstructBLIP-Vicuna, you need to add the LLM path (vicuna-13b v1.1) to the key llm_model in the configuration file base_models/LAVIS/lavis/configs/models/blip2/blip2_instruct_vicuna13b.yaml.

# Please install with your highlighter env activated.
cd base_models
git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
pip install -e .

Partial Highlighting task: Run the examples in assets/test_data/questions_descriptions.json; you may add your own cases to test our method.

Note: Here, we only implement the highlighting mechanism in the QFormer. We may release a hybrid highlighting version (visual and text tokens) in the future.

python examples/instructblip_test.py




<p align="center"> <!-- pypi-strip --> <picture> <source media="(prefers-color-scheme: dark)" srcset="assets/pipeline_dark.png"> <source media="(prefers-color-scheme: light)" srcset="assets/pipeline.png"> <!-- /pypi-strip --> <img alt="pipeline" src="assets/pipeline.png" width="100%"> <!-- pypi-strip --> </picture><br> <!-- /pypi-strip --> </p>

An abstract pipeline of Prompt Highlighter. Users control the focus of generation by marking specific image regions or text spans. A token-level mask $\mathbf{m}$ is then created to guide the language model's inference. Motivated by classifier-free diffusion guidance, we form regular and unconditional context pairs based on the highlighted tokens, demonstrating that autoregressive generation can be guided in a classifier-free way. Notably, we find that, during inference, guiding the model with highlighted tokens through the attention weights leads to more desirable outputs.
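The guidance idea can be illustrated with the generic classifier-free guidance rule, in which next-token logits from the regular context and from the unconditional context are extrapolated with a guidance scale. The sketch below is not the repository's implementation: the function names, the scale value, and the toy logits are all assumptions, and the attention-weight adjustment mentioned above is not shown.

```python
import math
from typing import List


def guided_logits(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Generic classifier-free guidance: uncond + scale * (cond - uncond).

    scale = 1.0 returns `cond` unchanged; larger values move the logits
    further in the direction that separates `cond` from `uncond`.
    """
    if len(cond) != len(uncond):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of three tokens (values are illustrative only).
cond = [2.0, 1.0, 0.0]    # logits from the regular context
uncond = [1.0, 1.0, 0.0]  # logits from the unconditional context

probs_plain = softmax(cond)
probs_guided = softmax(guided_logits(cond, uncond, scale=1.5))
print(probs_guided[0] > probs_plain[0])
```

In this toy case only the first token's logit differs between the two contexts, so guidance raises that token's probability relative to sampling from the regular context alone.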

Cite Prompt Highlighter

If you find this repo useful for your research, please consider citing the paper

  title={Prompt Highlighter: Interactive Control for Multi-Modal LLMs},
  author={Zhang, Yuechen and Qian, Shengju and Peng, Bohao and Liu, Shu and Jia, Jiaya},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},


We would like to thank the following repos for their great work: