<br> <p align="center"> <img src="assets/touchstone_logo.png" width="300"/> <p> <br> <div align="center"> <h1>TouchStone: Evaluating Vision-Language Models by Language Models</h1>

Paper

</div>

TouchStone is a comprehensive assessment of multimodal language models. It covers not only basic recognition and comprehension but also literary creation. By converting multimodal information into text and using strong LLMs as judges, TouchStone assesses dialogue quality efficiently and accurately, without manual intervention.

Dataset

TouchStone is a diverse and comprehensive dataset that covers five key dimensions: Basic Descriptive Ability, Visual Recognition Ability, Visual Comprehension Ability, Visual Storytelling Ability, and Multi-image Analysis Ability. You can download the dataset here.

<p align="center"> <img src="assets/datasets.jpg" width="600"/> <p>

Our dataset currently places more emphasis on assessing basic abilities. Recognition questions make up the largest share at about 44.1%, followed by comprehension questions at 29.6%. The remaining categories account for 15.3% (basic descriptive ability), 7.4% (visual storytelling ability), and 3.6% (multi-image analysis ability). The dataset contains 908 dialogues in total.
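As a quick sanity check on the proportions above, the sketch below converts each percentage into an approximate number of dialogues. Only the percentages and the total of 908 come from this README. The per-category counts are rounded estimates, not official figures, and the dictionary name `SHARES` is just an illustrative choice.

```python
# Approximate per-category counts derived from the stated percentages.
# These are rounded estimates, not numbers published with the dataset.
TOTAL_DIALOGUES = 908

SHARES = {
    "Visual Recognition Ability": 44.1,
    "Visual Comprehension Ability": 29.6,
    "Basic Descriptive Ability": 15.3,
    "Visual Storytelling Ability": 7.4,
    "Multi-image Analysis Ability": 3.6,
}

approx_counts = {
    name: round(TOTAL_DIALOGUES * pct / 100) for name, pct in SHARES.items()
}

for name, count in approx_counts.items():
    print(f"{name}: ~{count} dialogues")
```

Under this rounding the estimates are 400, 269, 139, 67, and 33, which add up to 908.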

Methods

TouchStone leverages fine-grained annotation and strong LLMs to evaluate LVLMs. First, fine-grained descriptions of the images are obtained through manual annotation and inspection. These descriptions, together with the questions, are fed into GPT-4 (text-only) to generate reference answers. Separately, each LVLM takes the visual signals and questions directly as input and generates its own answers. The generated answers, reference answers, questions, and fine-grained descriptions are then all passed to GPT-4, which scores the generated answers. The final scores are averaged and used to rank the models, representing their comprehensive performance.

<p align="center"> <img src="assets/pipeline.jpg" width="600"/> <p>
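The judging and ranking steps described above can be sketched roughly as follows. This is a minimal illustration, not the logic in eval.py: the function names `build_judge_prompt` and `rank_models`, the prompt wording, and the toy scores are all assumptions, and the call to the judge model itself is left out.

```python
from statistics import mean


def build_judge_prompt(question, description, reference_answer, model_answer):
    """Assemble a text-only judge prompt (hypothetical wording)."""
    return (
        "Image description (human-annotated):\n" + description + "\n\n"
        "Question:\n" + question + "\n\n"
        "Reference answer:\n" + reference_answer + "\n\n"
        "Candidate answer:\n" + model_answer + "\n\n"
        "Rate the candidate answer against the reference answer."
    )


def rank_models(scores_by_model):
    """Average each model's per-question scores and sort best first."""
    averaged = {model: mean(scores) for model, scores in scores_by_model.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Toy scores standing in for judge outputs; not real results.
toy_scores = {"model_a": [7.0, 8.0, 9.0], "model_b": [6.0, 6.0, 9.0]}
ranking = rank_models(toy_scores)
print(ranking)
```

Because the judge sees only text, the human-annotated description stands in for the image in the prompt.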

New Results

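| Rank | Model | Score |
|:---:|:---|:---:|
| 🏅️ | GPT-4V | 803.5 |
| 🥈 | CogVLM | 742.0 |
| 🥉 | Qwen-VL | 711.6 |
| 4 | Emu2 | 703.8 |
| 5 | mPLUG-Owl | 605.4 |
| 6 | LLaVA | 602.7 |
| 7 | LLaMA-AdapterV2 | 590.1 |
| 8 | InstructBLIP | 552.4 |
| 9 | MiniGPT4 | 531.7 |
| 10 | PandaGPT | 488.5 |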

Evaluation Results

<p align="center"> <img src="assets/touchstone_score.jpg" width="600"/> <p>

Run Evaluation

<details> <summary>Read image</summary>
import io
import base64
import pandas as pd
from PIL import Image

def decode_base64_to_image(base64_string):
    # Images are stored in the TSV as base64-encoded strings.
    image_data = base64.b64decode(base64_string)
    image = Image.open(io.BytesIO(image_data))
    return image

# Each row of the TSV holds one dialogue.
df = pd.read_csv("touchstone_20230831.tsv", sep='\t')
index = 0
row = df.iloc[index]
image = decode_base64_to_image(row['image'])
question = row['question']
human_annotation = row['human_annotation']  # fine-grained image description
gpt4_ha_answer = row['gpt4_ha_answer']      # GPT-4 reference answer
category = row['category']
task_name = row['task_name']
</details> <details> <summary>Format requirement</summary> </details>
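If pandas is not available, the same file can in principle be streamed with the standard library. The sketch below is an assumption-laden alternative, not part of the repository: `iter_examples` is a hypothetical helper, and the in-memory sample only mimics the `image`, `question`, and `category` columns. The main thing to note is that the `csv` module's default field size limit (131072 characters) is too small for base64-encoded images, so it has to be raised first.

```python
import base64
import csv
import io


def iter_examples(tsv_file):
    """Yield one dict per TSV row, with the image decoded to raw bytes."""
    # Base64 image fields far exceed the csv module's default field limit.
    csv.field_size_limit(2**31 - 1)
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row in reader:
        row["image_bytes"] = base64.b64decode(row["image"])
        yield row


# Tiny in-memory stand-in for the real TSV (contents are made up).
fake_image = base64.b64encode(b"not-a-real-image").decode("ascii")
sample_tsv = io.StringIO(
    "image\tquestion\tcategory\n"
    f"{fake_image}\tWhat is shown?\tdemo\n"
)
examples = list(iter_examples(sample_tsv))
print(examples[0]["question"], len(examples[0]["image_bytes"]))
```

For the real file, pass an open handle instead, for example `open("touchstone_20230831.tsv", newline="", encoding="utf-8")`.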

The evaluation script is provided in eval.py.

python eval.py submit_file openai_key --model-name your_model 

Citation

@misc{bai2023touchstone,
      title={TouchStone: Evaluating Vision-Language Models by Language Models}, 
      author={Shuai Bai and Shusheng Yang and Jinze Bai and Peng Wang and Xingxuan Zhang and Junyang Lin and Xinggang Wang and Chang Zhou and Jingren Zhou},
      year={2023},
      eprint={2308.16890},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}