Visual Table
This is the official PyTorch implementation of Visual Table (EMNLP 2024).
<p align="center"> <img src="docs/teaser_figure.png" width=97% height=97% class="center"> </p>

Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models [Paper] <br>
Yiwu Zhong<sup>*1</sup>, Zi-Yuan Hu<sup>*1,2</sup>, Michael R. Lyu<sup>1</sup>, Liwei Wang<sup>#1</sup> <br>
<sup>1</sup>The Chinese University of Hong Kong, <sup>2</sup>Shanghai AI Laboratory <br>
(<sup>*</sup> equal contributions, <sup>#</sup> corresponding author) <br>
Overview
Visual representation learning has been a cornerstone in computer vision, involving typical forms such as visual embeddings, structural symbols, and text-based representations. Despite the success of CLIP-type visual embeddings, they often lack access to world knowledge critical for visual reasoning. In this work, we propose Visual Table, a novel form of visual representation tailored for general visual reasoning.
- Visual tables are constructed as hierarchical descriptions of visual scenes, featuring a scene description and multiple descriptions covering categories, attributes, and knowledge for individual object instances.
- Thanks to the structural and textual formats, visual tables offer unique properties over mere visual embeddings, such as explainability and controllable editing. Furthermore, they deliver instance-level world knowledge and detailed attributes that are essential for visual reasoning.
- To create visual tables, we design prompts for GPT4V to produce annotations on 61K images. We then develop a generator trained on these annotations. The annotations and the trained generator are released to the public.
- Extensive results on 11 visual reasoning benchmarks demonstrate that the generated visual tables significantly outperform previous structural and text-based representations. Moreover, they consistently enhance state-of-the-art multimodal large language models across diverse benchmarks, showcasing their potential for advancing visual reasoning tasks.
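For intuition, a generated visual table can be viewed as structured text along the lines of the sketch below. The field names here are illustrative assumptions chosen for readability; the released annotations define the exact schema.

```python
# Illustrative sketch of a visual table (field names are assumptions, not the exact released schema).
visual_table = {
    "scene description": "A cyclist rides along a wet city street at dusk.",
    "objects": [
        {
            "object category": "bicycle",
            "attribute description": "red road bike with thin tires and a slightly muddy frame",
            "knowledge description": "a human-powered vehicle commonly used for commuting and exercise",
        },
        {
            "object category": "traffic light",
            "attribute description": "mounted on a pole, currently showing green",
            "knowledge description": "signals right of way; green indicates traffic may proceed",
        },
    ],
}
```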
Updates
- [10/6] 🔥 Visual Table is now released! You can download the visual table annotations and the pre-trained visual table generator from Huggingface. More details can be found below.
Install
Note: For more details about the environment installation, please refer to LLaVA.
- Clone this repository and navigate to Visual-Table folder
git clone https://github.com/LaVi-Lab/Visual-Table
cd Visual-Table
- Install Package
conda create -n vt python=3.10 -y
conda activate vt
pip install --upgrade pip # enable PEP 660 support
pip install -e .
- Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
Data & Weight Preparation
- Download `data_VT.zip` from Huggingface or Google Drive.
  Notes:
  - Unzip `data_VT.zip` to `./playground/data_VT`. This zip file contains all materials required for training and evaluating the visual table generator and our multi-modal LLMs. The visual table annotations have already been converted into instruction tuning samples for training the generator.
  - If you only need the visual table annotations, you can just download `visual_table_annotations.json`. This file includes 61K visual table annotations collected from GPT4V, as well as a few visual tables from other datasets for ablation purposes (e.g., GQA, MM-Vet, MMMU, MMVP). The annotation scripts can be found in `preprocess/collect_gpt4v_VT/gpt4v.py`. A minimal loading sketch is provided at the end of this section.
The following is the file structure for your convenience:
<details> <summary>Click for more details... </summary>

```
./playground/data_VT
├── eval
│   ├── gqa
│   │   ├── answers
│   │   │   └── gqa_with_VTGenerator-13B_gen_vt
│   │   │       └── LLaVA-VT-13B
│   │   │           ├── merge.jsonl
│   │   │           └── testdev_balanced_predictions.json
│   │   └── data
│   │       └── gqa_with_VTGenerator-13B_gen_vt.jsonl
│   ├── llava-bench-in-the-wild
│   │   ├── answers
│   │   │   └── llavabench_with_VTGenerator-13B_gen_vt
│   │   │       └── LLaVA-VT-13B
│   │   │           └── LLaVA-VT-13B.jsonl
│   │   ├── llavabench_with_VTGenerator-13B_gen_vt.jsonl
│   │   └── reviews
│   │       └── llavabench_with_VTGenerator-13B_gen_vt
│   │           └── LLaVA-VT-13B
│   │               └── LLaVA-VT-13B_gpt-3.5-turbo-1106.jsonl
│   ├── mmbench
│   │   ├── answers
│   │   │   └── mmbench_with_VTGenerator-13B_gen_vt
│   │   │       └── LLaVA-VT-13B
│   │   │           └── LLaVA-VT-13B.jsonl
│   │   └── answers_upload
│   │       └── mmbench_with_VTGenerator-13B_gen_vt
│   │           └── LLaVA-VT-13B
│   │               └── LLaVA-VT-13B.xlsx
│   ├── mmmu
│   │   ├── answers
│   │   │   └── mmmu_with_VTGenerator-13B_gen_vt
│   │   │       └── LLaVA-VT-13B
│   │   │           └── gpt_eval_gpt-3.5-turbo-1106
│   │   │               └── mmmu_with_VTGenerator-13B_gen_vt_LLaVA-VT-13B.jsonl
│   │   ├── mmmu.jsonl
│   │   └── mmmu_with_VTGenerator-13B_gen_vt.jsonl
│   ├── mm-vet
│   │   ├── mmvet_with_VTGenerator-13B_gen_vt
│   │   │   ├── answers
│   │   │   │   └── LLaVA-VT-13B
│   │   │   │       └── mmvet_with_VTGenerator-13B_gen_vt_LLaVA-VT-13B.jsonl
│   │   │   └── results
│   │   │       └── LLaVA-VT-13B
│   │   │           ├── mmvet_with_VTGenerator-13B_gen_vt_LLaVA-VT-13B_gpt-4-32k-0613-cap-int-score-1runs.csv
│   │   │           ├── mmvet_with_VTGenerator-13B_gen_vt_LLaVA-VT-13B_gpt-4-32k-0613-cap-score-1runs.csv
│   │   │           ├── mmvet_with_VTGenerator-13B_gen_vt_LLaVA-VT-13B_gpt-4-32k-0613-grade-1runs.json
│   │   │           └── mmvet_with_VTGenerator-13B_gen_vt_LLaVA-VT-13B.json
│   │   └── mmvet_with_VTGenerator-13B_gen_vt.jsonl
│   ├── mmvp_mc
│   │   ├── answers
│   │   │   └── mmvp_mc_with_VTGenerator-13B_gen_vt
│   │   │       └── LLaVA-VT-13B
│   │   │           └── mmvp_mc_with_VTGenerator-13B_gen_vt_LLaVA-VT-13B.jsonl
│   │   ├── mmvp_mc.jsonl
│   │   └── mmvp_mc_with_VTGenerator-13B_gen_vt.jsonl
│   ├── pope
│   │   ├── answers
│   │   │   └── pope_with_VTGenerator-13B_gen_vt
│   │   │       └── LLaVA-VT-13B
│   │   │           └── merge.jsonl
│   │   ├── pope_coco_commitID_e3e39262c85a6a83f26cf5094022a782cb0df58d
│   │   │   ├── coco_pope_adversarial.json
│   │   │   ├── coco_pope_popular.json
│   │   │   └── coco_pope_random.json
│   │   └── pope_with_VTGenerator-13B_gen_vt.jsonl
│   ├── scienceqa
│   │   ├── answers
│   │   │   └── scienceqa_with_VTGenerator-13B_gen_vt
│   │   │       └── LLaVA-VT-13B
│   │   │           ├── LLaVA-VT-13B.jsonl
│   │   │           ├── LLaVA-VT-13B_output.jsonl
│   │   │           └── LLaVA-VT-13B_result.json
│   │   └── scienceqa_with_VTGenerator-13B_gen_vt.json
│   ├── textvqa
│   │   ├── answers
│   │   │   └── textvqa_with_VTGenerator-13B_gen_vt
│   │   │       └── LLaVA-VT-13B
│   │   │           └── merge.jsonl
│   │   └── textvqa_with_VTGenerator-13B_gen_vt.jsonl
│   ├── vizwiz
│   │   ├── answers
│   │   │   └── vizwiz_with_VTGenerator-13B_gen_vt
│   │   │       └── LLaVA-VT-13B
│   │   │           └── merge.jsonl
│   │   ├── answers_upload
│   │   │   └── vizwiz_with_VTGenerator-13B_gen_vt
│   │   │       └── LLaVA-VT-13B
│   │   │           └── LLaVA-VT-13B.json
│   │   └── vizwiz_with_VTGenerator-13B_gen_vt.jsonl
│   └── vqav2
│       ├── answers_upload
│       │   └── vqav2_dev_with_VTGenerator-13B_gen_vt
│       │       └── LLaVA-VT-13B.json
│       └── vqav2_dev_with_VTGenerator-13B_gen_vt.jsonl
├── eval_images_gen_vt
│   ├── gqa_gen_vt
│   │   └── VTGenerator-13B
│   │       └── merge.jsonl
│   ├── gqa_gen_vt.jsonl
│   ├── llavabench_gen_vt
│   │   └── VTGenerator-13B
│   │       └── merge.jsonl
│   ├── llavabench_gen_vt.jsonl
│   ├── mmbench_gen_vt
│   │   └── VTGenerator-13B
│   │       └── merge.jsonl
│   ├── mmbench_gen_vt.jsonl
│   ├── mmmu_gen_vt
│   │   └── VTGenerator-13B
│   │       └── merge.jsonl
│   ├── mmmu_gen_vt.jsonl
│   ├── mmvet_gen_vt
│   │   └── VTGenerator-13B
│   │       └── merge.jsonl
│   ├── mmvet_gen_vt.jsonl
│   ├── mmvp_gen_vt
│   │   └── VTGenerator-13B
│   │       └── merge.jsonl
│   ├── mmvp_gen_vt.jsonl
│   ├── pope_gen_vt
│   │   └── VTGenerator-13B
│   │       └── merge.jsonl
│   ├── pope_gen_vt.jsonl
│   ├── scienceqa_gen_vt
│   │   └── VTGenerator-13B
│   │       └── merge.jsonl
│   ├── scienceqa_gen_vt.jsonl
│   ├── textvqa_gen_vt
│   │   └── VTGenerator-13B
│   │       └── merge.jsonl
│   ├── textvqa_gen_vt.jsonl
│   ├── vizwiz_gen_vt
│   │   └── VTGenerator-13B
│   │       └── merge.jsonl
│   ├── vizwiz_gen_vt.jsonl
│   ├── vqav2_gen_vt
│   │   └── VTGenerator-13B
│   │       └── merge.jsonl
│   └── vqav2_gen_vt.jsonl
├── gpt_eval
│   ├── gqa
│   │   ├── gqa.jsonl
│   │   └── gqa_with_gpt4v_vt.jsonl
│   ├── mmmu
│   │   ├── mmmu.jsonl
│   │   └── mmmu_with_gpt4v_vt.jsonl
│   ├── mmvet
│   │   ├── mmvet.jsonl
│   │   └── mmvet_with_gpt4v_vt.jsonl
│   └── mmvp
│       ├── mmvp.jsonl
│       └── mmvp_with_gpt4v_vt.jsonl
├── README.md
├── train_images_gen_vt
│   ├── llava_instruct_mix665k_coco_gen_vt.jsonl
│   ├── llava_instruct_mix665k_ocrvqa_gen_vt.jsonl
│   ├── llava_instruct_mix665k_textcap_gen_vt.jsonl
│   ├── llava_instruct_mix665k_vg_gen_vt.jsonl
│   └── VTGenerator-13B_VT_292k.json
├── train_LLaVA-VT
│   └── llava_instruct_mix665k_with_VT.json
└── train_VTGenerator
    ├── finetune_VTGenerator_gpt4v_VT_61k.json
    └── pretrain_VTGenerator_llava_instruct_mix199k.json
```
</details>
- Following LLaVA, download the data for visual instruction tuning.
The following is the file structure for your convenience:
<details> <summary>Click for more details... </summary>

```
./playground/data
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
```
</details>
- Following LLaVA/docs/Evaluation.md, download the data and `eval.zip` for evaluation. For MMMU and MMVP, download the data from their official repos.
- Download the pretrained model weights listed below. Note: you need to update the paths to these pretrained weights in the training scripts.

  | Pretrained Model Weight | Download Link |
  |---|---|
  | lmsys/vicuna-13b-v1.5 | Huggingface |
  | openai/clip-vit-large-patch14-336 | Huggingface |
  | liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5 | Huggingface |
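If you only need the raw annotations, the minimal sketch below loads `visual_table_annotations.json` and inspects one entry. It assumes the file maps image identifiers to visual-table text and is placed under `./playground/data_VT`; adjust the path and key handling to match the file you downloaded.

```python
import json

# Assumed location; adjust to wherever you downloaded visual_table_annotations.json.
ANNOTATION_PATH = "./playground/data_VT/visual_table_annotations.json"

with open(ANNOTATION_PATH, "r") as f:
    annotations = json.load(f)

print(f"Loaded {len(annotations)} visual table annotations")

# Peek at one entry; the image-id -> visual-table-text layout is an assumption.
first_key = next(iter(annotations))
print(first_key)
print(str(annotations[first_key])[:500])
```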
VTGenerator
After collecting the visual table annotations, we train a VTGenerator to produce visual tables for input images. Specifically, training consists of a pre-training stage and a fine-tuning stage, described below (more details can be found in Sec. 3.2 of the paper). Once trained, VTGenerator can generate visual tables for downstream tasks such as VQA.
VTGenerator Training
For quick usage, we have provided the checkpoints of our VTGenerator as follows:
Model | Download Link |
---|---|
VTGenerator-13B (Preferred) | Huggingface |
VTGenerator-7B | Huggingface |
If you want to reproduce the training of VTGenerator, you can use the following scripts:
Pretraining stage:
Details are provided in `./scripts/VTGenerator/train/pretrain_VTGenerator-Pretrained-13B.sh`.
mkdir -p scripts/log/VTGenerator
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/VTGenerator/train/pretrain_VTGenerator-Pretrained-13B.sh 2>&1 | tee -a scripts/log/VTGenerator/pretrain_VTGenerator-Pretrained-13B.txt
Fine-tuning stage:
Details are provided in `./scripts/VTGenerator/train/finetune_VTGenerator-13B.sh`.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/VTGenerator/train/finetune_VTGenerator-13B.sh 2>&1 | tee -a scripts/log/VTGenerator/finetune_VTGenerator-13B.txt
VTGenerator Inference (Visual Table Generation)
We have provided the visual table generation results of VTGenerator-13B in `data_VT.zip`:
- Visual tables for training images: `./playground/data_VT/train_images_gen_vt`
- Visual tables for evaluation images: `./playground/data_VT/eval_images_gen_vt`
If you want to reproduce the inference of VTGenerator, you can use the following scripts:
For images used in downstream task training:
Details are provided in `./scripts/VTGenerator/infer/train_images_gen_vt.sh`.
# infer VTGenerator-13B on llava_instruct_mix665k
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/VTGenerator/infer/train_images_gen_vt/llava_instruct_mix665k_coco_gen_vt.sh VTGenerator-13B
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/VTGenerator/infer/train_images_gen_vt/llava_instruct_mix665k_ocrvqa_gen_vt.sh VTGenerator-13B
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/VTGenerator/infer/train_images_gen_vt/llava_instruct_mix665k_textcap_gen_vt.sh VTGenerator-13B
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/VTGenerator/infer/train_images_gen_vt/llava_instruct_mix665k_vg_gen_vt.sh VTGenerator-13B
# merge the inference results & store VTGenerator-13B_VT_292k.json & store llava_instruct_mix665k_with_VT.json
python ./scripts/VTGenerator/infer/train_images_gen_vt/merge_llava_instruct_mix665k_all_gen_vt.py \
--gen_VT_path './playground/data_VT/train_images_gen_vt/VTGenerator-13B_VT_292k.json' \
--llava_instruct_mix665k_path '/path/to/liuhaotian/LLaVA-Instruct-150K/llava_v1_5_mix665k.json' \
--image_path './playground/data' \
--llava_instruct_mix665k_with_VT './playground/data_VT/train_LLaVA-VT/llava_instruct_mix665k_with_VT.json'
For images used in downstream task evaluation:
Details are provided in `./scripts/VTGenerator/infer/eval_images_gen_vt.sh`.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/VTGenerator/infer/eval_images_gen_vt.sh VTGenerator-13B
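Beyond the batched scripts above, you can also generate a visual table for a single image through the LLaVA codebase that this repo builds on. The sketch below follows LLaVA's `eval_model` helper; the instruction text, generation settings, and local paths are assumptions, so check `scripts/VTGenerator/infer/` for the settings actually used in our experiments.

```python
# Single-image sketch using LLaVA's eval_model helper (this repo builds on LLaVA).
# The query string and generation settings below are illustrative assumptions.
from llava.eval.run_llava import eval_model
from llava.mm_utils import get_model_name_from_path

model_path = "/path/to/VTGenerator-13B"  # local copy of the Huggingface checkpoint

args = type("Args", (), {
    "model_path": model_path,
    "model_base": None,
    # Note: LLaVA's loader keys off the model name, so the local folder name should contain
    # "llava", or you may need to pass a suitable model_name explicitly.
    "model_name": get_model_name_from_path(model_path),
    "query": "Please generate the visual table for this image.",  # illustrative instruction
    "conv_mode": None,
    "image_file": "docs/teaser_figure.png",
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 1024,
})()

eval_model(args)  # prints the generated visual table to stdout
```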
LLaVA-VT
We choose image reasoning tasks, such as multi-modal LLM (MLLM) benchmarks, as the testbed for our generated visual tables. Below, we provide scripts for training an MLLM that takes visual tables as additional inputs, as well as scripts for its evaluation. We refer to this VQA model as LLaVA-VT in the following descriptions.
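As a rough illustration of this data flow (not the exact prompt template used in our scripts), a generated visual table is serialized to text and supplied alongside the question when querying the model:

```python
# Rough illustration only: the actual prompt template used by LLaVA-VT may differ.
def build_prompt(visual_table_text: str, question: str) -> str:
    """Combine a generated visual table (as text) with the user question."""
    return (
        "Here is a visual table describing the image:\n"
        f"{visual_table_text}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    visual_table_text='{"scene description": "...", "objects": [...]}',
    question="What is the person in the image doing?",
)
print(prompt)
```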
LLaVA-VT Training
For quick usage, we have provided the checkpoints of our LLaVA-VT as follows:
Model | Download Link |
---|---|
LLaVA-VT-13B (Preferred) | Huggingface |
LLaVA-VT-7B | Huggingface |
If you want to reproduce the training of LLaVA-VT, you can use the following scripts:
Details are provided in `./scripts/LLaVA-VT/train/finetune_LLaVA-VT-13B.sh`.
mkdir -p scripts/log/LLaVA-VT
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/LLaVA-VT/train/finetune_LLaVA-VT-13B.sh 2>&1 | tee -a scripts/log/LLaVA-VT/finetune_LLaVA-VT-13B.txt
LLaVA-VT Evaluation
<p align="center"> <img src="docs/table1.png" width=97% height=97% class="center"> </p>

Notes: Before running the evaluation scripts, please:
- Download the evaluation data and `eval.zip` (following LLaVA/docs/Evaluation.md).
- Use the provided visual tables for each dataset from `./playground/data_VT/eval_images_gen_vt`, or run `./scripts/VTGenerator/infer/eval_images_gen_vt.sh` to generate the visual tables for each evaluation dataset.
- Note that GPT-assisted evaluation is applied for mmvet, llavabench, and mmmu. More details are provided in the corresponding evaluation scripts; please also refer to Sec. 4.1 (Comparison Experiments, Setup) in our paper.
Details of the evaluation scripts (Table 1 in the main paper) are provided in `./scripts/LLaVA-VT/eval/eval_multi_datasets_with_VT.sh`.
VTGenerator="VTGenerator-13B"
Model="LLaVA-VT-13B"
mkdir -p scripts/log/eval_multi_datasets_with_VT
<details>
<summary>Click for more details... </summary>
Evaluation on mmvet with visual table:
CUDA_VISIBLE_DEVICES=0 bash scripts/LLaVA-VT/eval/mmvet/mmvet.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/mmvet_with_${VTGenerator}_gen_vt_${Model}.txt
Evaluation on llavabench with visual table:
CUDA_VISIBLE_DEVICES=0 bash scripts/LLaVA-VT/eval/llavabench/llavabench.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/llavabench_with_${VTGenerator}_gen_vt_${Model}.txt
Evaluation on mmmu with visual table:
CUDA_VISIBLE_DEVICES=0 bash scripts/LLaVA-VT/eval/mmmu/mmmu.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/mmmu_with_${VTGenerator}_gen_vt_${Model}.txt
Evaluation on mmbench with visual table:
CUDA_VISIBLE_DEVICES=0 bash scripts/LLaVA-VT/eval/mmbench/mmbench.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/mmbench_with_${VTGenerator}_gen_vt_${Model}.txt
Evaluation on mmvp with visual table:
CUDA_VISIBLE_DEVICES=0 bash scripts/LLaVA-VT/eval/mmvp_mc/mmvp_mc.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/mmvp_mc_with_${VTGenerator}_gen_vt_${Model}.txt
Evaluation on pope with visual table:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/LLaVA-VT/eval/pope/pope.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/pope_with_${VTGenerator}_gen_vt_${Model}.txt
Evaluation on vizwiz with visual table:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/LLaVA-VT/eval/vizwiz/vizwiz.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/vizwiz_with_${VTGenerator}_gen_vt_${Model}.txt
Evaluation on scienceqa with visual table:
CUDA_VISIBLE_DEVICES=0 bash scripts/LLaVA-VT/eval/scienceqa/scienceqa.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/scienceqa_with_${VTGenerator}_gen_vt_${Model}.txt
Evaluation on gqa with visual table:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/LLaVA-VT/eval/gqa/gqa.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/gqa_full_with_${VTGenerator}_gen_vt_${Model}.txt
Evaluation on vqav2 with visual table:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/LLaVA-VT/eval/vqav2/vqav2_dev.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/vqav2_dev_with_${VTGenerator}_gen_vt_${Model}.txt
Evaluation on textvqa with visual table:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/LLaVA-VT/eval/textvqa/textvqa.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/textvqa_with_${VTGenerator}_gen_vt_${Model}.txt
</details>
Citation
If you find this repo useful, please consider citing our paper:
@article{zhong2024beyond,
title={Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models},
author={Zhong, Yiwu and Hu, Zi-Yuan and Lyu, Michael R and Wang, Liwei},
journal={arXiv preprint arXiv:2403.18252},
year={2024}
}
Acknowledgement
- LLaVA: the codebase we build upon.