Visual Table

This is the official PyTorch implementation of Visual Table (EMNLP 2024).

Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models [Paper] <br> Yiwu Zhong<sup>*1</sup>, Zi-Yuan Hu<sup>*1,2</sup>, Michael R. Lyu<sup>1</sup>, Liwei Wang<sup>#1</sup> <br> <sup>1</sup>The Chinese University of Hong Kong, <sup>2</sup>Shanghai AI Laboratory <br> (<sup>*</sup> equal contributions, <sup>#</sup> corresponding author) <br>

<p align="center"> <img src="docs/teaser_figure.png" width=97% height=97% class="center"> </p>

Overview

Visual representation learning has been a cornerstone in computer vision, involving typical forms such as visual embeddings, structural symbols, and text-based representations. Despite the success of CLIP-type visual embeddings, they often lack access to world knowledge critical for visual reasoning. In this work, we propose Visual Table, a novel form of visual representation tailored for general visual reasoning.

Updates

Install

Note: For more details about the environment installation, please refer to LLaVA.

  1. Clone this repository and navigate to the Visual-Table folder
git clone https://github.com/LaVi-Lab/Visual-Table
cd Visual-Table
  2. Install Package
conda create -n vt python=3.10 -y
conda activate vt
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
  3. Install additional packages for training cases
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
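
To confirm the environment is set up correctly, a quick sanity check such as the following can help (it assumes the editable install exposes the llava package, as in the upstream LLaVA codebase):

# optional sanity check: PyTorch and the llava package should import cleanly
python -c "import torch, llava; print('torch', torch.__version__, '| cuda available:', torch.cuda.is_available())"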

Data & Weight Preparation

  1. Download data_VT.zip from Huggingface or Google Drive.

    Notes:

    1. Unzip data_VT.zip to ./playground/data_VT. This zip file contains all materials required for training and evaluating the visual table generator and our multi-modal LLMs. The visual table annotations have already been converted into instruction tuning samples for training the generator. (A download-and-unzip sketch is provided at the end of this section.)
    2. If you would like to use the visual table annotations only, you can simply download the file visual_table_annotations.json. This file includes 61K visual table annotations collected from GPT4V, as well as a few visual tables from other datasets for ablation purposes (e.g., GQA, MM-Vet, MMMU, MMVP). The annotation script can be found in preprocess/collect_gpt4v_VT/gpt4v.py.

    The following is the file structure for your convenience:

    <details> <summary>Click for more details... </summary>
    ./playground/data_VT
    ā”œā”€ā”€ eval
    ā”‚   ā”œā”€ā”€ gqa
    ā”‚   ā”‚   ā”œā”€ā”€ answers
    ā”‚   ā”‚   ā”‚   ā””ā”€ā”€ gqa_with_VTGenerator-13B_gen_vt
    ā”‚   ā”‚   ā”‚       ā””ā”€ā”€ LLaVA-VT-13B
    ā”‚   ā”‚   ā”‚           ā”œā”€ā”€ merge.jsonl
    ā”‚   ā”‚   ā”‚           ā””ā”€ā”€ testdev_balanced_predictions.json
    ā”‚   ā”‚   ā”œā”€ā”€ data
    ā”‚   ā”‚   ā””ā”€ā”€ gqa_with_VTGenerator-13B_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ llava-bench-in-the-wild
    ā”‚   ā”‚   ā”œā”€ā”€ answers
    ā”‚   ā”‚   ā”‚   ā””ā”€ā”€ llavabench_with_VTGenerator-13B_gen_vt
    ā”‚   ā”‚   ā”‚       ā””ā”€ā”€ LLaVA-VT-13B
    ā”‚   ā”‚   ā”‚           ā””ā”€ā”€ LLaVA-VT-13B.jsonl
    ā”‚   ā”‚   ā”œā”€ā”€ llavabench_with_VTGenerator-13B_gen_vt.jsonl
    ā”‚   ā”‚   ā””ā”€ā”€ reviews
    ā”‚   ā”‚       ā””ā”€ā”€ llavabench_with_VTGenerator-13B_gen_vt
    ā”‚   ā”‚           ā””ā”€ā”€ LLaVA-VT-13B
    ā”‚   ā”‚               ā””ā”€ā”€ LLaVA-VT-13B_gpt-3.5-turbo-1106.jsonl
    ā”‚   ā”œā”€ā”€ mmbench
    ā”‚   ā”‚   ā”œā”€ā”€ answers
    ā”‚   ā”‚   ā”‚   ā””ā”€ā”€ mmbench_with_VTGenerator-13B_gen_vt
    ā”‚   ā”‚   ā”‚       ā””ā”€ā”€ LLaVA-VT-13B
    ā”‚   ā”‚   ā”‚           ā””ā”€ā”€ LLaVA-VT-13B.jsonl
    ā”‚   ā”‚   ā””ā”€ā”€ answers_upload
    ā”‚   ā”‚       ā””ā”€ā”€ mmbench_with_VTGenerator-13B_gen_vt
    ā”‚   ā”‚           ā””ā”€ā”€ LLaVA-VT-13B
    ā”‚   ā”‚               ā””ā”€ā”€ LLaVA-VT-13B.xlsx
    ā”‚   ā”œā”€ā”€ mmmu
    ā”‚   ā”‚   ā”œā”€ā”€ answers
    ā”‚   ā”‚   ā”‚   ā””ā”€ā”€ mmmu_with_VTGenerator-13B_gen_vt
    ā”‚   ā”‚   ā”‚       ā””ā”€ā”€ LLaVA-VT-13B
    ā”‚   ā”‚   ā”‚           ā”œā”€ā”€ gpt_eval_gpt-3.5-turbo-1106
    ā”‚   ā”‚   ā”‚           ā””ā”€ā”€ mmmu_with_VTGenerator-13B_gen_vt_LLaVA-VT-13B.jsonl
    ā”‚   ā”‚   ā”œā”€ā”€ mmmu.jsonl
    ā”‚   ā”‚   ā””ā”€ā”€ mmmu_with_VTGenerator-13B_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ mm-vet
    ā”‚   ā”‚   ā”œā”€ā”€ mmvet_with_VTGenerator-13B_gen_vt
    ā”‚   ā”‚   ā”‚   ā”œā”€ā”€ answers
    ā”‚   ā”‚   ā”‚   ā”‚   ā””ā”€ā”€ LLaVA-VT-13B
    ā”‚   ā”‚   ā”‚   ā”‚       ā””ā”€ā”€ mmvet_with_VTGenerator-13B_gen_vt_LLaVA-VT-13B.jsonl
    ā”‚   ā”‚   ā”‚   ā””ā”€ā”€ results
    ā”‚   ā”‚   ā”‚       ā””ā”€ā”€ LLaVA-VT-13B
    ā”‚   ā”‚   ā”‚           ā”œā”€ā”€ mmvet_with_VTGenerator-13B_gen_vt_LLaVA-VT-13B_gpt-4-32k-0613-cap-int-score-1runs.csv
    ā”‚   ā”‚   ā”‚           ā”œā”€ā”€ mmvet_with_VTGenerator-13B_gen_vt_LLaVA-VT-13B_gpt-4-32k-0613-cap-score-1runs.csv
    ā”‚   ā”‚   ā”‚           ā”œā”€ā”€ mmvet_with_VTGenerator-13B_gen_vt_LLaVA-VT-13B_gpt-4-32k-0613-grade-1runs.json
    ā”‚   ā”‚   ā”‚           ā””ā”€ā”€ mmvet_with_VTGenerator-13B_gen_vt_LLaVA-VT-13B.json
    ā”‚   ā”‚   ā””ā”€ā”€ mmvet_with_VTGenerator-13B_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ mmvp_mc
    ā”‚   ā”‚   ā”œā”€ā”€ answers
    ā”‚   ā”‚   ā”‚   ā””ā”€ā”€ mmvp_mc_with_VTGenerator-13B_gen_vt
    ā”‚   ā”‚   ā”‚       ā””ā”€ā”€ LLaVA-VT-13B
    ā”‚   ā”‚   ā”‚           ā””ā”€ā”€ mmvp_mc_with_VTGenerator-13B_gen_vt_LLaVA-VT-13B.jsonl
    ā”‚   ā”‚   ā”œā”€ā”€ mmvp_mc.jsonl
    ā”‚   ā”‚   ā””ā”€ā”€ mmvp_mc_with_VTGenerator-13B_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ pope
    ā”‚   ā”‚   ā”œā”€ā”€ answers
    ā”‚   ā”‚   ā”‚   ā””ā”€ā”€ pope_with_VTGenerator-13B_gen_vt
    ā”‚   ā”‚   ā”‚       ā””ā”€ā”€ LLaVA-VT-13B
    ā”‚   ā”‚   ā”‚           ā””ā”€ā”€ merge.jsonl
    ā”‚   ā”‚   ā”œā”€ā”€ pope_coco_commitID_e3e39262c85a6a83f26cf5094022a782cb0df58d
    ā”‚   ā”‚   ā”‚   ā”œā”€ā”€ coco_pope_adversarial.json
    ā”‚   ā”‚   ā”‚   ā”œā”€ā”€ coco_pope_popular.json
    ā”‚   ā”‚   ā”‚   ā””ā”€ā”€ coco_pope_random.json
    ā”‚   ā”‚   ā””ā”€ā”€ pope_with_VTGenerator-13B_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ scienceqa
    ā”‚   ā”‚   ā”œā”€ā”€ answers
    ā”‚   ā”‚   ā”‚   ā””ā”€ā”€ scienceqa_with_VTGenerator-13B_gen_vt
    ā”‚   ā”‚   ā”‚       ā””ā”€ā”€ LLaVA-VT-13B
    ā”‚   ā”‚   ā”‚           ā”œā”€ā”€ LLaVA-VT-13B.jsonl
    ā”‚   ā”‚   ā”‚           ā”œā”€ā”€ LLaVA-VT-13B_output.jsonl
    ā”‚   ā”‚   ā”‚           ā””ā”€ā”€ LLaVA-VT-13B_result.json
    ā”‚   ā”‚   ā””ā”€ā”€ scienceqa_with_VTGenerator-13B_gen_vt.json
    ā”‚   ā”œā”€ā”€ textvqa
    ā”‚   ā”‚   ā”œā”€ā”€ answers
    ā”‚   ā”‚   ā”‚   ā””ā”€ā”€ textvqa_with_VTGenerator-13B_gen_vt
    ā”‚   ā”‚   ā”‚       ā””ā”€ā”€ LLaVA-VT-13B
    ā”‚   ā”‚   ā”‚           ā””ā”€ā”€ merge.jsonl
    ā”‚   ā”‚   ā””ā”€ā”€ textvqa_with_VTGenerator-13B_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ vizwiz
    ā”‚   ā”‚   ā”œā”€ā”€ answers
    ā”‚   ā”‚   ā”‚   ā””ā”€ā”€ vizwiz_with_VTGenerator-13B_gen_vt
    ā”‚   ā”‚   ā”‚       ā””ā”€ā”€ LLaVA-VT-13B
    ā”‚   ā”‚   ā”‚           ā””ā”€ā”€ merge.jsonl
    ā”‚   ā”‚   ā”œā”€ā”€ answers_upload
    ā”‚   ā”‚   ā”‚   ā””ā”€ā”€ vizwiz_with_VTGenerator-13B_gen_vt
    ā”‚   ā”‚   ā”‚       ā””ā”€ā”€ LLaVA-VT-13B
    ā”‚   ā”‚   ā”‚           ā””ā”€ā”€ LLaVA-VT-13B.json
    ā”‚   ā”‚   ā””ā”€ā”€ vizwiz_with_VTGenerator-13B_gen_vt.jsonl
    ā”‚   ā””ā”€ā”€ vqav2
    ā”‚       ā”œā”€ā”€ answers_upload
    ā”‚       ā”‚   ā””ā”€ā”€ vqav2_dev_with_VTGenerator-13B_gen_vt
    ā”‚       ā”‚       ā””ā”€ā”€ LLaVA-VT-13B.json
    ā”‚       ā””ā”€ā”€ vqav2_dev_with_VTGenerator-13B_gen_vt.jsonl
    ā”œā”€ā”€ eval_images_gen_vt
    ā”‚   ā”œā”€ā”€ gqa_gen_vt
    ā”‚   ā”‚   ā””ā”€ā”€ VTGenerator-13B
    ā”‚   ā”‚       ā””ā”€ā”€ merge.jsonl
    ā”‚   ā”œā”€ā”€ gqa_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ llavabench_gen_vt
    ā”‚   ā”‚   ā””ā”€ā”€ VTGenerator-13B
    ā”‚   ā”‚       ā””ā”€ā”€ merge.jsonl
    ā”‚   ā”œā”€ā”€ llavabench_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ mmbench_gen_vt
    ā”‚   ā”‚   ā””ā”€ā”€ VTGenerator-13B
    ā”‚   ā”‚       ā””ā”€ā”€ merge.jsonl
    ā”‚   ā”œā”€ā”€ mmbench_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ mmmu_gen_vt
    ā”‚   ā”‚   ā””ā”€ā”€ VTGenerator-13B
    ā”‚   ā”‚       ā””ā”€ā”€ merge.jsonl
    ā”‚   ā”œā”€ā”€ mmmu_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ mmvet_gen_vt
    ā”‚   ā”‚   ā””ā”€ā”€ VTGenerator-13B
    ā”‚   ā”‚       ā””ā”€ā”€ merge.jsonl
    ā”‚   ā”œā”€ā”€ mmvet_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ mmvp_gen_vt
    ā”‚   ā”‚   ā””ā”€ā”€ VTGenerator-13B
    ā”‚   ā”‚       ā””ā”€ā”€ merge.jsonl
    ā”‚   ā”œā”€ā”€ mmvp_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ pope_gen_vt
    ā”‚   ā”‚   ā””ā”€ā”€ VTGenerator-13B
    ā”‚   ā”‚       ā””ā”€ā”€ merge.jsonl
    ā”‚   ā”œā”€ā”€ pope_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ scienceqa_gen_vt
    ā”‚   ā”‚   ā””ā”€ā”€ VTGenerator-13B
    ā”‚   ā”‚       ā””ā”€ā”€ merge.jsonl
    ā”‚   ā”œā”€ā”€ scienceqa_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ textvqa_gen_vt
    ā”‚   ā”‚   ā””ā”€ā”€ VTGenerator-13B
    ā”‚   ā”‚       ā””ā”€ā”€ merge.jsonl
    ā”‚   ā”œā”€ā”€ textvqa_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ vizwiz_gen_vt
    ā”‚   ā”‚   ā””ā”€ā”€ VTGenerator-13B
    ā”‚   ā”‚       ā””ā”€ā”€ merge.jsonl
    ā”‚   ā”œā”€ā”€ vizwiz_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ vqav2_gen_vt
    ā”‚   ā”‚   ā””ā”€ā”€ VTGenerator-13B
    ā”‚   ā”‚       ā””ā”€ā”€ merge.jsonl
    ā”‚   ā””ā”€ā”€ vqav2_gen_vt.jsonl
    ā”œā”€ā”€ gpt_eval
    ā”‚   ā”œā”€ā”€ gqa
    ā”‚   ā”‚   ā”œā”€ā”€ gqa.jsonl
    ā”‚   ā”‚   ā””ā”€ā”€ gqa_with_gpt4v_vt.jsonl
    ā”‚   ā”œā”€ā”€ mmmu
    ā”‚   ā”‚   ā”œā”€ā”€ mmmu.jsonl
    ā”‚   ā”‚   ā””ā”€ā”€ mmmu_with_gpt4v_vt.jsonl
    ā”‚   ā”œā”€ā”€ mmvet
    ā”‚   ā”‚   ā”œā”€ā”€ mmvet.jsonl
    ā”‚   ā”‚   ā””ā”€ā”€ mmvet_with_gpt4v_vt.jsonl
    ā”‚   ā””ā”€ā”€ mmvp
    ā”‚       ā”œā”€ā”€ mmvp.jsonl
    ā”‚       ā””ā”€ā”€ mmvp_with_gpt4v_vt.jsonl
    ā”œā”€ā”€ README.md
    ā”œā”€ā”€ train_images_gen_vt
    ā”‚   ā”œā”€ā”€ llava_instruct_mix665k_coco_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ llava_instruct_mix665k_ocrvqa_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ llava_instruct_mix665k_textcap_gen_vt.jsonl
    ā”‚   ā”œā”€ā”€ llava_instruct_mix665k_vg_gen_vt.jsonl
    ā”‚   ā””ā”€ā”€ VTGenerator-13B_VT_292k.json
    ā”œā”€ā”€ train_LLaVA-VT
    ā”‚   ā””ā”€ā”€ llava_instruct_mix665k_with_VT.json
    ā””ā”€ā”€ train_VTGenerator
        ā”œā”€ā”€ finetune_VTGenerator_gpt4v_VT_61k.json
        ā””ā”€ā”€ pretrain_VTGenerator_llava_instruct_mix199k.json
    
    </details>
  2. Following LLaVA, download the data for visual instruction tuning.

    The following is the file structure for your convenience:

    <details> <summary>Click for more details... </summary>
    ./playground/data
    ā”œā”€ā”€ coco
    ā”‚   ā””ā”€ā”€ train2017
    ā”œā”€ā”€ gqa
    ā”‚   ā””ā”€ā”€ images
    ā”œā”€ā”€ ocr_vqa
    ā”‚   ā””ā”€ā”€ images
    ā”œā”€ā”€ textvqa
    ā”‚   ā””ā”€ā”€ train_images
    ā””ā”€ā”€ vg
        ā”œā”€ā”€ VG_100K
        ā””ā”€ā”€ VG_100K_2
    
    </details>
  3. Following LLaVA/docs/Evaluation.md, download the data and eval.zip for evaluation.

    For MMMU and MMVP, download the data from their official repos.

  4. Download the pretrained model weights:

    Note: you need to modify the paths to these pretrained model weights in the training scripts.

    | Pretrained Model Weight | Download Link |
    | --- | --- |
    | lmsys/vicuna-13b-v1.5 | Huggingface |
    | openai/clip-vit-large-patch14-336 | Huggingface |
    | liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5 | Huggingface |
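
For convenience, a minimal shell sketch of the data preparation in step 1 is shown below. It assumes data_VT.zip has already been downloaded into the repository root and that the archive unpacks into a data_VT/ folder matching the structure above:

# unzip the downloaded archive so that it lands at ./playground/data_VT
mkdir -p playground
unzip data_VT.zip -d playground/
# quick check that the expected top-level folders (eval, eval_images_gen_vt, ...) are present
ls playground/data_VT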

VTGenerator

After collecting the visual table annotations, we train a VTGenerator to produce visual tables for input images. Training consists of a pre-training stage and a fine-tuning stage, described below (see Sec. 3.2 of the paper for more details). Once trained, VTGenerator can generate visual tables for downstream tasks such as VQA.

VTGenerator Training

For quick usage, we have provided the checkpoints of our VTGenerator as follows:

| Model | Download Link |
| --- | --- |
| VTGenerator-13B (Preferred) | Huggingface |
| VTGenerator-7B | Huggingface |

If you want to reproduce the training of VTGenerator, you can use the following scripts:

Pretraining stage:

Details are provided in ./scripts/VTGenerator/train/pretrain_VTGenerator-Pretrained-13B.sh.

mkdir -p scripts/log/VTGenerator

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/VTGenerator/train/pretrain_VTGenerator-Pretrained-13B.sh 2>&1 | tee -a scripts/log/VTGenerator/pretrain_VTGenerator-Pretrained-13B.txt

Fine-tuning stage:

Details are provided in ./scripts/VTGenerator/train/finetune_VTGenerator-13B.sh.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/VTGenerator/train/finetune_VTGenerator-13B.sh 2>&1 | tee -a scripts/log/VTGenerator/finetune_VTGenerator-13B.txt

VTGenerator Inference (Visual Table Generation)

We have provided the visual table generation results of VTGenerator-13B in data_VT.zip:

Visual tables for training images: 
    ./playground/data_VT/train_images_gen_vt

Visual tables for evaluation images: 
    ./playground/data_VT/eval_images_gen_vt
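
These merge.jsonl files are in JSON Lines format, one generation record per line; a quick peek such as the following shows the record layout (mm-vet is used here only as an example):

# optional: inspect one provided visual table record
head -n 1 ./playground/data_VT/eval_images_gen_vt/mmvet_gen_vt/VTGenerator-13B/merge.jsonl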

If you want to reproduce the inference of VTGenerator, you can use the following scripts:

For images used in downstream task training:

Details are provided in ./scripts/VTGenerator/infer/train_images_gen_vt.sh.

# infer VTGenerator-13B on llava_instruct_mix665k
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/VTGenerator/infer/train_images_gen_vt/llava_instruct_mix665k_coco_gen_vt.sh VTGenerator-13B
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/VTGenerator/infer/train_images_gen_vt/llava_instruct_mix665k_ocrvqa_gen_vt.sh VTGenerator-13B
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/VTGenerator/infer/train_images_gen_vt/llava_instruct_mix665k_textcap_gen_vt.sh VTGenerator-13B
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/VTGenerator/infer/train_images_gen_vt/llava_instruct_mix665k_vg_gen_vt.sh VTGenerator-13B 

# merge the inference results & store VTGenerator-13B_VT_292k.json & store llava_instruct_mix665k_with_VT.json
python ./scripts/VTGenerator/infer/train_images_gen_vt/merge_llava_instruct_mix665k_all_gen_vt.py \
    --gen_VT_path './playground/data_VT/train_images_gen_vt/VTGenerator-13B_VT_292k.json' \
    --llava_instruct_mix665k_path '/path/to/liuhaotian/LLaVA-Instruct-150K/llava_v1_5_mix665k.json' \
    --image_path './playground/data' \
    --llava_instruct_mix665k_with_VT './playground/data_VT/train_LLaVA-VT/llava_instruct_mix665k_with_VT.json' 
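
After the merge script finishes, a quick check such as the one below can confirm that the two output files were written (a hypothetical check; it assumes both files are top-level JSON lists of samples):

# optional: count the merged visual tables and the resulting instruction-tuning samples
python -c "import json; print(len(json.load(open('./playground/data_VT/train_images_gen_vt/VTGenerator-13B_VT_292k.json'))), 'generated visual tables')"
python -c "import json; print(len(json.load(open('./playground/data_VT/train_LLaVA-VT/llava_instruct_mix665k_with_VT.json'))), 'instruction samples with VT')"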

For images used in downstream task evaluation:

Details are provided in ./scripts/VTGenerator/infer/eval_images_gen_vt.sh.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/VTGenerator/infer/eval_images_gen_vt.sh VTGenerator-13B
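
Each evaluation dataset ends up with its own merge.jsonl of generated visual tables under ./playground/data_VT/eval_images_gen_vt/<dataset>_gen_vt/VTGenerator-13B/ (see the file structure above). A one-liner like the following can spot-check how many tables were generated per dataset:

# optional: count the generated visual tables for each evaluation dataset
wc -l ./playground/data_VT/eval_images_gen_vt/*_gen_vt/VTGenerator-13B/merge.jsonl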

LLaVA-VT

We choose image reasoning tasks, such as the benchmarks for multi-modal LLMs (MLLMs), as the testbed for our generated visual tables. Below, we provide the scripts for training an MLLM that takes visual tables as inputs, as well as the scripts for its evaluation. We refer to this VQA model as LLaVA-VT in the following descriptions.

LLaVA-VT Training

For quick usage, we have provided the checkpoints of our LLaVA-VT as follows:

| Model | Download Link |
| --- | --- |
| LLaVA-VT-13B (Preferred) | Huggingface |
| LLaVA-VT-7B | Huggingface |

If you want to reproduce the training of LLaVA-VT, you can use the following scripts:

Details are provided in ./scripts/LLaVA-VT/train/finetune_LLaVA-VT-13B.sh.

mkdir -p scripts/log/LLaVA-VT
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/LLaVA-VT/train/finetune_LLaVA-VT-13B.sh 2>&1 | tee -a scripts/log/LLaVA-VT/finetune_LLaVA-VT-13B.txt

LLaVA-VT Evaluation

<p align="center"> <img src="docs/table1.png" width=97% height=97% class="center"> </p>

Notes: Before running the evaluation scripts, please:

  1. Download the evaluation data and eval.zip (following LLaVA/docs/Evaluation.md).

  2. Use the provided visual tables for each dataset from ./playground/data_VT/eval_images_gen_vt, or run ./scripts/VTGenerator/infer/eval_images_gen_vt.sh to generate the visual tables for each evaluation dataset.

  3. GPT-assisted evaluation is applied for mmvet, llavabench, and mmmu. More details are provided in the corresponding evaluation scripts. Please also refer to Sec. 4.1 (Comparison Experiments, Setup) in our paper.

Details of the evaluation scripts (Table 1 in the main paper) are provided in ./scripts/LLaVA-VT/eval/eval_multi_datasets_with_VT.sh.

VTGenerator="VTGenerator-13B"
Model="LLaVA-VT-13B"

mkdir -p scripts/log/eval_multi_datasets_with_VT
<details> <summary>Click for more details... </summary>

Evaluation on mmvet with visual table:

CUDA_VISIBLE_DEVICES=0 bash scripts/LLaVA-VT/eval/mmvet/mmvet.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/mmvet_with_${VTGenerator}_gen_vt_${Model}.txt

Evaluation on llavabench with visual table:

CUDA_VISIBLE_DEVICES=0 bash scripts/LLaVA-VT/eval/llavabench/llavabench.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/llavabench_with_${VTGenerator}_gen_vt_${Model}.txt

Evaluation on mmmu with visual table:

CUDA_VISIBLE_DEVICES=0 bash scripts/LLaVA-VT/eval/mmmu/mmmu.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/mmmu_with_${VTGenerator}_gen_vt_${Model}.txt

Evaluation on mmbench with visual table:

CUDA_VISIBLE_DEVICES=0 bash scripts/LLaVA-VT/eval/mmbench/mmbench.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/mmbench_with_${VTGenerator}_gen_vt_${Model}.txt

Evaluation on mmvp with visual table:

CUDA_VISIBLE_DEVICES=0 bash scripts/LLaVA-VT/eval/mmvp_mc/mmvp_mc.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/mmvp_mc_with_${VTGenerator}_gen_vt_${Model}.txt

Evaluation on pope with visual table:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/LLaVA-VT/eval/pope/pope.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/pope_with_${VTGenerator}_gen_vt_${Model}.txt

Evaluation on vizwiz with visual table:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/LLaVA-VT/eval/vizwiz/vizwiz.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/vizwiz_with_${VTGenerator}_gen_vt_${Model}.txt

Evaluation on scienceqa with visual table:

CUDA_VISIBLE_DEVICES=0 bash scripts/LLaVA-VT/eval/scienceqa/scienceqa.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/scienceqa_with_${VTGenerator}_gen_vt_${Model}.txt

Evaluation on gqa with visual table:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/LLaVA-VT/eval/gqa/gqa.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/gqa_full_with_${VTGenerator}_gen_vt_${Model}.txt

Evaluation on vqav2 with visual table:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/LLaVA-VT/eval/vqav2/vqav2_dev.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/vqav2_dev_with_${VTGenerator}_gen_vt_${Model}.txt

Evaluation on textvqa with visual table:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/LLaVA-VT/eval/textvqa/textvqa.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/textvqa_with_${VTGenerator}_gen_vt_${Model}.txt
</details>
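
Rather than launching the commands above one by one, the single-GPU benchmarks can also be run in a simple loop that reuses the ${VTGenerator} and ${Model} variables defined above (a convenience sketch that follows the same script and log naming as the commands above):

# optional: loop over the single-GPU benchmarks listed above
for task in mmvet llavabench mmmu mmbench mmvp_mc scienceqa; do
    CUDA_VISIBLE_DEVICES=0 bash scripts/LLaVA-VT/eval/${task}/${task}.sh ${VTGenerator} ${Model} 2>&1 | tee -a scripts/log/eval_multi_datasets_with_VT/${task}_with_${VTGenerator}_gen_vt_${Model}.txt
done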

Citation

If you find this repo useful, please consider citing our paper:


@article{zhong2024beyond,
  title={Beyond Embeddings: The Promise of Visual Table in Multi-Modal Models},
  author={Zhong, Yiwu and Hu, Zi-Yuan and Lyu, Michael R and Wang, Liwei},
  journal={arXiv preprint arXiv:2403.18252},
  year={2024}
}

Acknowledgement