# VEGA<img src="assets/lyra.png" alt="Icon" width="25" height="25">: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
Project Page | Paper | Dataset
We introduce a new multimodal task named Interleaved Image-Text Comprehension (IITC), designed to evaluate a model's capability to handle interleaved image-text inputs that contain redundant and misleading information. To enhance and measure model performance on the IITC task, we developed the VEGA dataset. By fine-tuning Qwen-VL-Chat on the VEGA dataset, we created VEGA-Base, a strong baseline for the IITC task.
## Dataset Structure
- The VEGA dataset covers two tasks (IITC and ITA), with approximately 593,000 training examples and 2,326 test examples. You can download VEGA here.
- Unzip imgs.zip to obtain the following directory layout:
```
.
├── datas
│   ├── IITC_4k_test.json
│   ├── IITC_4k_train.json
│   ├── IITC_8k_test.json
│   ├── IITC_8k_train.json
│   ├── ITA_3picture_C_train.json
│   ├── ITA_3picture_E_train.json
│   ├── ITA_3picture_F_train.json
│   ├── ITA_3picture_test.json
│   ├── ITA_5picture_C_train.json
│   ├── ITA_5picture_E_train.json
│   ├── ITA_5picture_F_train.json
│   └── ITA_5picture_test.json
├── imgs
│   ├── test_imgs
│   │   ├── 1001.0025v1
│   │   │   └── pdferror.png
│   │   ├── 1001.0357v1
│   │   │   └── Different_Capacity_regions_2dB.png
│   │   ...
│   └── train_imgs
│       ...
```
Each entry in the IITC*.json files has the following format:
```json
{
  "id": "The paper's ID on arXiv.",
  "title": "The paper's title.",
  "caption": "The caption of the correct image.",
  "context": "Interleaved image-text input.",
  "question": "Question about a specific image.",
  "answer": "The answer.",
  "image_paths": "List of image paths.",
  "truth_fig_idx": "Index of the correct image in image_paths."
}
```
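For reference, a minimal sketch of reading an IITC split with the standard json module (assuming each file is a top-level JSON list whose entries follow the schema above):
```python
import json

# Load the IITC test split (path follows the directory tree above).
# Assumption: the file is a top-level JSON list of records with the fields listed above.
with open("datas/IITC_4k_test.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

sample = samples[0]
print(sample["id"], "-", sample["title"])
print("Question:", sample["question"])
print("Ground-truth image:", sample["image_paths"][sample["truth_fig_idx"]])
```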
Each entry in the ITA*.json files has the following format:
```json
{
  "id": "List of paper IDs on arXiv.",
  "image_paths": "List of image paths.",
  "context": "Interleaved image-text input.",
  "answer": "The answer."
}
```
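A similar sketch for the ITA files, here used to sanity-check that every referenced image exists on disk (again assuming a top-level JSON list, with image paths relative to the unzipped imgs/ folder):
```python
import json
from pathlib import Path

# Assumption: ITA files are top-level JSON lists and image paths are relative to imgs/.
with open("datas/ITA_3picture_test.json", "r", encoding="utf-8") as f:
    samples = json.load(f)

missing = [p for s in samples for p in s["image_paths"]
           if not (Path("imgs") / p).exists()]
total = sum(len(s["image_paths"]) for s in samples)
print(f"{len(missing)} of {total} referenced images are missing")
```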
In the "context" field of every JSON file, an image is represented as `Picture id: <img>img_path</img>\n`, where `id` indicates the position of the image in the conversation, starting from 1. For example:
```json
{
  "context": "...The result illustrated in Figure~6[Picture 1] shows that the proposed network extracting patches features separately performs significantly better than previous methods extracting patches feature together.\nPicture 1: <img>test_imgs/1803.06598v1/Figs/stack_LAN.png</img>\nFigure. 6 Picture 2: <img>test_imgs/1803.06598v1/Figs/SIR_VS_CR_curve.png</img>\nFigure. 7..."
}
```
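A short sketch of how these placeholders could be pulled out of a "context" string with a regular expression; the pattern simply mirrors the `Picture id: <img>img_path</img>` convention described above:
```python
import re

# Matches the "Picture id: <img>img_path</img>" placeholders described above.
IMG_PATTERN = re.compile(r"Picture (\d+): <img>(.*?)</img>")

def extract_images(context: str) -> list[tuple[int, str]]:
    """Return (picture_index, image_path) pairs in order of appearance."""
    return [(int(idx), path) for idx, path in IMG_PATTERN.findall(context)]

context = ("Picture 1: <img>test_imgs/1803.06598v1/Figs/stack_LAN.png</img>\n"
           "Figure. 6 Picture 2: <img>test_imgs/1803.06598v1/Figs/SIR_VS_CR_curve.png</img>\n")
print(extract_images(context))
# [(1, 'test_imgs/1803.06598v1/Figs/stack_LAN.png'),
#  (2, 'test_imgs/1803.06598v1/Figs/SIR_VS_CR_curve.png')]
```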
## Evaluation
```bash
git clone https://github.com/zhourax/VEGA
cd VEGA
pip install nltk
pip install rouge
```
After configuring your model in eval/IITC.py and eval/ITA.py, run:
```bash
bash eval/IITC.sh
bash eval/ITA.sh
```
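For reference, a minimal sketch of scoring a generated answer against a reference answer with the rouge package installed above (the authoritative metric computation is the one implemented in eval/IITC.py and eval/ITA.py):
```python
from rouge import Rouge

# Score a model answer against the ground-truth answer with ROUGE-L.
rouge = Rouge()

hypothesis = "Extracting patch features separately performs better than extracting them together."
reference = "The proposed network extracting patches features separately performs significantly better."

scores = rouge.get_scores(hypothesis, reference)[0]
print("ROUGE-L F1:", scores["rouge-l"]["f"])
```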
## Citation
```bibtex
@misc{zhou2024vegalearninginterleavedimagetext,
      title={VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models},
      author={Chenyu Zhou and Mengdan Zhang and Peixian Chen and Chaoyou Fu and Yunhang Shen and Xiawu Zheng and Xing Sun and Rongrong Ji},
      year={2024},
      eprint={2406.10228},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.10228},
}
```