🌟 GITA: Graph to Image-Text Integration for Vision-Language Graph Reasoning 🌟

Welcome to the forefront of vision-language graph reasoning! We're thrilled to introduce this promising new topic that connects the VLM, reasoning, and graph communities. Dive in to explore our pioneering contributions!

🚀 Contribution 1: GVLQA Benchmark 🚀

Introducing the GVLQA Benchmark, the first-ever vision-language reasoning benchmark designed for general graph reasoning. This is a monumental step forward in the field! 🎉

🔗 Download Now: Access the GVLQA datasets from our Hugging Face Collection.

🤖 Contribution 2: GITA 7B/13B 🤖

Introducing GITA-7B/13B, a groundbreaking series of Vision-Language Models crafted specifically for vision-language graph reasoning. These models are expertly fine-tuned on the GVLQA datasets using the powerful LLaVA-1.5 backbone. 🎉

GITA-7B/13B are Pre-Trained Vision-Language Models with Graph Structural Understanding

GITA-7B/13B are pre-trained vision-language models uniquely equipped with graph structural understanding. Their ability to perceive and process graph structures distinguishes them as a robust starting point for any project requiring advanced graph reasoning capabilities.

Model Zoo

We include the fine-tuned weights of GITA-7B/13B (the LoRA adapter and projector) in the checkpoints/Vision_Text/GVLQA_BASE directory. They should be used together with LLaVA-v1.5, as we do in /llava/custom_eval/eval.py (line 201), where the load_pretrained_model method is invoked.

To conveniently use GITA-7B/13B as pre-trained models for downstream graph problems, we also offer a packed version, where the GITA modifications and the original LLaVA weights are packed into a single comprehensive model. Explore our Model Zoo for seamless access.
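For example, once you know the repo ID from the Model Zoo, the packed weights can be pulled with the Hugging Face CLI. Below is a minimal sketch; <HF_ORG>/GITA-7B and the target folder are placeholders rather than the actual repo ID, so substitute the one listed in the Model Zoo.

# Sketch: download the packed GITA-7B weights.
# <HF_ORG>/GITA-7B is a placeholder -- replace it with the repo ID from the Model Zoo.
pip install -U "huggingface_hub[cli]"
huggingface-cli download <HF_ORG>/GITA-7B --local-dir local_llm/GITA-7B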

๐Ÿ› ๏ธ Install

conda create -n gita python=3.10 -y
conda activate gita
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
pip install torch_geometric==2.5.3
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.0.1+cu117.html  # wheels built for torch 2.0.1 + CUDA 11.7
sudo apt install graphviz  # system Graphviz, used to render the visual graphs
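As an optional sanity check of the environment, the following commands should run without errors once the installation above has finished (a quick sketch, not part of the required setup):

# Verify that the key Python dependencies import and Graphviz is on the PATH.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import torch_geometric; print(torch_geometric.__version__)"
python -c "import flash_attn; print('flash-attn OK')"
dot -V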

📂 File Structure

Please organize the data as follows:

├── local_llm
│   ├── llava-v1.5-7b
│   ├── llava-v1.5-13b
│   ├── vicuna-v1.5-7b
│   └── vicuna-v1.5-13b
├── dataset
│   ├── GVLQA-BASE
│   ├── GVLQA-AUGET
│   ├── GVLQA-AUGLY
│   ├── GVLQA-AUGNO
│   ├── GVLQA-AUGNS
│   ├── NODECLS
│   ├── LINKPRED
│   └── ... (any custom datasets: apply GITA to existing graph data to generate their vision-language versions)
└── GITA
    ├── answer
    ├── checkpoints
    ├── fastchat
    ├── llava
    └── scripts
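If the LLaVA/Vicuna backbones are not already on disk, one way to populate local_llm is with the Hugging Face CLI. The sketch below assumes the standard upstream checkpoints (liuhaotian/llava-v1.5-* and lmsys/vicuna-*-v1.5); adjust the repo IDs and folder names to match your setup.

# Sketch: fetch the backbone weights into local_llm/ (repo IDs are assumptions).
mkdir -p local_llm dataset
huggingface-cli download liuhaotian/llava-v1.5-7b --local-dir local_llm/llava-v1.5-7b
huggingface-cli download liuhaotian/llava-v1.5-13b --local-dir local_llm/llava-v1.5-13b
huggingface-cli download lmsys/vicuna-7b-v1.5 --local-dir local_llm/vicuna-v1.5-7b
huggingface-cli download lmsys/vicuna-13b-v1.5 --local-dir local_llm/vicuna-v1.5-13b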

🔄 Reproduction

Before reproduction, please download the GVLQA datasets from Hugging Face. If you do not want to use visual-graph-based augmentations, downloading GVLQA-BASE is sufficient.
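For instance, with the Hugging Face CLI the base split can be fetched roughly as follows; <HF_ORG>/GVLQA-BASE is a placeholder, so use the actual repo ID from our Hugging Face Collection.

# Sketch: download GVLQA-BASE into dataset/ (placeholder repo ID).
huggingface-cli download <HF_ORG>/GVLQA-BASE --repo-type dataset --local-dir dataset/GVLQA-BASE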

๐Ÿ‹๏ธโ€โ™‚๏ธ Training

To reproduce the experimental results, you can run the scripts in the ./scripts folder, which includes training and evaluation scripts.

Step 1: GPU Configuration

Specify the gpu_ids in finetune_lora_loop.sh:

gpu_ids=(
    "0,1,2,3,4,5,6,7"
)

For a single GPU:

gpu_ids=(
    "0"
)

Step 2: Task Specification

Modify hyper_1 in finetune_lora_loop.sh:

Example for "cycle":

declare -a hyper_1=(
    "cycle"
)
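The seven GVLQA tasks used in the configuration tables below are connectivity, cycle, topology, shortest_path, flow, matching, and hamilton. To sweep several tasks in one run, list them all; this sketch assumes the script simply loops over the entries of hyper_1:

declare -a hyper_1=(
    "connectivity"
    "cycle"
    "topology"
    "shortest_path"
    "flow"
    "matching"
    "hamilton"
)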

Step 3: Hyperparameter Configuration

Specify the hyperparameters in hyper_2:

MODELSIZE  # Model size: 7b or 13b
EPOCH      # Number of training epochs, from {1,5,10,20,30,50}
BSZ        # Per-device train batch size, from {16,32}. Gradient accumulation keeps the effective total batch size fixed at 128 regardless of this value.
LORAR      # Rank of the low-rank matrices used in the LoRA adaptation
LORAALPHA  # Scaling factor that controls the magnitude of the low-rank adaptation
MODALTYPE  # Text_Only, Vision_Only, or Vision_Text (both image and text)
TASKTYPE   # GVLQA-BASE, GVLQA-AUGET, GVLQA-AUGLY, GVLQA-AUGNO, GVLQA-AUGNS; NODECLS; LINKPRED
UNFREEZEV  # Optional: whether to fine-tune the vision tower under Vision_Only or Vision_Text (True to unfreeze)
LAYOUTAUG  # Optional: whether to apply layout augmentation online (True to enable)
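For example, to reproduce the GITA-7B "cycle" setting from the tables below, hyper_2 holds the nine fields in the order above. This is a sketch that assumes hyper_2 is a bash array of space-separated strings, mirroring hyper_1:

declare -a hyper_2=(
    "7b 20 16 128 256 Vision_Text GVLQA-BASE False False"
)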

Refer to the following tables for exact configurations:

GITA-7B:

Task            hyper_2 Configuration
connectivity    7b 1 16 128 256 Vision_Text GVLQA-BASE False False
cycle           7b 20 16 128 256 Vision_Text GVLQA-BASE False False
topology        7b 10 16 128 256 Vision_Text GVLQA-BASE False False
shortest_path   7b 10 16 128 256 Vision_Text GVLQA-BASE False False
flow            7b 20 16 128 256 Vision_Text GVLQA-BASE False False
matching        7b 5 16 128 256 Vision_Text GVLQA-BASE False False
hamilton        7b 30 16 128 256 Vision_Text GVLQA-BASE False False

GITA-13B:

Task            hyper_2 Configuration
connectivity    13b 1 16 128 256 Vision_Text GVLQA-BASE False False
cycle           13b 10 16 128 256 Vision_Text GVLQA-BASE False False
topology        13b 10 16 128 256 Vision_Text GVLQA-BASE False False
shortest_path   13b 10 16 128 256 Vision_Text GVLQA-BASE False False
flow            13b 10 16 128 256 Vision_Text GVLQA-BASE False False
matching        13b 50 16 128 256 Vision_Text GVLQA-BASE False False
hamilton        13b 30 16 128 256 Vision_Text GVLQA-BASE False False

GITA-7B on GVLQA-AUGLY (Layout Augmentation):

Task            hyper_2 Configuration
connectivity    7b 10 16 64 16 Vision_Only GVLQA-AUGLY True False
cycle           7b 10 16 128 256 Vision_Only GVLQA-AUGLY False False
topology        7b 1 16 128 256 Vision_Only GVLQA-AUGLY False False
shortest_path   7b 20 16 128 256 Vision_Only GVLQA-AUGLY False False
flow            7b 1 16 64 16 Vision_Only GVLQA-AUGLY True False
matching        7b 20 16 128 256 Vision_Only GVLQA-AUGLY False False
hamilton        7b 30 16 64 16 Vision_Only GVLQA-AUGLY False False

Visual Graph Augmentation Variants (AUGNO, AUGNS, AUGET) and Modality Variants (Vision-Only):

Step 4: Execute Training

cd GITA
bash ./scripts/train/finetune_lora_loop.sh

🧪 Evaluation

Follow the same instructions as Training to specify gpu_ids, hyper_1, and hyper_2 in eval_loop.sh.

cd GITA
bash ./scripts/eval/eval_loop.sh

For zero-shot GITA, set EPOCH, BSZ, LORAR, UNFREEZEV, and LAYOUTAUG in hyper_2 to none.

Example for zero-shot GITA-7B Vision-Only on GVLQA-BASE:

7b none none none 16 Vision_Only GVLQA-Base none none
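In eval_loop.sh this corresponds to an entry like the following (again a sketch, assuming hyper_2 uses the same array format as in the training script):

declare -a hyper_2=(
    "7b none none none 16 Vision_Only GVLQA-Base none none"
)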

📜 Cite Us

@inproceedings{wei2024gita,
  title={GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning},
  author={Wei, Yanbin and Fu, Shuai and Jiang, Weisen and Zhang, Zejian and Zeng, Zhixiong and Wu, Qi and Kwok, James and Zhang, Yu},
  booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
  year={2024}
}

We hope you find our repository engaging and insightful. Your journey into the realm of vision-language graph reasoning starts here! 🚀

Feel free to explore, contribute, and be a part of this exciting new venture! ✨