Home

Awesome

<center><img src="images/graphgpt.png" style="width: 5%"> GraphGPT: Graph Instruction Tuning for Large Language Models</center>

<div align='center'> <a href='https://tjb-tech.github.io/'>Jiabin Tang</a>, <a href='http://yuh-yang.github.io'>Yuhao Yang</a>, <a href='#'>Wei Wei</a>, <a href='#'>Lei Shi</a>, <a href='#'>Suqi Cheng</a>, <a href='https://www.yindawei.com/'>Dawei Yin</a> and <a='https://sites.google.com/view/chaoh/home'>Chao Huang*</a>. (*Correspondence )

<strong><a href='https://sites.google.com/view/chaoh/home'>Data Intelligence Lab</a>@<a href='https://www.hku.hk/'>University of Hong Kong</a></strong>, Baidu Inc.

<a href='https://graphgpt.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='#'><img src='https://img.shields.io/badge/Demo-Page-purple'></a> <a href='https://arxiv.org/abs/2310.13023'><img src='https://img.shields.io/badge/Paper-PDF-orange'></a> YouTube <a href='https://mp.weixin.qq.com/s/rvKTFdCk719Q6hT09Caglw' target='_blank'><img src='https://img.shields.io/badge/中文-博客-blue'></a>

<img src='images/GraphGPT_notext.jpeg' /> <!-- [Jiabin Tang](https://tjb-tech.github.io/), [Yuhao Yang](http://yuh-yang.github.io), [Wei Wei](#), [Lei Shi](#), [Lixin Su](#), [Suqi Cheng](#), [Dawei Yin](https://www.yindawei.com/) and [Chao Huang](https://sites.google.com/view/chaoh/home)*. (*Correspondence ) **[Data Intelligence Lab](https://sites.google.com/view/chaoh/home)@[University of Hong Kong](https://www.hku.hk/)**, Baidu Inc. ----- <a href='https://graphgpt.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='#'><img src='https://img.shields.io/badge/Demo-Page-purple'></a> <a href='https://arxiv.org/abs/2310.13023'><img src='https://img.shields.io/badge/Paper-PDF-orange'></a> [![YouTube](https://badges.aleen42.com/src/youtube.svg)](#) • 🌐 <a href="https://mp.weixin.qq.com/s/rvKTFdCk719Q6hT09Caglw" target="_blank">中文博客</a> -->

This repository hosts the code, data and model weight of GraphGPT (SIGIR'24 full paper track).

<!-- ----------- --> </div>

🎉 News

0. Environment Update:

The lightweight training requires PyTorch 2.1+, so we need to update corresponding libraries:

# if you have set up the env for GraphGPT earlier
pip uninstall torch
pip uninstall torchvision
pip uninstall torchaudio
# CUDA 11.8
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118

# update pyg for the PyTorch 2.1+
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu118.html

# install lightning
pip install lightning

1. Update the Graph Data

Due to compatibility issues, if you are using the previously released graph data, we recommend downloading and updating it according to the provided link: updated graph data.

2. Run the Scripts

You can run the scripts as follow:

Stage-1:

cd path/to/GraphGPT
sh ./scripts/tune_script/graphgpt_stage1.sh

Stage-2:

cd path/to/GraphGPT
sh ./scripts/tune_script/graphgpt_stage2.sh
<details> <summary> <b>FQA</b> </summary> </details>

🎯🎯📢📢 We have made significant updates to the models and data used in our GraphGPT on 🤗 Huggingface. We highly recommend referring to the table below for further details:

🤗 Huggingface Address🎯 Description
huggingface.co/Jiabin99/GraphGPT-7B-mix-allIt's the checkpoint of our GraphGPT based on Vicuna-7B-v1.5 tuned on instruction data Arxiv-PubMed-mix-NC-LP
huggingface.co/Jiabin99/Arxiv-PubMed-GraphCLIP-GTIt's the checkpoint of the pre-trained graph transformer (GT) trained on Arxiv and PubMed using Text-Graph grounding.
huggingface.co/datasets/Jiabin99/Arxiv-PubMed-mix-NC-LPThis's the mixing instruction dataset with node classification (NC) and link prediction (LP) on Arxiv and PubMed.
huggingface.co/datasets/Jiabin99/GraphGPT-eval-instructionWe release all instruction dataset for our evaluation.
huggingface.co/datasets/Jiabin99/All_pyg_graph_dataWe merge all utilized graph data.
huggingface.co/datasets/Jiabin99/graph-matchingThis is the instruction data used in graph-matching stage.

👉 TODO


<span id='introduction'/>

Brief Introduction

we present the GraphGPT framework that aligns LLMs with graph structural knowledge with a graph instruction tuning paradigm.

For more technical details, kindly refer to the paper and the project website of our Graph.


<span id='Usage'/>

Getting Started

<span id='all_catelogue'/>

Table of Contents:


<span id='Code Structure'/>

1. Code Structure <a href='#all_catelogue'>[Back to Top]</a>

.
├── README.md
├── assets
│   ├── demo_narrow.gif
│   ├── screenshot_cli.png
│   ├── screenshot_gui.png
│   ├── server_arch.png
│   └── vicuna_logo.jpeg
├── format.sh
├── graphgpt
│   ├── __init__.py
│   ├── constants.py
│   ├── conversation.py
│   ├── eval
│   │   ├── README.md
│   │   ├── requirements.txt
│   │   ├── run_graphgpt.py
│   │   ├── run_graphgpt_LP.py
│   │   ├── run_vicuna.py
│   │   └── script
│   │       └── run_model_qa.yaml
│   ├── model
│   │   ├── GraphLlama.py
│   │   ├── __init__.py
│   │   ├── apply_delta.py
│   │   ├── apply_lora.py
│   │   ├── builder.py
│   │   ├── compression.py
│   │   ├── convert_fp16.py
│   │   ├── graph_layers
│   │   │   ├── __init__.py
│   │   │   ├── bpe_simple_vocab_16e6.txt.gz
│   │   │   ├── clip_graph.py
│   │   │   ├── graph_transformer.py
│   │   │   ├── mpnn.py
│   │   │   └── simple_tokenizer.py
│   │   ├── make_delta.py
│   │   ├── model_adapter.py
│   │   ├── model_registry.py
│   │   ├── monkey_patch_non_inplace.py
│   │   └── utils.py
│   ├── protocol
│   │   └── openai_api_protocol.py
│   ├── serve
│   │   ├── __init__.py
│   │   ├── api_provider.py
│   │   ├── bard_worker.py
│   │   ├── cacheflow_worker.py
│   │   ├── cli.py
│   │   ├── controller.py
│   │   ├── gateway
│   │   │   ├── README.md
│   │   │   └── nginx.conf
│   │   ├── gradio_block_arena_anony.py
│   │   ├── gradio_block_arena_named.py
│   │   ├── gradio_css.py
│   │   ├── gradio_patch.py
│   │   ├── gradio_web_server.py
│   │   ├── gradio_web_server_multi.py
│   │   ├── huggingface_api.py
│   │   ├── inference.py
│   │   ├── model_worker.py
│   │   ├── monitor
│   │   │   ├── basic_stats.py
│   │   │   ├── clean_battle_data.py
│   │   │   ├── elo_analysis.py
│   │   │   ├── hf_space_leaderboard_app.py
│   │   │   └── monitor.py
│   │   ├── openai_api_server.py
│   │   ├── register_worker.py
│   │   ├── test_message.py
│   │   └── test_throughput.py
│   ├── train
│   │   ├── graphchat_trainer.py
│   │   ├── llama_flash_attn_monkey_patch.py
│   │   ├── train_graph.py
│   │   ├── train_lora.py
│   │   └── train_mem.py
│   └── utils.py
├── playground
│   ├── inspect_conv.py
│   ├── test_embedding
│   │   ├── README.md
│   │   ├── test_classification.py
│   │   ├── test_semantic_search.py
│   │   └── test_sentence_similarity.py
│   └── test_openai_api
│       ├── anthropic_api.py
│       └── openai_api.py
├── pyproject.toml
├── scripts
│   ├── eval_script
│   │   └── graphgpt_eval.sh
│   ├── extract_graph_projector.py
│   ├── serving
│   │   ├── controller.yaml
│   │   └── model_worker.yaml
│   └── tune_script
│       ├── extract_projector.sh
│       ├── graphgpt_stage1.sh
│       └── graphgpt_stage2.sh
└── tests
    ├── test_openai_curl.sh
    ├── test_openai_langchain.py
    └── test_openai_sdk.py
<span id='Environment Preparation'/>

2. Environment Preparation <a href='#all_catelogue'>[Back to Top]</a>

Please first clone the repo and install the required environment, which can be done by running the following commands:

conda create -n graphgpt python=3.8

conda activate graphgpt

# Torch with CUDA 11.7
pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
# To support vicuna base model
pip3 install "fschat[model_worker,webui]"
# To install pyg and pyg-relevant packages
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-1.13.0+cu117.html
# Clone our GraphGPT
git clone https://github.com/HKUDS/GraphGPT.git
cd GraphGPT
# Install required libraries
pip install -r requirements.txt
<span id='Training GraphGPT'/>

3. Training GraphGPT <a href='#all_catelogue'>[Back to Top]</a>

GraphGPT tuning paradigm consists of two stages: (1) self-supervised instruction tuning; (2) task-specific instruction tuning.

<span id='Prepare Pre-trained Checkpoint'/>

3.1. Preparing Pre-trained Checkpoint <a href='#all_catelogue'>[Back to Top]</a>

GraphGPT is trained based on following excellent existing models. Please follow the instructions to prepare the checkpoints.

<span id='Self-Supervised Instruction Tuning'/>

3.2. Self-Supervised Instruction Tuning <a href='#all_catelogue'>[Back to Top]</a>

# to fill in the following path to run the first stage of our GraphGPT!
model_path=../vicuna-7b-v1.5-16k
instruct_ds=./data/stage_1/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=clip_gt_arxiv
output_model=./checkpoints/stage_1

wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb
<span id='Extract the Trained Projector'/>

3.3. Extract the Trained Projector <a href='#all_catelogue'>[Back to Top]</a>

We could extract the trained projector in the stage 1 by filling blanks at extract_projector.sh. There is an example as below:

# to fill in the following path to extract projector for the first tuning stage!
src_model=./checkpoints/stage_1
output_proj=./checkpoints/stage_1_projector/stage_1_projector.bin

python3.8 ./scripts/extract_graph_projector.py \
  --model_name_or_path ${src_model} \
  --output ${output_proj}
<span id='Task-Specific Instruction Tuning'/>

3.4. Task-Specific Instruction Tuning <a href='#all_catelogue'>[Back to Top]</a>

# to fill in the following path to run the second stage of our GraphGPT!
model_path=../vicuna-7b-v1.5-16k
instruct_ds=./data/stage_2/data_all_mix.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=clip_gt_arxiv
tuned_proj=./checkpoints/stage_1_projector/stage_1_projector.bin
output_model=./checkpoints/stage_2

wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --pretrain_graph_mlp_adapter ${tuned_proj} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end True\
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

<span id='Evaluating GraphGPT'/>

4. Evaluating GraphGPT <a href='#all_catelogue'>[Back to Top]</a>

<span id='Preparing Checkpoints and Data'/>

4.1. Preparing Checkpoints and Data <a href='#all_catelogue'>[Back to Top]</a>

<span id='Running Evaluation'/>

4.2. Running Evaluation <a href='#all_catelogue'>[Back to Top]</a>

You could start the second stage tuning by filling blanks at graphgpt_eval.sh. There is an example as below:

# to fill in the following path to extract projector for the second tuning stage!
output_model=./checkpoints/stage_2
datapath=./data/eval/arxiv_nc.json
graph_data_path=./graph_data/all_graph_data.pt
res_path=./output_stage_2_arxiv_nc
start_id=0
end_id=20000
num_gpus=2

python3.8 ./graphgpt/eval/run_graphgpt.py --model-name ${output_model}  --prompting_file ${datapath} --graph_data_path ${graph_data_path} --output_res_path ${res_path} --start_id ${start_id} --end_id ${end_id} --num_gpus ${num_gpus}

Contact

For any questions or feedback, feel free to contact Jiabin Tang.

Misc

<div align="center">

Stargazers repo roster for @HKUDS/GraphGPT

Forkers repo roster for @HKUDS/GraphGPT

Star History Chart

</div>

Citation

If you find GraphGPT useful in your research or applications, please kindly cite:

@articles{tang2023graphgpt,
title={GraphGPT: Graph Instruction Tuning for Large Language Models}, 
author={Jiabin Tang and Yuhao Yang and Wei Wei and Lei Shi and Lixin Su and Suqi Cheng and Dawei Yin and Chao Huang},
year={2023},
eprint={2310.13023},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

Acknowledgements

You may refer to related work that serves as foundations for our framework and code repository, Vicuna, LLaVa, We also partially draw inspirations from MiniGPT-4. For the text-graph grounding design, we leverages implementation from G2P2. The design of our website and README.md was inspired by NExT-GPT. Thanks for their wonderful works.