
<center>HiGPT: Heterogeneous Graph Language Model</center>

<img src='images/HiGPT_article_cover.png' />

Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Long Xia, Dawei Yin and Chao Huang*. (*Correspondence)

Data Intelligence Lab@University of Hong Kong, Baidu Inc.


<a href='https://HiGPT-HKU.github.io/'><img src='https://img.shields.io/badge/Project-Page-Green'></a> <a href='#'><img src='https://img.shields.io/badge/Demo-Page-purple'></a> <a href='https://arxiv.org/abs/2402.16024'><img src='https://img.shields.io/badge/Paper-PDF-orange'></a> YouTube • 🌐 <a href="https://mp.weixin.qq.com/s/rvKTFdCk719Q6hT09Caglw" target="_blank">Chinese Blog</a>

This repository hosts the code, data and model weights of HiGPT.


🎉 News

🎯🎯📢📢 We have made significant updates to the models used in our HiGPT on 🤗 Huggingface. We highly recommend referring to the table below for further details:

| 🤗 Huggingface Address | 🎯 Description |
| --- | --- |
| https://huggingface.co/Jiabin99/In-Context-HGT | The trained in-context heterogeneous graph tokenizer using our lightweight text-graph contrastive alignment. |
| https://huggingface.co/Jiabin99/HiGPT | The checkpoint of our HiGPT based on Vicuna-7B-v1.5, tuned on 60-shot IMDB graph instruction data. |
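To fetch these weights locally, one option is to clone the repositories from 🤗 Huggingface (a minimal sketch, assuming `git-lfs` is installed; the target directories are up to you):

```shell
# A sketch for downloading the released weights; assumes git-lfs is installed.
git lfs install
git clone https://huggingface.co/Jiabin99/In-Context-HGT
git clone https://huggingface.co/Jiabin99/HiGPT
```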

👉 TODO


<span id='introduction'/>

Brief Introduction

We present the HiGPT framework, which aligns LLMs with heterogeneous graph structural knowledge through a heterogeneous graph instruction tuning paradigm.


For more technical details, kindly refer to the paper and the project website of HiGPT.


<span id='Usage'/>

Getting Started

<span id='all_catelogue'/>

Table of Contents:

- <a href='#Environment Preparation'>1. Environment Preparation</a>
- <a href='#Data Preparation'>2. Data Preparation</a>
- <a href='#Training HiGPT'>3. Training HiGPT</a>
- <a href='#Evaluating HiGPT'>4. Evaluating HiGPT</a>


<span id='Environment Preparation'/>

1. Environment Preparation <a href='#all_catelogue'>[Back to Top]</a>

Please first clone the repo and install the required environment, which can be done by running the following commands:

```shell
conda create -n higpt python=3.8

conda activate higpt

# Torch with CUDA 11.7
pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
# To support vicuna base model
pip3 install "fschat[model_worker,webui]"
# To install pyg and pyg-relevant packages
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-1.13.0+cu117.html
# Clone our HiGPT
git clone https://github.com/HKUDS/HiGPT.git
cd HiGPT
# Install required libraries
pip install -r requirements.txt
```
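
After installation, a quick sanity check (our suggestion, not part of the official setup) can confirm that torch sees CUDA and that PyG imports cleanly:

```shell
# Verify the CUDA-enabled torch build and the PyG installation.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import torch_geometric; print(torch_geometric.__version__)"
```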
<span id='Data Preparation'/>

2. Data Preparation <a href='#all_catelogue'>[Back to Top]</a>

🎯 (Temporarily recommended from 2024.03.25) Due to the large size of the stage 1 instruction data, we have not yet uploaded the complete version to the Internet Archive. In the meantime, the complete stage 1 instruction data can be downloaded from Baidu Netdisk: https://pan.baidu.com/s/1rtdxzhRRcEUEpXoWKml36A?pwd=id6u. We will upload it to the archive as soon as possible; all other data used by HiGPT can already be accessed on the archive.
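
Once you have fetched `matching_instruction.tar.gz` from the Baidu Netdisk link, unpack it under `hi_datasets` (a sketch, assuming the archive matches the one described in the deprecated instructions below):

```shell
# Place the manually downloaded archive under hi_datasets and unpack it.
cd /path/to/HiGPT/hi_datasets
tar -xzvf matching_instruction.tar.gz
rm -f matching_instruction.tar.gz
```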

(Temporarily deprecated from 2024.03.25) The tuning data of our HiGPT consists of two parts, i.e., the heterogeneous graph corpus (stage 1) and the heterogeneity-aware graph instructions (stage 2). You can `cd hi_datasets` and run `sh get_stage1_data.sh` to download the stage 1 data:

```shell
cd /path/to/HiGPT/hi_datasets

wget https://archive.org/download/higpt_stage1/matching_instruction.tar.gz

tar -xzvf matching_instruction.tar.gz

rm -f matching_instruction.tar.gz
```

(Always recommended) Similarly, you can run `sh get_stage2_data.sh` to download the stage 2 data:

```shell
cd /path/to/HiGPT/hi_datasets

mkdir stage2_data
cd stage2_data

wget https://archive.org/download/higpt_stage2/instruct_ds_dblp.tar.gz
wget https://archive.org/download/higpt_stage2/processed_dblp.tar.gz
wget https://archive.org/download/higpt_stage2/instruct_ds_imdb.tar.gz
wget https://archive.org/download/higpt_stage2/processed_imdb.tar.gz
wget https://archive.org/download/higpt_stage2/instruct_ds_acm.tar.gz
wget https://archive.org/download/higpt_stage2/processed_acm.tar.gz

mkdir DBLP
mkdir IMDB
mkdir acm

tar -xzvf instruct_ds_dblp.tar.gz -C DBLP
tar -xzvf processed_dblp.tar.gz -C DBLP
tar -xzvf instruct_ds_imdb.tar.gz -C IMDB
tar -xzvf processed_imdb.tar.gz -C IMDB
tar -xzvf instruct_ds_acm.tar.gz -C acm
tar -xzvf processed_acm.tar.gz -C acm

rm -f instruct_ds_dblp.tar.gz
rm -f processed_dblp.tar.gz
rm -f instruct_ds_imdb.tar.gz
rm -f processed_imdb.tar.gz
rm -f instruct_ds_acm.tar.gz
rm -f processed_acm.tar.gz
```
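
After unpacking, a quick listing (optional, our suggestion) should show both an `instruct_ds_*` directory and a `processed_*` directory inside each dataset folder:

```shell
# Optional: confirm the expected layout produced by the tar commands above.
for d in DBLP IMDB acm; do
    echo "== ${d} =="
    ls "${d}"
done
```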
<span id='Training HiGPT'/>

3. Training HiGPT <a href='#all_catelogue'>[Back to Top]</a>

The HiGPT tuning paradigm consists of two stages: (1) instruction tuning with the heterogeneous graph corpus; (2) heterogeneity-aware fine-tuning.

<span id='Offline Heterogeneous Graph Tokenizing'/>

3.0. Offline Heterogeneous Graph Tokenizing <a href='#all_catelogue'>[Back to Top]</a>

Since the graph tokenizer's parameters are frozen during both training stages, we use Offline Heterogeneous Graph Tokenizing to preprocess the instruction data and accelerate model training. The data downloaded in <a href='#Data Preparation'>Data Preparation</a> has already been processed with a pre-trained graph tokenizer. If you need to process the data with your own graph tokenizer, you can refer to the following commands:

```shell
# offline tokenizing for a single stage 2 annotation file
cd /path/to/HiGPT

ann_path=./hi_datasets/stage2_data/IMDB/instruct_ds_imdb/ann/IMDB_test_std_0_1000.json
data_type=imdb
graph_path=./hi_datasets/stage2_data/IMDB/instruct_ds_imdb/graph_data/test

python run_offline_hgt_tokenizer_single.py --ann_path ${ann_path} \
                                 --data_type ${data_type} \
                                 --graph_path ${graph_path}
```
```shell
cd /path/to/HiGPT

# offline tokenizing for stage1 matching instruction
data_type=imdb
graph_root=./hi_datasets/matching_instruction
dsname_list=(instruct_ds_matching_movie instruct_ds_node_matching)
pretrained_hgt=./MetaHGT_imdb_dblp_epoch5

for dsname in "${dsname_list[@]}"
do
    python run_offline_hgt_tokenizer.py --dsname ${dsname} \
                                 --data_type ${data_type} \
                                 --pretrained_gnn_path ${pretrained_hgt} \
                                 --graph_root ${graph_root}
done

data_type=dblp
graph_root=./hi_datasets/matching_instruction
dsname_list=(instruct_ds_matching_author instruct_ds_matching_paper instruct_ds_node_matching)
pretrained_hgt=./MetaHGT_imdb_dblp_epoch5

for dsname in "${dsname_list[@]}"
do
    python run_offline_hgt_tokenizer.py --dsname ${dsname} \
                                 --data_type ${data_type} \
                                 --pretrained_gnn_path ${pretrained_hgt} \
                                 --graph_root ${graph_root}
done
```
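
The two loops above differ only in `data_type` and the dataset list; if you prefer, they can be collapsed into a single pass (an equivalent sketch using a bash associative array, not an official script):

```shell
# Consolidated form of the imdb and dblp loops above (requires bash 4+).
graph_root=./hi_datasets/matching_instruction
pretrained_hgt=./MetaHGT_imdb_dblp_epoch5

declare -A ds_lists=(
    [imdb]="instruct_ds_matching_movie instruct_ds_node_matching"
    [dblp]="instruct_ds_matching_author instruct_ds_matching_paper instruct_ds_node_matching"
)
for data_type in "${!ds_lists[@]}"; do
    for dsname in ${ds_lists[$data_type]}; do
        python run_offline_hgt_tokenizer.py --dsname ${dsname} \
                                     --data_type ${data_type} \
                                     --pretrained_gnn_path ${pretrained_hgt} \
                                     --graph_root ${graph_root}
    done
done
```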
<span id='Prepare Pre-trained Checkpoint'/>

3.1. Preparing Pre-trained Checkpoint <a href='#all_catelogue'>[Back to Top]</a>

HiGPT is trained based on the following excellent existing models. Please follow the instructions below to prepare the checkpoints.
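
For reference, the scripts below expect the Vicuna-7B-v1.5-16k base model and the pre-trained MetaHGT graph tokenizer. One way to fetch them is sketched here; we assume the In-Context-HGT repository corresponds to the `MetaHGT_imdb_dblp_epoch5` directory used in the scripts, and the local paths are placeholders to adjust:

```shell
# A sketch: fetch the base LLM and the pre-trained graph tokenizer
# (target paths are assumptions; point the training scripts at them).
git lfs install
git clone https://huggingface.co/lmsys/vicuna-7b-v1.5-16k /path/to/vicuna-7b-v1.5-16k
git clone https://huggingface.co/Jiabin99/In-Context-HGT ./MetaHGT_imdb_dblp_epoch5
```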

<span id='Self-Supervised Instruction Tuning'/>

3.2. Instruction Tuning with Heterogeneous Graph Corpus <a href='#all_catelogue'>[Back to Top]</a>

You can start the first stage of tuning by filling in the blanks in `higpt_stage_1.sh`. An example is shown below:

```shell
#!/bin/bash
# fill in the following paths to run the first stage of our HiGPT!
cd /path/to/HiGPT

data_path=instruct_ds_matching_author,instruct_ds_matching_movie,instruct_ds_matching_paper,instruct_ds_node_matching,instruct_ds_node_matching_imdb
graph_root=./hi_datasets/matching_instruction
graph_tower=MetaHGT_imdb_dblp_epoch5  # pre-trained heterogeneous graph tokenizer (as in stage 2 below)
output_dir=./checkpoints/higpt-metahgt-stage1-7b-dblp-imdb-epoch1
base_model=/path/to/vicuna-7b-v1.5-16k

wandb offline

python3.8 -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    higpt/train/train_hete_nopl.py \
    --model_name_or_path ${base_model} \
    --version v1 \
    --data_path ${data_path} \
    --graph_content /root/paddlejob/workspace/env_run/llm/GraphChat/playground/data/arxiv_ti_ab.json \
    --graph_root ${graph_root} \
    --graph_tower ${graph_tower} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end True \
    --bf16 False \
    --output_dir ${output_dir} \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0.0001 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb \
    --hetero_key_path /root/paddlejob/workspace/env_run/output/sample_instruct_ds/ann/hetero_key_order.json
```
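
After filling in the paths, one convenient way to launch the stage 1 run and keep a local log (our suggestion, not part of the official workflow):

```shell
# Launch stage 1 with a log file; set --nproc_per_node in the script to
# match the number of available GPUs.
cd /path/to/HiGPT
sh higpt_stage_1.sh 2>&1 | tee stage1.log
```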
<span id='Extract the Trained Projector'/>

3.3. Extract the Trained Projector <a href='#all_catelogue'>[Back to Top]</a>

You can extract the projector trained in stage 1 by filling in the blanks in `extract_projector.sh`. An example is shown below:

```shell
#!/bin/bash
# fill in the following paths to extract the projector from the first tuning stage!
cd /path/to/HiGPT
stage1_model=./checkpoints/higpt-metahgt-stage1-7b-dblp-imdb-epoch1
graph_projector=./checkpoints/higpt_stage1_projector_metahgt_dblp_imdb_epoch1/higpt-metahgt-stage1-7b.bin

python3.8 ./scripts/extract_graph_projector.py \
  --model_name_or_path ${stage1_model} \
  --output ${graph_projector}
```
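
To sanity-check the extracted projector, you can list the saved tensors (a sketch; we assume the `.bin` file is a torch-serialized state dict):

```shell
# Optional: print the parameter names and shapes stored in the projector file.
python -c "
import torch
sd = torch.load('./checkpoints/higpt_stage1_projector_metahgt_dblp_imdb_epoch1/higpt-metahgt-stage1-7b.bin', map_location='cpu')
for k, v in sd.items():
    print(k, tuple(v.shape))
"
```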
<span id='Task-Specific Instruction Tuning'/>

3.4. Heterogeneity-aware Fine-tuning <a href='#all_catelogue'>[Back to Top]</a>

You can start the second stage of tuning with different numbers of shots (e.g., 1, 3, 5, 10, 20, 40, 60) by filling in the blanks in `higpt_stage_2.sh`. An example is shown below:

```shell
#!/bin/bash
# fill in the following paths to run the second stage of our HiGPT!
cd /path/to/HiGPT
base_model=/path/to/vicuna-7b-v1.5-16k
graph_root=./hi_datasets/stage2_data/IMDB
graph_projector=./checkpoints/higpt_stage1_projector_metahgt_dblp_imdb_epoch1/higpt-metahgt-stage1-7b.bin

num_epochs=15
num_shot_list=(1 3 5 10 20 40 60)
for num_shot in "${num_shot_list[@]}"
do
    python3.8 -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20010 \
    higpt/train/train_hete_nopl.py \
    --model_name_or_path ${base_model} \
    --version v1 \
    --data_path instruct_ds_imdb \
    --graph_content /root/paddlejob/workspace/env_run/llm/GraphChat/playground/data/arxiv_ti_ab.json \
    --graph_root ${graph_root} \
    --graph_tower MetaHGT_imdb_dblp_epoch5 \
    --pretrain_graph_mlp_adapter ${graph_projector} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end True \
    --bf16 False \
    --output_dir ./checkpoints/higpt-stage2-imdb-metahgt-epoch${num_epochs}-mixcot-true-${num_shot} \
    --num_train_epochs ${num_epochs} \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0.0001 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb \
    --hetero_key_path /root/paddlejob/workspace/env_run/output/sample_instruct_ds/ann/hetero_key_order.json \
    --num_shot ${num_shot}
done
```
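
Before moving on to evaluation, you can optionally verify that each few-shot run produced a checkpoint directory (a sketch following the `output_dir` pattern above):

```shell
# Check that all seven few-shot checkpoints exist.
for num_shot in 1 3 5 10 20 40 60; do
    dir=./checkpoints/higpt-stage2-imdb-metahgt-epoch15-mixcot-true-${num_shot}
    [ -d "${dir}" ] && echo "OK       ${dir}" || echo "MISSING  ${dir}"
done
```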

<span id='Evaluating HiGPT'/>

4. Evaluating HiGPT <a href='#all_catelogue'>[Back to Top]</a>

<span id='Preparing Checkpoints'/>

4.1. Preparing Checkpoints <a href='#all_catelogue'>[Back to Top]</a>

Checkpoints: You can evaluate HiGPT using your own trained model or our released checkpoints (see the 🤗 Huggingface table above).
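
To use our released checkpoint, one option is cloning it from Huggingface; the target directory name below follows the pattern the evaluation loops expect for the 60-shot IMDB model and is our assumption, so rename it to fit your setup:

```shell
# A sketch: fetch the released checkpoint into the directory name the
# evaluation loop expects (assumed; adjust as needed).
git lfs install
git clone https://huggingface.co/Jiabin99/HiGPT \
    ./checkpoints/higpt-stage2-imdb-metahgt-epoch15-mixcot-true-60
```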

<span id='Running Evaluation'/>

4.2. Running Evaluation <a href='#all_catelogue'>[Back to Top]</a>

Fill in the paths in the script below to evaluate HiGPT on the IMDB test set:

```shell
#!/bin/bash
# fill in the following paths to run the evaluation for the second tuning stage!
cd /path/to/HiGPT
output_model=./checkpoints
datapath=./hi_datasets/stage2_data/IMDB
res_path=./output_res_imdb
num_epochs=15
num_shot_list=(1 3 5 10 20 40 60)
for num_shot in "${num_shot_list[@]}"
do
  for ((cot_case=0; cot_case<=0; cot_case++))
  do
    python3.8 ./higpt/eval/run_higpt.py \
      --model-name ${output_model}/higpt-stage2-imdb-metahgt-epoch${num_epochs}-mixcot-true-${num_shot} \
      --prompting_file ${datapath}/instruct_ds_imdb/ann_processed_MetaHGT_imdb_dblp_epoch5/cot_test/IMDB_test_std_0_1000_cot_${cot_case}.json \
      --graph_root ${datapath} \
      --output_res_path ${res_path}/imdb_test_res_epoch_${num_epochs}_std_${num_shot}_shot_cot_${cot_case} \
      --start_id 0 --end_id 1000 --num_gpus 4
  done
done
```
To run evaluation with in-context examples, use `run_higpt_incontext.py` instead:

```shell
#!/bin/bash
# evaluation with in-context examples
cd /path/to/HiGPT
output_model=./checkpoints
datapath=./hi_datasets/stage2_data/IMDB
res_path=./output_res_imdb
num_epochs=15
num_shot_list=(1 3 5 10 20 40 60)
cot_case_list=(0)
incontext_dir=in_context_1_shot
for num_shot in "${num_shot_list[@]}"
do
  for cot_case in "${cot_case_list[@]}"
  do
    python3.8 ./higpt/eval/run_higpt_incontext.py \
      --model-name ${output_model}/higpt-stage2-imdb-metahgt-epoch${num_epochs}-mixcot-true-${num_shot} \
      --prompting_file ${datapath}/instruct_ds_imdb/ann_processed_MetaHGT_imdb_dblp_epoch5/${incontext_dir}/IMDB_test_std_0_1000_cot_${cot_case}_in_context.json \
      --graph_root ${datapath} \
      --output_res_path ${res_path}_${incontext_dir}/imdb_test_res_epoch_${num_epochs}_std_${num_shot}_shot_cot_${cot_case} \
      --start_id 0 --end_id 1000 --num_gpus 4
  done
done
```
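
Once the runs finish, a quick listing (optional; the exact output format is produced by `run_higpt.py`) confirms that every configuration wrote results:

```shell
# Optional: check that each few-shot evaluation produced output.
res_path=./output_res_imdb
for num_shot in 1 3 5 10 20 40 60; do
    ls -lh ${res_path}/imdb_test_res_epoch_15_std_${num_shot}_shot_cot_0* 2>/dev/null \
        || echo "no results for ${num_shot}-shot"
done
```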


Contact

For any questions or feedback, feel free to contact Jiabin Tang.

Citation

If you find HiGPT useful in your research or applications, please kindly cite:

```bibtex
@misc{tang2024higpt,
      title={HiGPT: Heterogeneous Graph Language Model},
      author={Jiabin Tang and Yuhao Yang and Wei Wei and Lei Shi and Long Xia and Dawei Yin and Chao Huang},
      year={2024},
      eprint={2402.16024},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Acknowledgements

You may refer to the related work that serves as the foundation of our framework and code repository: Vicuna, LLaVA, and GraphGPT. We also partially draw inspiration from MiniGPT-4. The design of our website and README.md was inspired by NExT-GPT. Thanks for their wonderful works.