Code and Datasets for Text-space Graph Foundation Models: Comprehensive Benchmarks and New Insights

Note: we found a parameter error in our previous Prodigy evaluation; please see the updated results and new commands below. (For evaluation you must set -task classification and pass --eval_only True; otherwise there is label leakage.)

This is the code repo accompanying our paper "Text-space Graph Foundation Models: Comprehensive Benchmarks and New Insights."

We implement the following graph foundation model building blocks.

We support the following two scenarios.

Install

pip install -r requirements.txt

Datasets

We follow OneForAll's way of managing the datasets. We support the following datasets.

| Name | #Graphs | #Nodes | #Edges | Domain | Tasks | #Classes |
| --- | --- | --- | --- | --- | --- | --- |
| Cora | 1 | 2,708 | 10,556 | CS Citation | Node, Link | 7 |
| CiteSeer | 1 | 3,186 | 8,450 | CS Citation | Node, Link | 6 |
| Arxiv | 1 | 169,343 | 2,315,598 | CS Citation | Node, Link | 40 |
| Arxiv23 | 1 | 46,198 | 77,726 | CS Citation | Node, Link | 40 |
| History | 1 | 41,551 | 503,180 | E-commerce | Node, Link | 12 |
| Child | 1 | 76,875 | 2,325,044 | E-commerce | Node, Link | 24 |
| Computers | 1 | 87,229 | 1,256,548 | E-commerce | Node, Link | 10 |
| Photo | 1 | 48,362 | 873,782 | E-commerce | Node, Link | 12 |
| Sportsfit | 1 | 173,055 | 3,020,134 | E-commerce | Node, Link | 13 |
| Products | 1 | 316,513 | 19,337,722 | E-commerce | Node, Link | 39 |
| Amazon Ratings | 1 | 24,492 | 186,100 | E-commerce | Node, Link | 5 |
| Pubmed | 1 | 19,717 | 88,648 | Bio Citation | Node, Link | 3 |
| WikiCS | 1 | 11,701 | 431,726 | Knowledge | Node, Link | 10 |
| Tolokers | 1 | 11,758 | 1,038,000 | Anomaly | Node, Link | 2 |
| DBLP | 1 | 14,376 | 431,326 | CS Citation | Node, Link | 4 |
| CheMBL | 365,065 | 26 | 112 | Biology | Graph | 1,048 |
| PCBA | 437,092 | 26 | 56 | Biology | Graph | 128 |
| HIV | 41,127 | 26 | 55 | Biology | Graph | 2 |
| Tox21 | 7,831 | 19 | 39 | Biology | Graph | 12 |
| Bace | 1,513 | 34 | 74 | Biology | Graph | 2 |
| Bbbp | 2,039 | 24 | 52 | Biology | Graph | 2 |
| Muv | 93,087 | 24 | 53 | Biology | Graph | 17 |
| Toxcast | 8,575 | 19 | 39 | Biology | Graph | 588 |

For the multi-graph molecular datasets, #Nodes and #Edges are per-graph averages.

The processed files can be obtained from the following link.

Structure of the processed files:

geometric_data_processed.pt is the core storage object; its node_text_feat attribute stores the processed node features. data.pt contains the index file used to query the attributes stored in geometric_data_processed.pt. A comprehensive introduction to each column can be found in OneForAll's repo.
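
For a quick sanity check, the minimal sketch below loads the processed files with plain torch.load. The cora paths are hypothetical examples, and the (data, slices) tuple layout is an assumption based on PyG's InMemoryDataset convention; consult OneForAll's repo for the authoritative format.

import torch

# Hypothetical paths; point these at your unzipped processed files.
# Assumes geometric_data_processed.pt stores a (data, slices) tuple,
# as in PyG's InMemoryDataset convention.
data, slices = torch.load("cache_data/cora/geometric_data_processed.pt")
print(data.node_text_feat.shape)  # processed node features described above

index = torch.load("cache_data/cora/data.pt")  # index used to query the attributes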

To prepare the data, you can generate all the raw files yourself (run OneForAll for one epoch on all datasets), but we recommend downloading the preprocessed files and unzipping them into the main directory, as in the sketch below.
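
A minimal sketch of that step; the archive name here is a placeholder for whatever file the download link provides.

import zipfile

# Placeholder archive name; substitute the file you actually downloaded.
with zipfile.ZipFile("processed_files.zip") as zf:
    zf.extractall(".")  # unpack into the main (repo root) directory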

Code Structure

Directories

Main entries

Reproduce the results

OneForAll

LLaGA

  1. Use llm_train.sh to generate checkpoints
  2. Use llm_eval.sh or llm_eval_link.sh to generate the answer files for node/link-level tasks. For example, bash llm_eval.sh citeseer nc ./checkpoints/llaga-mistral-7b-hf-sbert-4-hop-token-linear-cora.3-citeseer.4-pubmed.3-nc-lp-projector/ citationcross
  3. Use llmres.sh to calculate the results

GCN-link

python3 fulllink.py --pre_train_datasets "cora-link" "citeseer-link" "pubmed-link" "arxiv-link" "arxiv23-link" "bookhis-link" "bookchild-link" "sportsfit-link" "products-link" "elecomp-link" "elephoto-link" --encoder gcn --num_layers 3 --num_hidden 128 --batch_size 512

BUDDY/SEAL

python3 linkpred.py --pre_train_datasets cora citeseer arxiv arxiv23 bookhis bookchild elecomp elephoto sportsfit products pubmed wikics --model BUDDY --cache_subgraph_features --max_hash_hops 3 --epochs 50
python3 linkpred.py --pre_train_datasets cora --model SEALGCN --hidden_channels 256 --num_hops 3

SSL

Check the paper for the best hyperparameters (the --cpuinf flag runs full-batch inference on the CPU, which is faster in our environment):

python3 sslmain.py --pre_train_datasets arxiv sportsfit products --method graphmae --num_heads 4 --num_out_heads 1 --num_layers 3 --num_hidden 1024 --residual --in_drop 0.5 --attn_drop 0.5 --norm 'batchnorm' --lr 0.01 --weight_decay 1e-5 --activation 'prelu' --mask_rate 0.75 --drop_edge_rate 0 --replace_rate 0.2 --scheduler --lrtype 'cosine' --save_model --max_epoch 5 --subgraph_size 1024 --warmup --cpuinf

Prodigy

Pretrain on Arxiv:

python experiments/run_single_experiment.py --dataset arxiv --root <root> --original_features False -ds_cap 24000 -val_cap 100 -test_cap 100 --emb_dim 256 --epochs 1 -ckpt_step 1000 -layers S2,U,M -lr 3e-4 -way 30 -shot 3 -qry 4 -eval_step 5000 -task cls_nm_sb -bs 1 -aug ND0.5,NZ0.5 -aug_test True -attr 1000 --device 0 --prefix MAG_PT_PRODIGY

Test on History:

python3 experiments/run_single_experiment.py --dataset bookhis --original_features True -ds_cap 300 -val_cap 300 -test_cap 300 --emb_dim 256 --epochs 1 -ckpt_step 1000 -layers S2,U,M -lr 3e-4 -way 12 -shot 3 -qry 4 -eval_step 50 -task classification  -bs 1 -aug ND0.5,NZ0.5 -aug_test True -attr 1000 --device 0 --prefix test --root <root> -pretrained <ckpt> --eval_only True

Acknowledgements

This code repo is heavily based on OneForAll (✨), BUDDY, LLaGA, GraphMAE, Prodigy, and CSTAG. Thanks to the authors for sharing their work!