CS-TAG

CS-TAG is a project for sharing public text-attributed graph (TAG) datasets and benchmarking the performance of different baseline methods. We welcome contributions of further datasets that are valuable for TAG research.

Datasets 🔔

We collect and construct 8 TAG datasets from ogbn-arxiv, Amazon, DBLP, and Goodreads. You can go to 'Files and versions' in CSTAG to find the datasets we have uploaded. In each dataset folder, you will find a csv file (which stores the text attributes of the dataset), a pt file (which stores the DGL graph), and a Feature folder (which stores the text embeddings we extracted from the PLM). You can use the initial node features we created, or extract node features yourself with our code. For a more detailed and clear process, please click there.😎
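For reference, below is a minimal sketch of loading one downloaded dataset in Python. The file names and paths are illustrative (adjust them to where you placed the files), and the .pt graph file is assumed here to be loadable with dgl.load_graphs; if it was saved with torch.save instead, use torch.load.

# Minimal loading sketch; paths and file names are illustrative.
import numpy as np
import pandas as pd
import dgl

# Text attributes of the nodes (one row per node).
text_df = pd.read_csv("data/CSTAG/Photo/Photo.csv")

# DGL graph stored in the .pt file (assumed to have been saved with dgl.save_graphs).
graphs, _ = dgl.load_graphs("data/CSTAG/Photo/Photo.pt")
graph = graphs[0]

# Pre-extracted PLM embeddings from the Feature folder, used as initial node features.
features = np.load("data/CSTAG/Photo/Feature/Photo_roberta_base_512_cls.npy")
print(graph, features.shape, len(text_df))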

Environments

You can quickly install the corresponding dependencies

conda env create -f environment.yml

Pipeline 🎮

We describe below how to use our repository to run the experiments reported in the paper. We are also adjusting the style of the repository to make it easier to use. (Please complete the 'Datasets' and feature preparation steps above first.)

1. GNN for Node Classification/Link Prediction

You can use 'ogbn-arxiv', 'Children', 'History', 'Fitness', 'Photo', 'Computers', 'webkb-cornell', 'webkb-texas', 'webkb-washington', and 'webkb-wisconsin' as the value of '--data_name'.

python GNN/GNN.py --data_name=Photo --dropout=0.2 --lr=0.005 --model_name=SAGE --n-epochs=1000 --n-hidden=256 --n-layers=3 --n-runs=5 --use_PLM=data/CSTAG/Photo/Feature/Photo_roberta_base_512_cls.npy
python GNN/GNN_Link.py --use_PLM=data/CSTAG/Photo/Feature/Photo_roberta_base_512_cls.npy --path=data/CSTAG/Photo/LinkPrediction/ --graph_path=data/CSTAG/Photo/Photo.pt --gnn_model=GCN

2. PLM for Classification Tasks

CUDA_VISIBLE_DEVICES=0,1 /usr/bin/env python sweep/dist_runner.py LMs/trainLM.py --att_dropout=0.1 --cla_dropout=0.1 --dataset=Computers_RS --dropout=0.1 --epochs=4 --eq_batch_size=180 --eval_patience=20000 --grad_steps=1 --label_smoothing_factor=0.1 --lr=4e-05 --model=Deberta --per_device_bsz=60 --per_eval_bsz=1000 --train_ratio=0.2 --val_ratio=0.1 --warmup_epochs=1 --gpus=0,1 --wandb_name OFF --wandb_id OFF 

3. TMLM for PreTraining

This part is still being updated and debugged.

4. TDK for PreTraining

This part is still being updated and debugged.

5. TCL for PreTraining

CUDA_VISIBLE_DEVICES=0,1 /usr/bin/env python sweep/dist_runner.py LMs/Train_Command/train_CL.py --PrtMode=TCL --att_dropout=0.1 --cla_dropout=0.1 --dataset=Photo_RS --dropout=0.1 --epochs=5 --eq_batch_size=60 --per_device_bsz=15 --grad_steps=2 --lr=5e-05 --model=Bert --warmup_epochs=1 --gpus=0,1 --cache_dir=exp/TCL/Photo/Bert_base/

6. TMDC for Training

This part is still being updated and debugged.

Create Your Model

If you want to add your own model to this code base, you can follow the steps below:

Add your GNN model:

  1. In GNN/model/GNN_library, define your model (you can refer to the code for models like GCN, GAT, etc.).
  2. In the args_init() function in GNN/model/GNN_arg.py, check whether it already contains all the parameters your model needs. If any are missing, you can easily add new parameters to this function.
  3. Import your model in GNN/GNN.py and add it to the gen_model() function (see the sketch after this list). You can then run the corresponding code to perform the node classification task.
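As an example, here is a minimal sketch of these steps. The class name MyGNN, the exact gen_model() signature, and the branch structure are hypothetical and should be adapted to the actual code in GNN/GNN.py and GNN/model/GNN_library; the DGL SAGEConv layer is only a placeholder for your own architecture.

# GNN/model/GNN_library (sketch): a hypothetical two-layer model named MyGNN.
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import SAGEConv

class MyGNN(nn.Module):
    def __init__(self, in_feats, n_hidden, n_classes, dropout=0.2):
        super().__init__()
        self.conv1 = SAGEConv(in_feats, n_hidden, aggregator_type="mean")
        self.conv2 = SAGEConv(n_hidden, n_classes, aggregator_type="mean")
        self.dropout = nn.Dropout(dropout)

    def forward(self, graph, feat):
        h = F.relu(self.conv1(graph, feat))
        h = self.dropout(h)
        return self.conv2(graph, h)

# GNN/GNN.py (sketch): register the model in gen_model(); mirror the existing
# entries for GCN, GAT, etc. when adding your branch.
def gen_model(args, in_feats, n_classes):
    if args.model_name == "MyGNN":
        return MyGNN(in_feats, args.n_hidden, n_classes, dropout=args.dropout)
    raise ValueError(f"Unknown model: {args.model_name}")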

Add your PLM model:

  1. Go to the LM/Model/ path and create a folder named after your model. Define __init__.py and config.py in it (see how these two files are defined in the other folders).
  2. Add the parameters you need to the parser() function in lm_utils (see the sketch after this list).
  3. If your model cannot be loaded from huggingface, pass the path to the folder containing your model via the parameter 'pretrain_path'.
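For step 2, a minimal sketch is shown below. The function name parser() and the parameter 'pretrain_path' come from the text above, while the specific argument '--my_plm_dropout' and the default values are hypothetical placeholders for whatever your model actually needs.

# lm_utils (sketch): extend the existing parser() function with your model's parameters.
import argparse

def parser():
    parser = argparse.ArgumentParser()
    # ... existing CS-TAG arguments ...
    parser.add_argument("--my_plm_dropout", type=float, default=0.1,
                        help="Dropout used by the newly added PLM (hypothetical example).")
    parser.add_argument("--pretrain_path", type=str, default=None,
                        help="Local folder to load the model from if it cannot be loaded from huggingface.")
    return parser.parse_args()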

Main experiments in CS-TAG

Representation learning on TAGs typically depends on two types of models: Graph Neural Networks (GNNs) and Language Models. For the latter, we use Pretrained Language Models (PLMs) to encode the text. For the GNNs, we follow the DGL toolkit and implement them in the GNN library. For the PLMs, we follow the Hugging Face Trainer and implement them in the same pipeline. We acknowledge that there is no absolutely fair comparison between these two types of baselines.

Citation

If you use our datasets, please consider citing our work:

@article{yan2023comprehensive,
  title={A Comprehensive Study on Text-attributed Graphs: Benchmarking and Rethinking},
  author={Yan, Hao and Li, Chaozhuo and Long, Ruosong and Yan, Chao and Zhao, Jianan and Zhuang, Wenwen and Yin, Jun and Zhang, Peiyan and Han, Weihao and Sun, Hao and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  pages={17238--17264},
  year={2023}
}