GADBench: Revisiting and Benchmarking Supervised Graph Anomaly Detection

This is the official implementation of the following paper:

GADBench: Revisiting and Benchmarking Supervised Graph Anomaly Detection

Jianheng Tang, Fengrui Hua, Ziqi Gao, Peilin Zhao, Jia Li

NeurIPS 2023 Datasets and Benchmarks Track

Environment Setup

Before you begin, ensure that you have Anaconda or Miniconda installed on your system. This guide assumes that you have a CUDA-enabled GPU.

# Create and activate a new Conda environment named 'GADBench'
conda create -n GADBench
conda activate GADBench

# Install Pytorch and DGL with CUDA 11.7 support
# If you use a different CUDA version, refer to the PyTorch and DGL websites for the appropriate versions.
conda install numpy
conda install pytorch==1.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
conda install -c dglteam/label/cu117 dgl

# Install additional dependencies
conda install pip
pip install xgboost pyod scikit-learn sympy pandas catboost bidict openpyxl
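
After installation, a quick check like the following can confirm that the packages listed above are visible to the active environment. This is a minimal sketch rather than part of GADBench: the helper names `find_missing` and `cuda_available` are hypothetical, and the module list simply mirrors the install commands (note that `pytorch` is imported as `torch` and `scikit-learn` as `sklearn`).

```python
import importlib.util

# Import names for the packages installed above. Some differ from the
# install name: pytorch imports as `torch`, scikit-learn as `sklearn`.
REQUIRED_MODULES = [
    "numpy", "torch", "dgl", "xgboost", "pyod", "sklearn",
    "sympy", "pandas", "catboost", "bidict", "openpyxl",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def cuda_available():
    """Return True/False if PyTorch is installed, or None if it is missing."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch
    return torch.cuda.is_available()


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
    print("CUDA available:", cuda_available())
```

If any module is reported missing, rerun the corresponding install command inside the activated GADBench environment.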

Dataset Preparation

GADBench utilizes 10 different datasets, which can be downloaded from this Google Drive link. After downloading, unzip all the files into a folder named datasets within the GADBench directory. GADBench includes an example dataset, reddit, which does not require manual downloading.

Because of copyright restrictions on DGraph-Fin and Elliptic, you need to download these two datasets yourself. The script to preprocess them can be found in datasets/preprocess.ipynb. You can also preprocess your own dataset by following the notebook.

Benchmarking

With Default Hyperparameters

Benchmark the GCN model on the example Reddit dataset under the fully-supervised setting (single trial).

python benchmark.py --trial 1 --datasets 0 --models GCN

Benchmark GIN and BWGNN on all 10 datasets in the semi-supervised setting (10 trials).

python benchmark.py --trial 10 --datasets 0-9 --models GIN-BWGNN --semi_supervised 1 

Benchmark 25 GAD models on all 10 datasets in the fully-supervised setting (10 trials). This requires an NVIDIA GPU with more than 48 GB of memory.

python benchmark.py --trial 10 --datasets 0-9 

Benchmark multiple models in the inductive setting.

python benchmark.py --datasets 5,8 --models GAT-GraphSAGE-XGBGraph --inductive 1

Benchmark multiple models on heterogeneous graph datasets.

python benchmark.py --datasets 10,11 --models RGCN-HGT-CAREGNN-H2FD
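
The commands above pass selectors in a few shapes: a single dataset ID (`0`), an inclusive ID range (`0-9`, which covers 10 datasets), a comma-separated ID list (`5,8`), and hyphen-separated model names (`GIN-BWGNN`). The sketch below shows one way such strings could be expanded. It is an illustration only: the helper names are hypothetical and the code is not taken from benchmark.py, whose actual parsing may differ. The ID-to-name mapping follows the Dataset Information table in this README.

```python
# Dataset IDs and names as listed in the Dataset Information table.
DATASET_NAMES = {
    0: "Reddit", 1: "Weibo", 2: "Amazon", 3: "YelpChi",
    4: "Tolokers", 5: "Questions", 6: "T-Finance", 7: "Elliptic",
    8: "DGraph-Fin", 9: "T-Social",
    10: "Amazon (Hetero)", 11: "YelpChi (Hetero)",
}


def parse_dataset_ids(spec):
    """Expand '0', '0-9', or '5,8' into a list of integer dataset IDs."""
    if "-" in spec:
        start, end = (int(part) for part in spec.split("-"))
        # Both endpoints are included, so '0-9' yields ten IDs.
        return list(range(start, end + 1))
    return [int(part) for part in spec.split(",")]


def parse_model_names(spec):
    """Split 'GAT-GraphSAGE-XGBGraph' into individual model names."""
    return spec.split("-")


if __name__ == "__main__":
    ids = parse_dataset_ids("5,8")
    print([DATASET_NAMES[i] for i in ids])
    print(parse_model_names("GAT-GraphSAGE-XGBGraph"))
```

Read this way, the inductive example above targets Questions and DGraph-Fin, and the heterogeneous example targets the two (Hetero) variants.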

With Optimal Hyperparameters through Random Search

Perform a random search of hyperparameters for the GCN model on the Reddit dataset in the fully-supervised setting (100 trials).

python random_search.py --trial 100 --datasets 0 --models GCN

Perform a random search of hyperparameters for all 26 models on all 10 datasets in the fully-supervised setting (100 trials).

python random_search.py --trial 100
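
For readers unfamiliar with the technique, random search draws each hyperparameter independently from a predefined space, scores the resulting configuration, and keeps the best one seen across all trials. The sketch below is a generic illustration and not the code in random_search.py: the search space, the `evaluate` function, and its toy scoring rule are hypothetical stand-ins for training a model and returning a validation metric.

```python
import random

# Hypothetical search space; the one used by random_search.py may differ.
SEARCH_SPACE = {
    "lr": [1e-4, 1e-3, 1e-2],
    "hidden_dim": [16, 32, 64],
    "dropout": [0.0, 0.3, 0.5],
}


def sample_config(space, rng):
    """Pick one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}


def evaluate(config):
    """Toy score standing in for a validation metric (higher is better)."""
    return (
        -abs(config["lr"] - 1e-3)
        - abs(config["dropout"] - 0.3)
        + config["hidden_dim"] / 1000
    )


def random_search(space, evaluate_fn, trials, seed=0):
    """Evaluate `trials` random configurations and return the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(trials):
        config = sample_config(space, rng)
        score = evaluate_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


if __name__ == "__main__":
    config, score = random_search(SEARCH_SPACE, evaluate, trials=100)
    print(config, score)
```

In this sketch the `trials` argument plays the role that `--trial 100` plays in the commands above.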

Reference

Dataset Information

In the table below, we provide a summary of all datasets in GADBench, detailing the source, number of nodes, edges, and node feature dimensions. We also highlight the ratio of anomalous labels, the training ratio in a fully-supervised setting, the concept of relations, and the type of node features. Misc. signifies that the node features comprise a mix of various attributes, potentially including categorical, numerical, and temporal data.

| ID | Name | #Nodes | #Edges | #Dim. | Anomaly | Train | Relation Concept | Feature Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Reddit | 10,984 | 168,016 | 64 | 3.3% | 40% | Under Same Post | Text Embedding |
| 1 | Weibo | 8,405 | 407,963 | 400 | 10.3% | 40% | Under Same Hashtag | Text Embedding |
| 2 | Amazon | 11,944 | 4,398,392 | 25 | 9.5% | 70% | Review Correlation | Misc. Information |
| 3 | YelpChi | 45,954 | 3,846,979 | 32 | 14.5% | 70% | Reviewer Interaction | Misc. Information |
| 4 | Tolokers | 11,758 | 519,000 | 10 | 21.8% | 40% | Work Collaboration | Misc. Information |
| 5 | Questions | 48,921 | 153,540 | 301 | 3.0% | 52% | Question Answering | Text Embedding |
| 6 | T-Finance | 39,357 | 21,222,543 | 10 | 4.6% | 50% | Transaction Record | Misc. Information |
| 7 | Elliptic | 203,769 | 234,355 | 166 | 9.8% | 50% | Payment Flow | Misc. Information |
| 8 | DGraph-Fin | 3,700,550 | 4,300,999 | 17 | 1.3% | 70% | Loan Guarantor | Misc. Information |
| 9 | T-Social | 5,781,065 | 73,105,508 | 10 | 3.0% | 40% | Social Friendship | Misc. Information |
| 10 | Amazon (Hetero) | 11,944 | 4,398,392 | 25 | 9.5% | 70% | Review Correlation | Misc. Information |
| 11 | YelpChi (Hetero) | 45,954 | 3,846,979 | 32 | 14.5% | 70% | Reviewer Interaction | Misc. Information |
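
As a worked reading of the table, dividing #Edges by #Nodes gives a rough sense of how densely connected each graph is. The sketch below copies the node and edge counts of the 10 homogeneous datasets from the table. The helper name `edges_per_node` is hypothetical, and because the table does not state whether #Edges counts each undirected edge once or twice, the ratio should be read only as a relative indicator.

```python
# (nodes, edges) copied from the Dataset Information table.
GRAPH_SIZES = {
    "Reddit": (10_984, 168_016),
    "Weibo": (8_405, 407_963),
    "Amazon": (11_944, 4_398_392),
    "YelpChi": (45_954, 3_846_979),
    "Tolokers": (11_758, 519_000),
    "Questions": (48_921, 153_540),
    "T-Finance": (39_357, 21_222_543),
    "Elliptic": (203_769, 234_355),
    "DGraph-Fin": (3_700_550, 4_300_999),
    "T-Social": (5_781_065, 73_105_508),
}


def edges_per_node(name):
    """Return the ratio of listed edges to listed nodes for one dataset."""
    nodes, edges = GRAPH_SIZES[name]
    return edges / nodes


if __name__ == "__main__":
    # Print datasets from densest to sparsest.
    for name in sorted(GRAPH_SIZES, key=edges_per_node, reverse=True):
        print(f"{name:<12}{edges_per_node(name):>8.1f}")
```

The ratios span more than two orders of magnitude across the benchmark, which is one reason memory requirements vary so much between datasets.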

Citation

If you use this package and find it useful, please cite our paper using the following BibTeX. Thanks! :)

@inproceedings{tang2023gadbench,
 author = {Tang, Jianheng and Hua, Fengrui and Gao, Ziqi and Zhao, Peilin and Li, Jia},
 booktitle = {Advances in Neural Information Processing Systems},
 pages = {29628--29653},
 title = {GADBench: Revisiting and Benchmarking Supervised Graph Anomaly Detection},
 url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/5eaafd67434a4cfb1cf829722c65f184-Paper-Datasets_and_Benchmarks.pdf},
 volume = {36},
 year = {2023}
}