PaGraph

Scaling GNN Training on Large Graphs via Computation-aware Caching and Partitioning. Built on DGL with a PyTorch backend.

The master branch of PaGraph supports data caching and graph partitioning (paper). The overlap branch additionally supports overlapping data loading with GPU computation (paper).

Prerequisites

Prepare Dataset

Run

Install

$ python setup.py develop
$ python
>>> import PaGraph

Launch Graph Server

For more instructions, check out the server launch files.
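A minimal sketch of a server launch, assuming a launch script and a `--dataset` flag (the script path, dataset path, and `--num-workers` value here are illustrative assumptions; `--preprocess` and `--sample` are the flags referenced elsewhere in this README):

```shell
# Illustrative only: the script path and --dataset/--num-workers flags are
# assumptions; consult the server launch files for the actual interface.
$ python server/launch_server.py \
    --dataset /path/to/dataset \
    --num-workers 2 \
    --preprocess \
    --sample
```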

Run Trainer

Note: --remote-sample enables isolation. It must be used together with the server-side --sample flag.

Note: multi-GPU training requires setting OMP_NUM_THREADS; otherwise scalability will be poor.
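The two notes above can be combined into a sketch of a trainer launch (the script path and `--gpu` flag are illustrative assumptions; `OMP_NUM_THREADS`, `--remote-sample`, `--preprocess`, and `--num-hops` are the settings this README describes):

```shell
# Illustrative only: script path and --gpu are assumptions.
# OMP_NUM_THREADS avoids the low-scalability issue noted above;
# --remote-sample requires the server to be started with --sample.
$ OMP_NUM_THREADS=4 python trainer/train.py \
    --gpu 0,1 \
    --remote-sample \
    --preprocess \
    --num-hops 1
```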

Reminder

Partitioning is aware of the number of GNN model layers. Please keep --num-hops and --preprocess consistent between partitioning and training. Specifically, if --preprocess is enabled in both the server and the trainer, --num-hops should be the number of model layers minus 1. Otherwise, keep --num-hops equal to the number of GNN layers. In our settings, GCN and GraphSAGE have 2 layers.
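The rule above can be sketched as a small helper (illustrative only; `required_num_hops` is not a function PaGraph ships):

```python
def required_num_hops(num_model_layers: int, preprocess: bool) -> int:
    """Return the --num-hops value consistent with the model depth.

    With --preprocess enabled on both server and trainer, the first-hop
    aggregation is precomputed, so one fewer hop is needed.
    """
    return num_model_layers - 1 if preprocess else num_model_layers


# For the 2-layer GCN/GraphSAGE used in our settings:
print(required_num_hops(2, preprocess=True))   # --num-hops 1
print(required_num_hops(2, preprocess=False))  # --num-hops 2
```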

Profiling

Citing PaGraph

@inproceedings{lin2020pagraph,
  title={PaGraph: Scaling GNN training on large graphs via computation-aware caching},
  author={Lin, Zhiqi and Li, Cheng and Miao, Youshan and Liu, Yunxin and Xu, Yinlong},
  booktitle={Proceedings of the 11th ACM Symposium on Cloud Computing},
  pages={401--415},
  year={2020}
}
@article{bai2021efficient,
  title={Efficient Data Loader for Fast Sampling-based GNN Training on Large Graphs},
  author={Bai, Youhui and Li, Cheng and Lin, Zhiqi and Wu, Yufei and Miao, Youshan and Liu, Yunxin and Xu, Yinlong},
  journal={IEEE Transactions on Parallel and Distributed Systems},
  number={01},
  pages={1--1},
  year={2021},
  publisher={IEEE Computer Society}
}

License

This project is under the MIT License.

Future Plan

We plan to support PaGraph on MindSpore.