# Ginex
Ginex is a GNN training system that enables efficient training on billion-scale datasets on a single machine by using SSDs as a memory extension. Ginex accelerates the entire training procedure through provably optimal in-memory caching of the feature vectors that reside on SSD, with no negative impact on training quality.
Please refer to the full paper here.
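The optimality claim comes from offline caching theory: when the full sequence of future accesses is known, Belady's rule (evict the cached item whose next use is farthest in the future) is optimal, and Ginex can exploit it because mini-batches are sampled ahead of execution, so the order of feature-vector accesses is known in advance. The sketch below only illustrates that eviction rule; it is not Ginex's implementation, and `belady_hits` and its interface are hypothetical.

```python
# Illustrative sketch of Belady's farthest-in-future eviction rule, the
# offline-optimal policy behind the "provably optimal" caching claim.
# NOT Ginex's implementation; `belady_hits` and its interface are hypothetical.
def belady_hits(accesses, cache_size):
    """Simulate a cache over a known access sequence and count hits."""
    # For each position i, find the next position at which accesses[i]
    # is requested again (infinity if it never is).
    next_use = [float("inf")] * len(accesses)
    last_seen = {}
    for i in range(len(accesses) - 1, -1, -1):
        next_use[i] = last_seen.get(accesses[i], float("inf"))
        last_seen[accesses[i]] = i

    cache = {}  # node id -> position of its next use
    hits = 0
    for i, node in enumerate(accesses):
        if node in cache:
            hits += 1
        elif len(cache) >= cache_size:
            # Evict the entry whose next use is farthest in the future.
            del cache[max(cache, key=cache.get)]
        cache[node] = next_use[i]  # insert, or refresh the hit entry
    return hits

# Toy access trace: with cache_size=2, Belady's rule yields 2 hits.
print(belady_hits([1, 2, 3, 1, 2, 4, 1], cache_size=2))
```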
## Installation and Running a Toy Example
Follow the instructions below to install the requirements and run a toy example with the ogbn_papers100M dataset.
### Basic Settings
- Disable `read_ahead`.
  ```
  sudo -s
  echo 0 > /sys/block/$block_device_name/queue/read_ahead_kb
  ```
- Install necessary Linux packages and upgrade pip.
  ```
  sudo apt-get install -y build-essential
  sudo apt-get install -y cgroup-tools
  sudo apt-get install -y unzip
  sudo apt-get install -y python3-pip
  pip3 install --upgrade pip
  ```
- Install a compatible NVIDIA CUDA driver and toolkit. Visit the NVIDIA CUDA Installation Guide for Linux for details.
- Install necessary Python modules.
  - PyTorch (version >= 1.9.0). Visit here for details.
  - `pip3 install tqdm`
  - `pip3 install ogb`
  - PyG. Visit here for details.
  - Ninja
    ```
    sudo wget https://github.com/ninja-build/ninja/releases/download/v1.8.2/ninja-linux.zip
    sudo unzip ninja-linux.zip -d /usr/local/bin/
    sudo update-alternatives --install /usr/bin/ninja ninja /usr/local/bin/ninja 1 --force
    ```
- Use cgroup to mimic the setting assumed in the paper, where the dataset size is much larger than the main memory size, with the ogbn_papers100M dataset. We recommend limiting the memory size to 8GB.
  ```
  sudo -s
  cgcreate -g memory:8gb
  echo 8000000000 > /sys/fs/cgroup/memory/8gb/memory.limit_in_bytes
  ```
- Make sure to allocate enough swap space. We recommend allocating at least 4GB of swap space. (A verification sketch for these settings follows this list.)
  ```
  sudo fallocate -l 4G swap.img
  sudo chmod 600 swap.img
  sudo mkswap swap.img
  sudo swapon swap.img
  ```
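Optionally, the settings above can be sanity-checked before training. Below is a minimal sketch assuming cgroup v1 paths and the `8gb` cgroup name used above; `block_device_name` is a placeholder to replace with your SSD's device name.

```python
# Sanity-check the Basic Settings. Assumes cgroup v1 and the names used above;
# block_device_name is a placeholder -- replace it with your SSD's device name.
block_device_name = "nvme0n1"  # hypothetical device name

# read_ahead should be disabled (0).
with open(f"/sys/block/{block_device_name}/queue/read_ahead_kb") as f:
    print("read_ahead_kb:", f.read().strip())

# The cgroup memory limit should be 8000000000 bytes (~8GB).
with open("/sys/fs/cgroup/memory/8gb/memory.limit_in_bytes") as f:
    print("memory.limit_in_bytes:", f.read().strip())

# Swap should total at least ~4GB.
with open("/proc/meminfo") as f:
    print(next(line.strip() for line in f if line.startswith("SwapTotal")))
```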
### Running a Toy Example
- Clone our library.
  ```
  git clone https://github.com/SNU-ARC/Ginex.git
  ```
- Prepare dataset.
  ```
  python3 prepare_dataset.py
  ```
- Preprocess (neighbor cache construction).
  ```
  python3 create_neigh_cache.py --neigh-cache-size 6000000000
  ```
- Get `PYTHONPATH` (a standard-library cross-check sketch follows this list).
  ```
  python3 get_pythonpath.py
  ```
- Run the baseline, i.e., PyG extended to support disk-based processing of graph datasets (denoted as PyG+ in the paper). Replace `PYTHONPATH=...` with the output of the "Get `PYTHONPATH`" step. The `-W ignore` option is used to suppress warnings.
  ```
  sudo PYTHONPATH=/home/user/.local/lib/python3.8/site-packages cgexec -g memory:8gb python3 -W ignore run_baseline.py
  ```
- Run Ginex. Replace `PYTHONPATH=...` with the output of the "Get `PYTHONPATH`" step. The `-W ignore` option is used to suppress warnings.
  ```
  sudo PYTHONPATH=/home/user/.local/lib/python3.8/site-packages cgexec -g memory:8gb python3 -W ignore run_ginex.py --neigh-cache-size 6000000000 --feature-cache-size 6000000000 --sb-size 1500
  ```
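If the output of `get_pythonpath.py` is unclear on your machine, note that the example path above resembles the per-user site-packages directory, which can be cross-checked with Python's standard library. This is only a sketch for cross-checking, not a substitute for the repository's script:

```python
# Cross-check for the PYTHONPATH value used above. site.getusersitepackages()
# returns the per-user site-packages directory, e.g.
# /home/user/.local/lib/python3.8/site-packages on Ubuntu with Python 3.8.
import site
print(site.getusersitepackages())
```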
## Results
The following is the result of the toy example on our local server.
### Environment
- CPU: Intel Xeon Gold 6244 CPU 8-core (16 logical cores with hyper-threading) @ 3.60GHz
- GPU: NVIDIA Tesla V100 16GB PCIe
- Memory: Samsung DDR4-2666 64GB (32GB X 2) (cgroup of 8GB is used)
- Storage: Samsung PM1725b 8TB PCIe Gen3 8-lane
- S/W: Ubuntu 18.04.5 & CUDA 11.4 & Python 3.6.9 & PyTorch 1.9
### Baseline
Per-epoch training time: 216.1687 sec
### Ginex
Per-epoch training time: 99.5562 sec
(Speedup of 2.2x)
## Maintainers
- Yeonhong Park (parkyh96@gmail.com)
- Sunhong Min (sunhongmin@snu.ac.kr)
## Citation
Please cite our paper if you find it useful for your work:
```bibtex
@inproceedings{park2022vldb,
  author    = {Yeonhong Park and Sunhong Min and Jae W. Lee},
  title     = {Ginex: SSD-enabled Billion-scale Graph Neural Network Training on a Single Machine via Provably Optimal In-memory Caching},
  booktitle = {Proceedings of the VLDB Endowment},
  volume    = {15},
  number    = {11},
  year      = {2022}
}
```