Scalable GNN training by graph summarization

This repository contains an implementation of the coarsening via convolution matching algorithm along with scripts for performing ablation studies and comparing the method to baselines. The motivation for this code is to experiment with the coarsening algorithm and demonstrate its utility for scalable graph neural network (GNN) training.

Requirements

The scripts were developed and run with the following packages installed:

boto3==1.24.88
botocore==1.27.88
deeprobust==0.2.5
dgl==0.9.1
gensim==3.8.3
networkx==2.8.4
numpy==1.23
ogb==1.3.4
PyGSP==0.5.1
PyYAML==6.0
pandas==1.4.3
scikit-learn==1.1.1
scipy==1.8.1
sklearn==0.0
sortedcontainers==2.4.0
torch==1.12.1
torch-scatter==2.0.9
torch-sparse==0.6.15
torch-geometric==2.1.0
tqdm==4.64.0

The requirements are also specified in requirements.txt.

Usage

This repository is designed for experimenting with the coarsening via convolution matching algorithm and comparing its performance with baseline algorithms for scalable GNN training via graph summarization.

Experiment Scripts, Configurations, and Parameters

The ./Experiments directory contains the scripts and configuration files for running all the implemented graph summarization algorithms and GNN models on all the datasets. The code is structured such that experiments are categorized by the task the GNN is being trained for: node classification or link prediction.

The configuration file ./Experiments/config/run_config.yaml is a YAML-formatted file specifying the combinations of graph summarization methods, GNNs, losses, and optimization methods that the experiment scripts will iterate over. In this file, you may also specify the number of splits you wish to run and the location of the base output directory, i.e., where results will be saved.
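As a hedged illustration of how such a run configuration can be loaded and expanded into experiment combinations with PyYAML (pinned in requirements.txt), consider the sketch below. The key names ("summarizers", "models", "num_splits", "output_dir") and values are assumptions for illustration, not the repository's actual schema.

```python
import yaml  # PyYAML, pinned in requirements.txt

# Illustrative stand-in for the contents of run_config.yaml.
config_text = """
summarizers: [ConvolutionMatchingCoarsener, RandomNodeSampler]
models: [GCN, GraphSAGE]
num_splits: 3
output_dir: ./results
"""

run_config = yaml.safe_load(config_text)

# The experiment scripts iterate over the cross-product of the
# configured summarizers and models.
combinations = [
    (summarizer, model)
    for summarizer in run_config["summarizers"]
    for model in run_config["models"]
]
```

Here, two summarizers and two models yield four experiment combinations, each repeated for `num_splits` splits.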

The parameters of the graph summarization algorithms, GNN models, and optimizer are specified in the ./Experiments/config/params.yaml file.

To run every combination of model and parameter specified in ./Experiments/config/run_config.yaml and ./Experiments/config/params.yaml for link prediction tasks, run:

python3 ./Experiments/link_prediction_graph_summarization.py

To run every combination of model and parameter specified in ./Experiments/config/run_config.yaml and ./Experiments/config/params.yaml for node classification tasks, run:

python3 ./Experiments/node_classification_graph_summarization.py

The experiment scripts use the utilities provided in the files in the ./Experiments/utils directory. Importantly, configurations are read and parsed using utilities in ./Experiments/utils/config_utils.py. If a new graph summarizer, model, dataset, or optimizer is implemented, a mapping from its name in ./Experiments/config/run_config.yaml to its class must be added to the dictionaries in ./Experiments/utils/config_utils.py.
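The registration pattern can be sketched as follows. The class and dictionary names here are illustrative assumptions, not the actual identifiers in ./Experiments/utils/config_utils.py.

```python
# Hypothetical stand-ins for summarizer classes defined in the repository.
class ConvolutionMatchingCoarsener:
    pass

class RandomEdgeSampler:
    pass

# A plain name-to-class dictionary of the kind described above. A new
# summarizer must be added here before run_config.yaml can refer to it.
SUMMARIZER_REGISTRY = {
    "convolution_matching": ConvolutionMatchingCoarsener,
    "random_edge_sampler": RandomEdgeSampler,
}

def get_summarizer(name):
    """Resolve a run_config.yaml entry to its summarizer class."""
    return SUMMARIZER_REGISTRY[name]
```

Forgetting to register a new class would surface as a lookup failure when the experiment scripts parse the configuration, which is why the mapping step is called out above.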

Results and Analysis

The experiment scripts save summaries, statistics, trained models, summarized graphs, and predictions for each experiment they run in the directory specified in ./Experiments/config/run_config.yaml. Specifically, each run may output the following saved files for graph summarization:

And may output the following saved files for model training:

The experiment runs are defined by the parameters and configurations set in the ./Experiments/config/params.yaml and ./Experiments/config/run_config.yaml files, respectively. The parameters and configurations define the file path where the output files are saved, which allows cached graph summaries to be shared between experiments. For instance, depending on the specified base output directory, the results file path for the graph summarization information could be structured like

./results/linkPredictionGraphSummarization/<run_config>::<value>/.../<param>::<value>/.../train_graph

Furthermore, the results file path to the experiment information could be structured like

./results/linkPredictionGraphSummarization/<run_config>::<value>/.../<param>::<value>/...
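The `<name>::<value>` directory scheme above can be sketched as a small path-building helper. This is an assumption-laden illustration: the actual ordering and naming of components in the repository may differ.

```python
import os

def results_path(base_dir, run_config, params):
    """Build an output path from <name>::<value> directory components,
    mirroring the layout described above (hypothetical sketch)."""
    parts = [f"{key}::{value}" for key, value in run_config.items()]
    parts += [f"{key}::{value}" for key, value in params.items()]
    return os.path.join(base_dir, *parts)

# Illustrative configuration and parameter settings.
path = results_path(
    "./results/linkPredictionGraphSummarization",
    {"model": "GCN"},
    {"lr": 0.01},
)
```

Because the path is a deterministic function of the configuration and parameters, two experiments with the same summarization settings resolve to the same directory and can reuse the cached summary.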

Example scripts for parsing the results and notebooks for analyzing the results are provided in the ./Analysis directory. The ./Analysis/parse_results.py script will organize experiment summaries into a single results.csv file that includes the configurations, parameters, and results of each experiment that was run. Currently, ./Analysis/parse_results.py assumes experiment names that align with the task (node classification or link prediction) and shows how to extract parameters and configurations from the results directory path.
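Recovering settings from a results directory path amounts to splitting out the `<name>::<value>` components. The helper below is a minimal sketch in the spirit of ./Analysis/parse_results.py; the function name and return shape are assumptions.

```python
def parse_result_dir(path):
    """Recover configuration and parameter settings from a results
    directory path built from <name>::<value> components."""
    settings = {}
    for component in path.split("/"):
        if "::" in component:
            key, _, value = component.partition("::")
            settings[key] = value
    return settings
```

Rows of results.csv can then be assembled by pairing each parsed settings dictionary with the metrics found in the corresponding experiment summary.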

Datasets

Dataset loaders are provided in the ./Datasets directory. Every dataset implements the Dataset class defined in ./Datasets/Dataset.py, so there is a common interface for the experiment scripts to use. The datasets are organized by task type: link prediction and node classification. Link prediction datasets are found in the ./Datasets/LinkPrediction directory, while node classification datasets are found in the ./Datasets/NodeClassification directory. Many of the datasets require an internet connection to download.
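The common-interface pattern can be sketched with Python's abc module. The method name `load()` and the toy subclass below are illustrative assumptions, not the actual interface in ./Datasets/Dataset.py.

```python
from abc import ABC, abstractmethod

class Dataset(ABC):
    """Hypothetical sketch of a common dataset interface: every concrete
    dataset provides the same entry point for the experiment scripts."""

    @abstractmethod
    def load(self):
        """Download the data if necessary and return the graph."""

class ToyEdgeListDataset(Dataset):
    # Minimal in-memory example; real datasets would download and
    # preprocess their data here.
    def load(self):
        return [(0, 1), (0, 2), (1, 2)]
```

With this shape, an experiment script can accept any dataset object and call the same method regardless of task or source.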

Graph Summarizers

The implementations of the graph summarizers are found in the ./GraphSummarizers directory.

The abstract GraphSummarizer class is found in ./GraphSummarizers/GraphSummarizer.py. All graph summarizers extend this class; notably, each implements the GraphSummarizer.summarize() method for the experiment scripts to call. Graph summarizers are organized into one of two types: coarseners or samplers.
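The contract can be sketched as below. The summarize() signature and the toy subclass are assumptions for illustration; the actual class in ./GraphSummarizers/GraphSummarizer.py may differ.

```python
from abc import ABC, abstractmethod

class GraphSummarizer(ABC):
    """Hypothetical sketch of the abstract summarizer interface."""

    @abstractmethod
    def summarize(self, graph):
        """Return a smaller graph standing in for the original."""

class EvenNodeSampler(GraphSummarizer):
    # Toy sampler: keep only edges whose endpoints are both
    # even-numbered nodes, shrinking the edge list.
    def summarize(self, graph):
        return [(u, v) for (u, v) in graph if u % 2 == 0 and v % 2 == 0]
```

Because every summarizer exposes the same summarize() entry point, the experiment scripts can swap coarseners and samplers without changing the training loop.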

Coarseners will coarsen the original graph by merging nodes into supernodes. Coarseners are found in the ./GraphSummarizers/Coarsener directory. All coarseners extend the abstract Coarsener class defined in ./GraphSummarizers/Coarsener/Coarsener.py. Currently, the coarseners defined include:

Coarseners are further grouped by the task: node classification or link prediction. This is necessary as node classification tasks require different task data to be maintained during the coarsening process.

Samplers will sample either nodes or edges in the original graph. Samplers are found in the ./GraphSummarizers/Sampler directory. Currently, the samplers defined include:

Contact

Charles Dickens chrlsdkn@amazon.com