Scalable GNN training by graph summarization
This repository contains an implementation of the coarsening via convolution matching algorithm along with scripts for performing ablation studies and comparing the method to baselines. The motivation for this code is to experiment with the coarsening algorithm and demonstrate its utility for scalable graph neural network (GNN) training.
Requirements
The scripts were developed and run with the following packages installed:
boto3==1.24.88
botocore==1.27.88
deeprobust==0.2.5
dgl==0.9.1
gensim==3.8.3
networkx==2.8.4
numpy==1.23
ogb==1.3.4
PyGSP==0.5.1
PyYAML==6.0
pandas==1.4.3
scikit-learn==1.1.1
scipy==1.8.1
sklearn==0.0
sortedcontainers==2.4.0
torch==1.12.1
torch-scatter==2.0.9
torch-sparse==0.6.15
torch-geometric==2.1.0
tqdm==4.64.0
The requirements are also specified in requirements.txt.
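They can be installed, for example, with:
pip install -r requirements.txt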
Usage
This repository is designed for experimenting with the coarsening via convolution matching algorithm and comparing its performance with baseline algorithms for scalable GNN training via graph summarization.
Experiment Scripts, Configurations, and Parameters
The ./Experiments
directory contains the scripts and configuration files for running all the implemented
graph summarization algorithms and GNN models on all the datasets.
The code is structured such that experiments are categorized by the task the GNN is being trained
for: node classification or link prediction.
The configuration file ./Experiments/config/run_config.yaml is a YAML-formatted file that specifies the combinations of graph summarization methods, GNNs, losses, and optimization methods the experiment scripts will iterate over.
In this file, you may also specify the number of splits you wish to run and the location of the base output directory, i.e., where results will be saved.
The parameters of the graph summarization algorithms, GNN models, and optimizer are specified in the ./Experiments/config/params.yaml file.
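For illustration, the following is a minimal sketch of how these files could be read with PyYAML and expanded into a grid of runs. The top-level keys used here are hypothetical placeholders, not the actual schema expected by the experiment scripts:

# Sketch only: the real configuration schema is defined by the experiment scripts
# and ./Experiments/utils/config_utils.py.
import itertools
import yaml

with open("./Experiments/config/run_config.yaml") as f:
    run_config = yaml.safe_load(f)
with open("./Experiments/config/params.yaml") as f:
    params = yaml.safe_load(f)

# Hypothetical keys "summarizers" and "models"; every combination would be run.
for summarizer, model in itertools.product(
    run_config.get("summarizers", []), run_config.get("models", [])
):
    print(summarizer, model, params.get(model, {}))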
To run every combination of model and parameter specified in ./Experiments/config/run_config.yaml and ./Experiments/config/params.yaml for link prediction tasks, run:
python3 ./Experiments/link_prediction_graph_summarization.py
To run every combination of model and parameter specified in ./Experiments/config/run_config.yaml and ./Experiments/config/params.yaml for node classification tasks, run:
python3 ./Experiments/node_classification_graph_summarization.py
The experiment scripts use the utilities provided in the ./Experiments/utils directory.
Importantly, configurations are read and parsed using the utilities in ./Experiments/utils/config_utils.py.
If a new graph summarizer, model, dataset, or optimizer is implemented, then a mapping from its name in the ./Experiments/config/run_config.yaml file to its class must be added to the dictionaries in ./Experiments/utils/config_utils.py.
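For example, such a mapping might look like the sketch below; the dictionary name, keys, and import paths are hypothetical placeholders, not the actual contents of ./Experiments/utils/config_utils.py:

# Hypothetical sketch of a name-to-class dictionary like those in config_utils.py.
from GraphSummarizers.Sampler.RandomEdgeSampler import RandomEdgeSampler  # assumed path
from GraphSummarizers.Coarsener.MyNewCoarsener import MyNewCoarsener      # your new summarizer

SUMMARIZER_NAME_TO_CLASS = {
    "RandomEdgeSampler": RandomEdgeSampler,
    "MyNewCoarsener": MyNewCoarsener,  # new entries are added here
}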
Results and Analysis
The experiment scripts save summaries, statistics, trained models, summarized graphs, and predictions for each experiment they run in the directory specified in ./Experiments/config/run_config.yaml.
Precisely, each run may output the following saved files for graph summarization:
- graph_summarization_statistics.csv: This file reports the graph summarization time and various statistics computed on the original and summarized graphs.
- original_to_super_node_id.npy: This file is a numpy array mapping the original graph node ids to the supernodes they belong to in the coarsened graph.
- summarized_graph.bin: This file is the summarized DGL graph.
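These graph summarization outputs can be inspected with standard NumPy and DGL loading utilities, for example (the run directory below is illustrative):

# Sketch: inspect the saved graph summarization outputs of a single run.
import numpy as np
import dgl

out_dir = "./results/.../train_graph"  # replace with an actual run output directory
node_map = np.load(f"{out_dir}/original_to_super_node_id.npy")
graphs, _ = dgl.load_graphs(f"{out_dir}/summarized_graph.bin")
summarized_graph = graphs[0]
print(node_map.shape, summarized_graph.num_nodes(), summarized_graph.num_edges())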
Each run may also output the following saved files for model training:
- experiment_summary.csv: This file reports the graph summarization time, training time, and test and validation performances of the run.
- training_summary.csv: This file reports the final training loss, training time, max GPU memory used, and validation performance of training.
- training_convergence.csv: This file reports the training loss, training time, validation performance, and best validation performance for every compute period during training.
- <task_prefix>predictions.pt: This file is the saved test predictions of the run.
- trained_<model_type>_parameters.pt: This file is the final trained model parameters.
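Similarly, the training outputs of a run can be loaded with pandas and PyTorch (the run directory and task prefix are illustrative placeholders):

# Sketch: inspect the saved training outputs of a single run.
import pandas as pd
import torch

run_dir = "./results/..."  # replace with an actual run output directory
summary = pd.read_csv(f"{run_dir}/experiment_summary.csv")
convergence = pd.read_csv(f"{run_dir}/training_convergence.csv")
predictions = torch.load(f"{run_dir}/<task_prefix>predictions.pt")  # fill in the task prefix
print(summary)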
The experiment runs are defined by the parameters and configurations set in the
./Experiments/config/params.yaml
and ./Experiments/config/run_config.yaml
files, respectively.
The parameters and configurations define the file path where the output files are saved, which allows cached graph summaries to be shared between experiments.
For instance, depending on the specified base output directory,
the results file path to the graph summarization information could be structured like
./results/linkPredictionGraphSummarization/<run_config>::<value>/.../<param>::<value>/.../train_graph
Furthermore, the results file path to the experiment information could be structured like
./results/linkPredictionGraphSummarization/<run_config>::<value>/.../<param>::<value>/...
Example scripts for parsing the results and notebooks for analyzing the results are provided in the ./Analysis directory.
The ./Analysis/parse_results.py script will organize experiment summaries into a single results.csv file that includes the configurations, parameters, and results of each experiment that was run.
Currently, the ./Analysis/parse_results.py script assumes experiment names that align with the task (node classification or link prediction) and shows how to extract parameters and configurations from the results directory path.
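Once results.csv has been produced, it can be analyzed with standard tooling; the column names in the sketch below are illustrative and depend on the parsed configurations and parameters:

# Sketch: load the aggregated results produced by ./Analysis/parse_results.py.
import pandas as pd

results = pd.read_csv("results.csv")
print(results.head())
# Example (hypothetical column names): average a test metric per summarizer.
# print(results.groupby("summarizer")["test_metric"].mean())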
Datasets
Dataset loaders are provided in the ./Datasets directory.
Every dataset implements the Dataset class defined in ./Datasets/Dataset.py, so there is a common interface for the experiment scripts to use.
The datasets are organized by the task types: link prediction and node classification.
Link prediction datasets are found in the ./Datasets/LinkPrediction
directory while node classification datasets are
found in the ./Datasets/NodeClassification
directory.
Many of the datasets require an internet connection to download.
Graph Summarizers
The implementations of the graph summarizers are found in the ./GraphSummarizers directory.
The abstract GraphSummarizer class is found in ./GraphSummarizers/GraphSummarizer.py.
All graph summarizers extend this class; notably, each implements the GraphSummarizer.summarize() method for the experiment scripts to call.
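As an illustration, a new summarizer would extend GraphSummarizer and implement summarize(). The import path, class name, and signature below are hypothetical assumptions, not the actual interface defined in ./GraphSummarizers/GraphSummarizer.py:

# Hypothetical sketch; consult ./GraphSummarizers/GraphSummarizer.py for the
# real abstract interface and the arguments summarize() actually takes.
from GraphSummarizers.GraphSummarizer import GraphSummarizer  # assumed import path


class IdentitySummarizer(GraphSummarizer):
    """Trivial example summarizer that returns the graph unchanged."""

    def summarize(self, graph):
        # A real summarizer returns a smaller graph standing in for the original.
        return graph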
Graph summarizers are organized into two types: Coarseners and Samplers.
Coarseners will coarsen the original graph by merging nodes into supernodes.
Coarseners are found in the ./GraphSummarizers/Coarsener
directory.
All coarseners extend the abstract Coarsener class defined in ./GraphSummarizers/Coarsener/Coarsener.py.
Currently, the coarseners defined include:
- ConvolutionMatching<initial_node_pairing_method>: These coarseners extend the ConvolutionMatchingCoarsener class. ConvolutionMatchingCoarseners iteratively merge pairs of nodes into supernodes so as to minimize the convolution matching loss (see the conceptual sketch after this list).
  - The <initial_node_pairing_method> defines the way the initial set of node pairs is generated for the coarsener.
- ApproximateConvolutionMatching<initial_node_pairing_method>: These coarseners extend the ApproximateConvolutionMatchingCoarsener class. ApproximateConvolutionMatchingCoarseners iteratively merge pairs of nodes into supernodes so as to minimize an approximation of the convolution matching loss.
  - The <initial_node_pairing_method> defines the way the initial set of node pairs is generated for the coarsener.
- VariationNeighborhoods: This coarsener is a graph summarizer from the paper "Scaling Up Graph Neural Networks Via Graph Coarsening" by Zengfeng Huang, Shengzhong Zhang, Chong Xi, Tang Liu, and Min Zhou, 2021.
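For intuition, the following is a conceptual sketch of the convolution matching idea, not the repository's implementation: it compares the output of a simple one-step graph convolution on the original graph with the corresponding output on a coarsened graph, using the original-node-to-supernode mapping. The function names, the choice of mean aggregation, and the L2 discrepancy are illustrative assumptions.

# Conceptual sketch only; the actual coarseners' loss and implementation may differ.
import torch
import dgl
import dgl.function as fn


def one_step_convolution(graph: dgl.DGLGraph, feats: torch.Tensor) -> torch.Tensor:
    # Mean-neighbor aggregation as a simple stand-in for a GNN convolution.
    with graph.local_scope():
        graph.ndata["h"] = feats
        graph.update_all(fn.copy_u("h", "m"), fn.mean("m", "h"))
        return graph.ndata["h"]


def matching_discrepancy(original_graph, coarse_graph, original_feats, coarse_feats, node_to_super):
    # L2 distance between each original node's convolved features and those of
    # its supernode; coarseners in this spirit try to keep this quantity small.
    h_original = one_step_convolution(original_graph, original_feats)
    h_coarse = one_step_convolution(coarse_graph, coarse_feats)
    return torch.norm(h_original - h_coarse[node_to_super], dim=1).mean()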
Coarseners are further grouped by the task: node classification or link prediction. This is necessary as node classification tasks require different task data to be maintained during the coarsening process.
Samplers will sample either nodes or edges in the original graph.
Samplers are found in the ./GraphSummarizers/Sampler
directory.
Currently, the samplers defined include:
- RandomEdgeSampler: This sampler randomly samples edges to include in the summarized graph.
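As a minimal illustration of the idea (not the repository's RandomEdgeSampler implementation; the function name and keep ratio are assumptions), a uniform edge sample can be taken with DGL as follows:

# Illustrative sketch of uniform random edge sampling with DGL; the actual
# RandomEdgeSampler class may differ in interface and behavior.
import torch
import dgl


def random_edge_subgraph(graph: dgl.DGLGraph, keep_ratio: float = 0.5) -> dgl.DGLGraph:
    num_keep = int(graph.num_edges() * keep_ratio)
    eids = torch.randperm(graph.num_edges())[:num_keep]
    # relabel_nodes=False keeps every original node and drops only edges.
    return dgl.edge_subgraph(graph, eids, relabel_nodes=False)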
Contact
Charles Dickens chrlsdkn@amazon.com