Scalable GNN training by graph summarization
This repository contains an implementation of the coarsening via convolution matching algorithm along with scripts for performing ablation studies and comparing the method to baselines. The motivation for this code is to experiment with the coarsening algorithm and demonstrate its utility for scalable graph neural network (GNN) training.
Requirements
The scripts were developed and run with the following packages installed:
boto3==1.24.88
botocore==1.27.88
deeprobust==0.2.5
dgl==0.9.1
gensim==3.8.3
networkx==2.8.4
numpy==1.23
ogb==1.3.4
PyGSP==0.5.1
PyYAML==6.0
pandas==1.4.3
scikit-learn==1.1.1
scipy==1.8.1
sklearn==0.0
sortedcontainers==2.4.0
torch==1.12.1
torch-scatter==2.0.9
torch-sparse==0.6.15
torch-geometric==2.1.0
tqdm==4.64.0
The requirements are also specified in requirements.txt.
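They can be installed, for example, with:
pip install -r requirements.txt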
Usage
This repository is designed for experimenting with the coarsening via convolution matching algorithm and comparing its performance with baseline algorithms for scalable GNN training via graph summarization.
Experiment Scripts, Configurations, and Parameters
The ./Experiments
directory contains the scripts and configuration files for running all the implemented
graph summarization algorithms and GNN models on all the datasets.
The code is structured such that experiments are categorized by the task the GNN is being trained
for: node classification or link prediction.
The configuration file ./Experiments/config/run_config.yaml is a YAML-formatted file that specifies the combinations of graph summarization methods, GNNs, losses, and optimization methods the experiment scripts will iterate over.
In this file, you may also specify the number of splits you wish to run and the location of the base output directory, i.e., where results will be saved.
The parameters of the graph summarization algorithms, GNN models, and optimizer are specified in the ./Experiments/config/params.yaml file.
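For illustration, the following is a minimal sketch of how these files could be read with PyYAML and expanded into a grid of runs. The top-level keys used here are hypothetical placeholders, not the actual schema expected by the experiment scripts:

# Sketch only: the real configuration schema is defined by the experiment scripts
# and ./Experiments/utils/config_utils.py.
import itertools
import yaml

with open("./Experiments/config/run_config.yaml") as f:
    run_config = yaml.safe_load(f)
with open("./Experiments/config/params.yaml") as f:
    params = yaml.safe_load(f)

# Hypothetical keys "summarizers" and "models"; every combination would be run.
for summarizer, model in itertools.product(
    run_config.get("summarizers", []), run_config.get("models", [])
):
    print(summarizer, model, params.get(model, {}))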
To run every combination of model and parameter specified in ./Experiments/config/run_config.yaml and ./Experiments/config/params.yaml for link prediction tasks, run:
python3 ./Experiments/link_prediction_graph_summarization.py
To run every combination of model and parameter specified in ./Experiments/config/run_config.yaml and ./Experiments/config/params.yaml for node classification tasks, run:
python3 ./Experiments/node_classification_graph_summarization.py
The experiment scripts use the utilities provided in the ./Experiments/utils directory.
Importantly, configurations are read and parsed using the utilities in ./Experiments/utils/config_utils.py.
If a new graph summarizer, model, dataset, or optimizer is implemented, then a mapping from its name in the ./Experiments/config/run_config.yaml file to its class must be added to the dictionaries in ./Experiments/utils/config_utils.py.
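For example, such a mapping might look like the sketch below; the dictionary name, keys, and import paths are hypothetical placeholders, not the actual contents of ./Experiments/utils/config_utils.py:

# Hypothetical sketch of a name-to-class dictionary like those in config_utils.py.
from GraphSummarizers.Sampler.RandomEdgeSampler import RandomEdgeSampler  # assumed path
from GraphSummarizers.Coarsener.MyNewCoarsener import MyNewCoarsener      # your new summarizer

SUMMARIZER_NAME_TO_CLASS = {
    "RandomEdgeSampler": RandomEdgeSampler,
    "MyNewCoarsener": MyNewCoarsener,  # new entries are added here
}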
Results and Analysis
The experiment scripts save summaries, statistics, trained models, summarized graphs, and predictions for each experiment they run in the directory specified in ./Experiments/config/run_config.yaml.
Precisely, each run may output the following saved files for graph summarization:
- graph_summarization_statistics.csv: This file reports the graph summarization time and various statistics computed on the original and summarized graphs.
- original_to_super_node_id.npy: This file is a numpy array mapping the original graph node ids to the supernodes they belong to in the coarsened graph.
- summarized_graph.bin: This file is the summarized DGL graph.
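These graph summarization outputs can be inspected with standard NumPy and DGL loading utilities, for example (the run directory below is illustrative):

# Sketch: inspect the saved graph summarization outputs of a single run.
import numpy as np
import dgl

out_dir = "./results/.../train_graph"  # replace with an actual run output directory
node_map = np.load(f"{out_dir}/original_to_super_node_id.npy")
graphs, _ = dgl.load_graphs(f"{out_dir}/summarized_graph.bin")
summarized_graph = graphs[0]
print(node_map.shape, summarized_graph.num_nodes(), summarized_graph.num_edges())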
Each run may also output the following saved files for model training:
- experiment_summary.csv: This file reports the graph summarization time, training time, and test and validation performances of the run.
- training_summary.csv: This file reports the final training loss, training time, max GPU memory used, and validation performance of training.
- training_convergence.csv: This file reports the training loss, training time, validation performance, and best validation performance for every compute period during training.
- <task_prefix>predictions.pt: This file is the saved test predictions of the run.
- trained_<model_type>_parameters.pt: This file is the final trained model parameters.
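Similarly, the training outputs of a run can be loaded with pandas and PyTorch (the run directory and task prefix are illustrative placeholders):

# Sketch: inspect the saved training outputs of a single run.
import pandas as pd
import torch

run_dir = "./results/..."  # replace with an actual run output directory
summary = pd.read_csv(f"{run_dir}/experiment_summary.csv")
convergence = pd.read_csv(f"{run_dir}/training_convergence.csv")
predictions = torch.load(f"{run_dir}/<task_prefix>predictions.pt")  # fill in the task prefix
print(summary)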
The experiment runs are defined by the parameters and configurations set in the
./Experiments/config/params.yaml
and ./Experiments/config/run_config.yaml
files, respectively.
The parameters and configurations define the file path where the output files are saved, which allows cached graph summaries to be shared between experiments.
For instance, depending on the specified base output directory,
the results file path to the graph summarization information could be structured like
./results/linkPredictionGraphSummarization/<run_config>::<value>/.../<param>::<value>/.../train_graph
Furthermore, the results file path to the experiment information could be structured like
./results/linkPredictionGraphSummarization/<run_config>::<value>/.../<param>::<value>/...
Example scripts for parsing the results and notebooks for analyzing the results are provided in the ./Analysis directory.
The ./Analysis/parse_results.py script will organize experiment summaries into a single results.csv file that includes the configurations, parameters, and results of each experiment that was run.
Currently, the ./Analysis/parse_results.py script assumes experiment names that align with the task (node classification or link prediction) and shows how to extract parameters and configurations from the results directory path.
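Once results.csv has been produced, it can be analyzed with standard tooling; the column names in the sketch below are illustrative and depend on the parsed configurations and parameters:

# Sketch: load the aggregated results produced by ./Analysis/parse_results.py.
import pandas as pd

results = pd.read_csv("results.csv")
print(results.head())
# Example (hypothetical column names): average a test metric per summarizer.
# print(results.groupby("summarizer")["test_metric"].mean())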
Datasets
Dataset loaders are provided in the ./Datasets directory.
Every dataset implements the Dataset class defined in ./Datasets/Dataset.py, so there is a common interface for the experiment scripts to use.
The datasets are organized by the task types: link prediction and node classification.
Link prediction datasets are found in the ./Datasets/LinkPrediction
directory while node classification datasets are
found in the ./Datasets/NodeClassification
directory.
Many of the datasets require an internet connection to download.
Graph Summarizers
The implementations of the graph summarizers are found in the ./GraphSummarizers directory.
The abstract GraphSummarizer class is found in ./GraphSummarizers/GraphSummarizer.py.
All graph summarizers extend this class; notably, each implements the GraphSummarizer.summarize() method for the experiment scripts to call.
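As an illustration, a new summarizer would extend GraphSummarizer and implement summarize(). The import path, class name, and signature below are hypothetical assumptions, not the actual interface defined in ./GraphSummarizers/GraphSummarizer.py:

# Hypothetical sketch; consult ./GraphSummarizers/GraphSummarizer.py for the
# real abstract interface and the arguments summarize() actually takes.
from GraphSummarizers.GraphSummarizer import GraphSummarizer  # assumed import path


class IdentitySummarizer(GraphSummarizer):
    """Trivial example summarizer that returns the graph unchanged."""

    def summarize(self, graph):
        # A real summarizer returns a smaller graph standing in for the original.
        return graph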
Graph summarizers are organized into two types: Coarseners and Samplers.
Coarseners will coarsen the original graph by merging nodes into supernodes.
Coarseners are found in the ./GraphSummarizers/Coarsener
directory.
All coarseners extend the abstract Coarsener class defined in ./GraphSummarizers/Coarsener/Coarsener.py.
Currently, the coarseners defined include:
- ConvolutionMatching<initial_node_pairing_method>: These coarseners extend the ConvolutionMatchingCoarsener class. ConvolutionMatchingCoarseners iteratively merge pairs of nodes into supernodes so as to minimize the convolution matching loss (see the conceptual sketch after this list).
  - The <initial_node_pairing_method> defines the way the initial set of node pairs is generated for the coarsener.
- ApproximateConvolutionMatching<initial_node_pairing_method>: These coarseners extend the ApproximateConvolutionMatchingCoarsener class. ApproximateConvolutionMatchingCoarseners iteratively merge pairs of nodes into supernodes so as to minimize an approximation of the convolution matching loss.
  - The <initial_node_pairing_method> defines the way the initial set of node pairs is generated for the coarsener.
- VariationNeighborhoods: This coarsener is a graph summarizer from the paper "Scaling Up Graph Neural Networks Via Graph Coarsening" by Zengfeng Huang, Shengzhong Zhang, Chong Xi, Tang Liu, and Min Zhou, 2021.
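For intuition, the following is a conceptual sketch of the convolution matching idea, not the repository's implementation: it compares the output of a simple one-step graph convolution on the original graph with the corresponding output on a coarsened graph, using the original-node-to-supernode mapping. The function names, the choice of mean aggregation, and the L2 discrepancy are illustrative assumptions.

# Conceptual sketch only; the actual coarseners' loss and implementation may differ.
import torch
import dgl
import dgl.function as fn


def one_step_convolution(graph: dgl.DGLGraph, feats: torch.Tensor) -> torch.Tensor:
    # Mean-neighbor aggregation as a simple stand-in for a GNN convolution.
    with graph.local_scope():
        graph.ndata["h"] = feats
        graph.update_all(fn.copy_u("h", "m"), fn.mean("m", "h"))
        return graph.ndata["h"]


def matching_discrepancy(original_graph, coarse_graph, original_feats, coarse_feats, node_to_super):
    # L2 distance between each original node's convolved features and those of
    # its supernode; coarseners in this spirit try to keep this quantity small.
    h_original = one_step_convolution(original_graph, original_feats)
    h_coarse = one_step_convolution(coarse_graph, coarse_feats)
    return torch.norm(h_original - h_coarse[node_to_super], dim=1).mean()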
Coarseners are further grouped by the task: node classification or link prediction. This is necessary as node classification tasks require different task data to be maintained during the coarsening process.
Samplers will sample either nodes or edges in the original graph.
Samplers are found in the ./GraphSummarizers/Sampler
directory.
Currently, the samplers defined include:
- RandomEdgeSampler: This sampler randomly samples edges to include in the summarized graph.
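As a minimal illustration of the idea (not the repository's RandomEdgeSampler implementation; the function name and keep ratio are assumptions), a uniform edge sample can be taken with DGL as follows:

# Illustrative sketch of uniform random edge sampling with DGL; the actual
# RandomEdgeSampler class may differ in interface and behavior.
import torch
import dgl


def random_edge_subgraph(graph: dgl.DGLGraph, keep_ratio: float = 0.5) -> dgl.DGLGraph:
    num_keep = int(graph.num_edges() * keep_ratio)
    eids = torch.randperm(graph.num_edges())[:num_keep]
    # relabel_nodes=False keeps every original node and drops only edges.
    return dgl.edge_subgraph(graph, eids, relabel_nodes=False)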
Contact
Charles Dickens chrlsdkn@amazon.com