# OpenGCL
OpenGCL is an open-source toolkit that implements our modularized graph contrastive learning (GCL) framework. Users can combine different encoders (with readout functions), discriminators, estimators, and samplers from a single command line, which helps them find well-performing combinations for different tasks and datasets.
## Get started
### Prerequisites
Install the following packages beforehand:

- Python 3.7 or above;
- PyTorch 1.7.0 or above (follow the official installation guide);
- PyTorch Geometric (follow the official installation guide);
- other requirements in `requirements.txt` (see the command below).
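With the PyTorch stack in place, the remaining dependencies can be installed in one step (assuming `requirements.txt` sits at the repository root):

```bash
pip install -r requirements.txt
```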
### Introduction
For the simplest use of OpenGCL, you might want to reconstruct a baseline model under our framework. Make sure your working directory is `OpenGCL/src`. Try the following command as an example:
```bash
python -m opengcl --task node --dataset cora --enc gcn --dec inner --est jsd --sampler snr --output-dim 64 --hidden-size 2 --learning-rate 0.01 --epochs 500 --early-stopping 20 --patience 3 --clf-ratio 0.2
```
This trains a GAE model on Cora and performs node classification. The output should look like this:
```text
[main] (2.2810919284820557s) Loading dataset...
[datasets] Loading Cora Dataset from root dir: .../OpenGCL/data/Cora
[datasets] Downloading dataloaders "Cora" from "https://github.com/kimiyoung/planetoid/raw/master/data".
[datasets] Files will be saved to "../data/Cora".
[downloader] Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.x
...
[downloader] Downloading https://github.com/kimiyoung/planetoid/raw/master/data/ind.cora.test.index
[datasets] Dataset successfully downloaded.
[datasets] This is a medium-sized sparse dataset with 1 graph, 2708 nodes and 10556 edges.
[main] (23.69194722175598s) Splitting dataset...
[main] (23.72783327102661s) Dataset loaded and split.
[main] (23.727877616882324s) Start building GCL framework...
[GCL] Building framework... done in 0.19335460662841797s.
[GCL] Start training for 500 epochs...
[GCL] epoch 5: loss: 1.28604; time used = 1.39682936668396s
...
[GCL] epoch 180: loss: 0.87510; time used = 0.9593579769134521s
[GCL] Early stopping condition satisfied. Abort training.
[GCL] Framework training finished in 37.25677442550659s.
[main] (61.178184032440186s) GCL framework built and trained.
[main] (61.17819690704346s) Start classification...
[main] (66.95774221420288s) Classification finished.
[main] F1-score: micro = 0.7563451776649746.
```
After the task finishes, you should find a log file in `OpenGCL/logs`, named `YYYYmmdd_HHMMSS_X_cora_node.txt`, which contains something similar to:

```text
python3 -m opengcl --task node --dataset cora --enc gcn --dec inner --est jsd --sampler snr --output-dim 64 --hidden-size 2 --learning-rate 0.01 --epochs 500 --early-stopping 20 --patience 3 --clf-ratio 0.2
0.7563451776649746
```
## Usage
In `OpenGCL/src`, run OpenGCL with the following command:

```bash
python -m opengcl --task TASK --dataset DATASET {MODULE_PARAMS} {HYPER_PARAMS}
```
with

```text
{MODULE_PARAMS} := --enc ENCODER [--readout READOUT] --dec DISCRIMINATOR --est ESTIMATOR --sampler SAMPLER
{HYPER_PARAMS}  := [--output-dim INTEGER] [--hidden-size INTEGER] [--dropout FLOAT] [--batch-size INTEGER] [--learning-rate FLOAT] [--epochs INTEGER] [--early-stopping INTEGER] [--patience INTEGER] [--clf-ratio FLOAT]
```
We will discuss each parameter in detail in the following subsections.
### Task
Choose one downstream task for GCL: `node` for node classification, or `graph` for graph classification.

**Range**

- `--task {node, graph}`: the two classification tasks
### Dataset
Choose one dataset for GCL. We provide 13 datasets: 8 single-graph and 5 multi-graph.

**Range**

- `--dataset {cora, citeseer, pubmed, amazon_computers, amazon_photo, coauthor_cs, coauthor_phy, wikics}`: single-graph datasets
- `--dataset {reddit_binary, imdb_binary, imdb_multi, mutag, ptc_mr}`: multi-graph datasets
### Module parameters
Select the encoder, discriminator, estimator and sampler that make up the model.
#### Encoder
Choose a GNN encoder (`gcn`, `gin` or `gat`), the MLP encoder (`linear`), or a lookup table (`none`).

Encoders are one or more consecutive layers that translate node features into node embedding vectors. The lookup table, of course, has only one layer.

**Range**

- `--enc {gcn, gin, gat, linear, none}`

**Related parameters**

- `--output-dim`: output dimension of each layer.
- `--hidden-size`: number of hidden layers. The total number of layers is `HIDDEN_SIZE + 1`: the input layer has dimensions `(feature_dim, output_dim)`, while all following layers have dimensions `(output_dim, output_dim)`, as sketched below.
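To make the layer-dimension rule concrete, here is a minimal sketch; the helper below is illustrative only, not part of OpenGCL:

```python
# Hypothetical helper: derive the per-layer dimensions from
# --output-dim and --hidden-size as described above.
def encoder_layer_dims(feature_dim: int, output_dim: int, hidden_size: int):
    """Return the (in, out) dimension pairs of the HIDDEN_SIZE + 1 layers."""
    dims = [(feature_dim, output_dim)]                 # input layer
    dims += [(output_dim, output_dim)] * hidden_size   # hidden layers
    return dims

# e.g. Cora's 1433-dim features with --output-dim 64 --hidden-size 2:
# encoder_layer_dims(1433, 64, 2) == [(1433, 64), (64, 64), (64, 64)]
```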
#### Readout
To get a graph embedding, node vectors pass through a readout function. Choose from `sum` pooling, `mean` pooling or `jk-net` (a sketch of the pooling modes follows the parameter lists below).

**Range**

- `[--readout {sum, mean, jk-net}]`: optional; not needed (and ignored) in some settings. The default value is `mean`.

**Related parameters**

- `--task`: for graph classification, a readout function is needed to obtain graph embeddings.
- `--sampler`: samplers that require a global view also need graph embeddings.
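For intuition, `sum` and `mean` pooling amount to the following sketch (assuming the node embeddings of a single graph are stacked in one dense tensor; `jk-net`, which combines representations across encoder layers, is omitted here):

```python
import torch

def readout(node_emb: torch.Tensor, mode: str = "mean") -> torch.Tensor:
    """Collapse node embeddings of one graph, shape (num_nodes, dim),
    into a single graph embedding of shape (dim,)."""
    if mode == "sum":
        return node_emb.sum(dim=0)
    if mode == "mean":
        return node_emb.mean(dim=0)
    raise ValueError(f"unsupported readout: {mode}")
```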
#### Discriminator
Choose one discriminator: inner product (`inner`) or bilinear product (`bilinear`).

Discriminators calculate likelihood scores for sample pairs.

**Range**

- `--dec {inner, bilinear}`
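In essence, the two choices score a pair of embeddings as follows; this is a generic sketch of the two products, not OpenGCL's actual code:

```python
import torch
import torch.nn as nn

def inner_score(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """Inner-product likelihood for a batch of (anchor, sample) pairs."""
    return (z_a * z_b).sum(dim=-1)

class BilinearScore(nn.Module):
    """Bilinear likelihood z_a^T W z_b with a learnable matrix W."""
    def __init__(self, dim: int):
        super().__init__()
        self.bilinear = nn.Bilinear(dim, dim, 1)

    def forward(self, z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
        return self.bilinear(z_a, z_b).squeeze(-1)
```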
#### Estimator
Choose one estimator: the Jensen-Shannon divergence (`jsd`) or InfoNCE (`nce`) loss.

Estimators calculate the loss from the likelihood scores of (anchor, positive) and (anchor, negative) pairs.

**Range**

- `--est {jsd, nce}`
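One common formulation of the two losses over batches of likelihood scores looks like this (a sketch of the standard estimators, which may differ in detail from OpenGCL's implementation):

```python
import torch
import torch.nn.functional as F

def jsd_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """Jensen-Shannon estimator: raise positive scores, lower negative ones."""
    return F.softplus(-pos_scores).mean() + F.softplus(neg_scores).mean()

def nce_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """InfoNCE: softmax cross-entropy of each anchor's positive score
    against its row of negative scores, shape (batch, num_negatives)."""
    logits = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1)
    labels = torch.zeros(logits.size(0), dtype=torch.long)
    return F.cross_entropy(logits, labels)
```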
#### Sampler
Choose from the single-view samplers (the `s_`-prefixed samplers) or the multi-view samplers (`dgi`, `mvgrl`, `gca`, `graphcl`).

The samplers are described below; the view(s) of the (anchor, context sample) pairs are listed in the 'View(s)' column.
| Sampler | Source | Description | View(s) |
|---|---|---|---|
| `snr` | LINE, GAE | anc, pos: neighbors; neg: random sample | single; local-local |
| `srr` | DeepWalk | anc, pos: random walk; neg: random sample | single; local-local |
| `dgi` | DGI, InfoGraph | anc: graph G; pos: nodes in G; neg: nodes in other graphs | multi; global-local |
| `mvgrl` | MVGRL | anc: graph; pos: nodes in diffused G; neg: nodes in other diffused graphs | multi; global-local; augmentative |
| `gca` | GCA | two augmentations G1, G2; anc, pos: corresponding nodes in G1, G2; neg: all other nodes | multi; local-local; augmentative |
| `graphcl` | GraphCL | two augmentations G1, G2; anc, pos: G1, G2; neg: another graph | multi; global-global; augmentative |
**Range**

- `--sampler {snr, srr, dgi, mvgrl, gca, graphcl}`
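As an example of the table's notation, an `snr`-style draw could be sketched as below (assuming a PyG-style `edge_index` of shape `(2, num_edges)`; illustrative only, not OpenGCL's actual sampler):

```python
import torch

def sample_snr(edge_index: torch.Tensor, num_nodes: int, batch_size: int):
    """(anchor, positive) pairs are the endpoints of randomly chosen edges
    (i.e. neighbors); negatives are nodes drawn uniformly at random."""
    idx = torch.randint(edge_index.size(1), (batch_size,))
    anchors, positives = edge_index[0, idx], edge_index[1, idx]
    negatives = torch.randint(num_nodes, (batch_size,))
    return anchors, positives, negatives
```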
### Hyperparameters
Hyperparameters are structural parameters of the model or parameters of the training process. All of them are optional and have default values. The table below lists each hyperparameter with its meaning, type, range and default value.
| Parameter | Meaning | Type | Range | Default |
|---|---|---|---|---|
| `--output-dim` | output dimension of embeddings | structural | positive integer | 64 |
| `--hidden-size` | number of hidden layers | structural | natural number | 0 |
| `--dropout` | dropout rate during training | training | float in [0, 1) | 0. |
| `--batch-size` | number of samples in each batch | training | positive integer | 4096 |
| `--learning-rate` | learning rate | training | positive float | 0.01 |
| `--epochs` | number of training epochs | training | natural number | 500 |
| `--early-stopping` | minimum number of epochs before early stopping | training | natural number | 20 |
| `--patience` | stop early after PATIENCE consecutive epochs of loss growth | training | natural number | 3 |
| `--clf-ratio` | ratio of the dataset used to train the classifier | training | float in (0, 1) | 0.5 |
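The interplay of `--early-stopping` and `--patience` can be read as follows; this is an illustrative restatement of the rule above, and the exact check in OpenGCL may differ:

```python
def should_stop(losses, early_stopping: int = 20, patience: int = 3) -> bool:
    """Never stop before EARLY_STOPPING epochs have run; afterwards,
    stop once the loss has grown for PATIENCE consecutive epochs."""
    if len(losses) < early_stopping:
        return False
    recent = losses[-(patience + 1):]
    return (len(recent) == patience + 1
            and all(recent[i] < recent[i + 1] for i in range(patience)))
```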
## Baselines
We list example commands for baseline models here. We use Cora for node classification and MUTAG for graph classification.
| Name | Task | Command |
|---|---|---|
| GAE | node | `python -m opengcl --task node --dataset cora --enc gcn --dec inner --est jsd --sampler snr --epochs 500 --clf-ratio 0.2` |
| GCA | node | `python -m opengcl --task node --dataset cora --enc gcn --dec inner --est nce --sampler gca --epochs 500 --clf-ratio 0.2` |
| DGI | node | `python -m opengcl --task node --dataset cora --enc gcn --readout mean --dec bilinear --est jsd --sampler dgi --epochs 500 --clf-ratio 0.2` |
| DGI | graph | `python -m opengcl --task graph --dataset mutag --enc gcn --readout mean --dec bilinear --est jsd --sampler dgi --epochs 500 --clf-ratio 0.8` |
| MVGRL | node | `python -m opengcl --task node --dataset cora --enc gcn --readout sum --dec inner --est jsd --sampler mvgrl --epochs 500 --clf-ratio 0.2` |
| MVGRL | graph | `python -m opengcl --task graph --dataset mutag --enc gcn --readout sum --dec inner --est jsd --sampler mvgrl --epochs 500 --clf-ratio 0.8` |
| InfoGraph | graph | `python -m opengcl --task graph --dataset mutag --enc gin --readout sum --dec inner --est jsd --sampler dgi --epochs 500 --clf-ratio 0.8` |
| GraphCL | graph | `python -m opengcl --task graph --dataset mutag --enc gin --readout sum --dec inner --est nce --sampler graphcl --epochs 500 --clf-ratio 0.8` |