HERI-GCN

Description

Our HERI-GCN is implemented mainly on top of several open-source libraries; see the README file in the source code folder for the full list and more details.

File Tree

Project file structure and description:

HERI-GCN
├─ README.md
├─ __init__.py
├─ dataloading	# package of dataloading
│    ├─ __init__.py
│    ├─ datamodule.py	
│    └─ dataset	# package of dataset class
│           ├─ WeiboTopicDataset.py	# processes data in CSV format
│           └─ __init__.py
├─ hooks.py	# hooks of model
├─ model	# package of models (HERI-GCN and its variants)
│    ├─ PopularityPredictor.py
│    └─ __init__.py
├─ nn	# package of neural network layers
│    ├─ __init__.py
│    ├─ conv.py	# graph convolution layer
│    └─ readout.py	# output layer
├─ requirements.txt	
├─ run.py	# running entrance
└─ utils	# utils for drawing, dataloading, tensor and parameter processing
       ├─ __init__.py
       ├─ arg_parse.py
       ├─ dataloading.py
       ├─ drawing.py # heterogeneous graph drawing
       ├─ output.py
       └─ utils.py

Models

The model package contains three models:

| Model | Input Relations | Computation |
| --- | --- | --- |
| BasePopularityPredictor<br>(HERI-GCN-UG) | (user, repost, user)<br>(user, follow, user) | Heterogeneous GCN |
| TimeGNNPopularityPredictor<br>(HERI-GCN-TG) | (user, repost, user)<br>(user, follow, user)<br>(user, post at, time)<br>(time, contain, user)<br>(time, past to, time) | Heterogeneous GCN |
| TimeRNNPopularityPredictor<br>(HERI-GCN) | (user, repost, user)<br>(user, follow, user)<br>(user, post at, time)<br>(time, contain, user)<br>(time, past to, time) | Heterogeneous GCN<br>Integrated RNN |

BasePopularityPredictor is the basic predictor model; TimeGNNPopularityPredictor inherits from BasePopularityPredictor, and TimeRNNPopularityPredictor inherits from TimeGNNPopularityPredictor.
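The inheritance chain and the relations each model consumes can be sketched as follows. These are illustrative stubs only, not the actual implementation in model/PopularityPredictor.py; the `relations` method is hypothetical and exists just to show how each subclass extends its parent's input relations:

```python
# Illustrative stubs of the model hierarchy (not the real implementation).
class BasePopularityPredictor:
    """HERI-GCN-UG: heterogeneous GCN over user-user relations only."""
    def relations(self):
        return [("user", "repost", "user"), ("user", "follow", "user")]

class TimeGNNPopularityPredictor(BasePopularityPredictor):
    """HERI-GCN-TG: adds time nodes and user-time relations."""
    def relations(self):
        return super().relations() + [
            ("user", "post at", "time"),
            ("time", "contain", "user"),
            ("time", "past to", "time"),
        ]

class TimeRNNPopularityPredictor(TimeGNNPopularityPredictor):
    """Full HERI-GCN: same input relations, plus an RNN integrated
    into the heterogeneous GCN."""
```

This mirrors the table above: each subclass keeps its parent's relations and only adds the time-related ones, so the full model sees all five relation types.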

Installation

Installation requirements are described in requirements.txt.

Usage

Get help for all parameters of data processing, training, and optimization:

python run.py  --help

Run:

python run.py --parameter value
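For example, a run that makes the model, dataset, and observation window explicit might look like the following; the parameter names and values are taken from the settings tables in the next section, and this exact combination is only an illustration:

```shell
# Train the full HERI-GCN on the Weibo topic dataset with a 24-hour
# observation window; flag parameters such as --auto_lr_find take no value.
python run.py --model TimeRGNN --data_name topic --time_window 24 --auto_lr_find
```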

Experiment Settings

Flag parameters are boolean: when omitted they keep their default value, and when passed no explicit value is necessary.
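This is the usual argparse `store_true` behavior. A minimal sketch (the project's actual parser lives in utils/arg_parse.py, whose internals are not shown here, so this is only an assumed equivalent using one flag from the tables below):

```python
import argparse

# Minimal sketch of a boolean flag: absent -> default False, present -> True.
parser = argparse.ArgumentParser()
parser.add_argument("--rnn_bidirectional", action="store_true",
                    help="Make the RNN bidirectional.")

absent = parser.parse_args([])                      # flag not passed
present = parser.parse_args(["--rnn_bidirectional"])  # flag passed, no value
```

Passing `--rnn_bidirectional true` would fail: a `store_true` flag accepts no argument.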

The settings of the ==highlighted== parameters are analyzed in detail in the following section.

Basic settings (common to all experiments; we recommend using auto_lr_find to optimize the learning rate):

| Parameter | Value | Description |
| --- | --- | --- |
| readout_use | all | Use both the time feature and the user feature for output, or just one of them. |
| in_feats | 16 | Input feature dimension. |
| hid_feats | 32 | Hidden feature dimension. |
| out_feats | 64 | Output feature dimension. |
| dropout_rate | 0.3 | Dropout rate. |
| rnn_feats | 32 | RNN feature dimension. |
| batch_size | 4 (8 for Twitter) | Batch size. |
| learning_rate | 5e-3 | Learning rate. |
| weight_decay | 5e-3 | Weight of L2 regularization. |
| gcn_layers | 3 | Number of heterogeneous GCN layers. |
| rnn_layers | 2 | Number of RNN layers. |
| rnn_bidirectional | True (flag) | Whether the RNN is bidirectional. |
| rnn | gru | Type of RNN module (GRU or LSTM). |
| time_nodes | 50 | Number of time nodes added to the heterogeneous graph. |
| split | time | Split the user sequence into intervals by user count or by time. |
| hop | 1 | Hops of follower sampling. |
| patience | 20 | Patience of training early stopping. |
| readout_weighted_sum | True (flag) | Use the weighted sum (rather than the product) of the user feature and the time feature for output. |
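To illustrate what readout_weighted_sum toggles, here is a plain-Python sketch of combining a user feature vector with a time feature vector. The real readout in nn/readout.py operates on learned tensors; the function name and the weight `w` here are hypothetical:

```python
def combine_features(user_feat, time_feat, weighted_sum=True, w=0.5):
    """Combine user and time features element-wise: a weighted sum when
    the flag is set, an element-wise product otherwise (illustrative)."""
    if weighted_sum:
        return [w * u + (1 - w) * t for u, t in zip(user_feat, time_feat)]
    return [u * t for u, t in zip(user_feat, time_feat)]

summed = combine_features([2.0, 4.0], [0.0, 2.0])                      # weighted sum
product = combine_features([2.0, 4.0], [0.0, 2.0], weighted_sum=False)  # product
```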

Special settings:

| Parameter | Default value | Description |
| --- | --- | --- |
| model | TimeRGNN | Specify the model; choose from [UserGNN, TimeGNN, TimeRGNN]. |
| data_name | topic | Specify the dataset; choose from [twitter, repost, topic]. |
| time_window | 24 | Specify the observation time window (hours). |
| dataloader | weibo | Specify the dataloader for a given data format; the default dataloader loads data in CSV format. |
| raw_dir | ./data | Specify the data directory. |
| min_cascade_length | 20 for the Weibo datasets, 5 for the Twitter dataset | Minimal cascade length; shorter cascades are filtered out during data processing. |
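The effect of min_cascade_length can be sketched as a simple filter over cascades. Here a cascade is represented as just a list of repost events; the real filtering happens inside the dataset class, and this helper is illustrative only:

```python
def filter_cascades(cascades, min_cascade_length):
    """Drop cascades shorter than min_cascade_length (illustrative sketch)."""
    return [c for c in cascades if len(c) >= min_cascade_length]

cascades = [["u1", "u2", "u3"], ["u1"], ["u1", "u2", "u3", "u4", "u5"]]
kept = filter_cascades(cascades, min_cascade_length=3)  # drops the 1-event cascade
```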

Other optimization settings (inherited from pytorch_lightning.Trainer):

| Parameter | Recommended value | Description |
| --- | --- | --- |
| gpus | 0 for CPU only, 1 for a single GPU | Number of GPUs to train on (int), or which GPUs to train on (list: [int, str]). |
| num_processes | 0 for Windows; on other systems, set according to your needs | Number of processes to train with. |
| max_epochs | 100 | Stop training once this number of epochs is reached. |
| auto_lr_find | True (flag) | Run a learning rate finder algorithm to find an optimal initial learning rate. |
| auto_scale_batch_size | True (flag), power, or binsearch | Automatically find the largest batch size that fits into memory before any training. |
| gradient_clip_val | 0.005 | Gradient clipping value. |
| stochastic_weight_avg | True (flag) | Stochastic Weight Averaging (SWA); smooths the loss landscape, making it harder to end up in a sharp local minimum during optimization. |
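To illustrate what gradient_clip_val does, here is a plain-Python sketch of clipping a gradient vector by its global L2 norm. PyTorch Lightning applies this to the model's parameter gradients each optimization step; the helper below is a hypothetical stand-in, not Lightning's implementation:

```python
import math

def clip_by_norm(grads, clip_val):
    """If the L2 norm of grads exceeds clip_val, rescale grads so the
    norm equals clip_val; otherwise return them unchanged (sketch)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > clip_val:
        scale = clip_val / norm
        return [g * scale for g in grads]
    return grads
```

With gradient_clip_val = 0.005, a gradient like [3.0, 4.0] (norm 5.0) is shrunk by a factor of 1000, while gradients already below the threshold pass through untouched.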

Cite Us

@inproceedings{HERIGCN_2022,
  author          = {Wu, Zhen and Zhou, Jingya and Liu, Ling and Li, Chaozhuo and Gu, Fei},
  booktitle       = {2022 IEEE 38th International Conference on Data Engineering (ICDE)},
  title           = {Deep Popularity Prediction in Multi-Source Cascade with HERI-GCN},
  year            = {2022}
}