
<div align="center"> <img src="./assets/logo.png" alt="logo" /> </div>

# 🚀 A-Unified-Framework-for-Deep-Attribute-Graph-Clustering

Deep attribute graph clustering has developed rapidly in recent years, and many methods have sprung up. Although most of them are open-source, the released codes lack a unified framework, so researchers have to spend a lot of time modifying code just to reproduce results. Fortunately, Liu et al. [Homepage: yueliu1999] organized deep graph clustering methods into a code repository, Awesome-Deep-Graph-Clustering (ADGC). For example, they provide more than 20 datasets in a unified format. Moreover, they list the most relevant papers on deep graph clustering and link to their source code. It is worth mentioning that they organize the code into a rand-augmentation-model-clustering-visualization-utils structure, which greatly helps beginners and researchers. Here, on behalf of myself, I would like to express my sincere thanks and high respect to Liu et al.

❤️ Acknowledgements:

Thanks to these authors for open-sourcing their code (in no particular order):

[ yueliu1999 | bdy9527| Liam Liu | Zhihao PENG | William Zhu | WxTu ]

[ xihongyang1999 | gongleii ]

<a href="https://github.com/yueliu1999" target="_blank"><img src="https://avatars.githubusercontent.com/u/41297969?s=64&v=4" alt="yueliu1999" width="48" height="48"/></a> <a href="https://github.com/bdy9527" target="_blank"><img src="https://avatars.githubusercontent.com/u/16743085?s=64&v=4" alt="bdy9527" width="48" height="48"/></a> <a href="https://github.com/Tiger101010" target="_blank"><img src="https://avatars.githubusercontent.com/u/34651180?s=64&v=4" alt="Liam Liu" width="48" height="48"/></a> <a href="https://github.com/ZhihaoPENG-CityU" target="_blank"><img src="https://avatars.githubusercontent.com/u/23076563?s=64&v=4" alt="Zhihao PENG" width="48" height="48"/> </a><a href="https://github.com/grcai" target="_blank"><img src="https://avatars.githubusercontent.com/u/38714987?s=64&v=4" alt="William Zhu" width="48" height="48"/></a> <a href="https://github.com/WxTu" target="_blank"><img src="https://avatars.githubusercontent.com/u/50702801?v=4" height="48" width="48" alt="WxTu"></a>

<a href="https://github.com/xihongyang1999" target="_blank"><img src="https://avatars.githubusercontent.com/u/94908575?v=4" height="48" width="48" alt="xihongyang1999"></a> <a href="https://github.com/gongleii" target="_blank"><img src="https://avatars.githubusercontent.com/u/43403230?v=4" height="48" width="48" alt="gongleii"></a>

🍉 Introduction

On the basis of ADGC, I refactored the code to bring deep graph clustering code to a higher level of unification. Specifically, I redesigned the architecture so that you can run the open-source code easily, and I defined tool classes and functions that simplify the code and make the configuration of settings clear.

🍓 Quick Start

After cloning the repository, you can follow the steps below to run it:

✈️ Step 1: Check your environment, or install the required libraries directly from requirements.txt:

```shell
pip install -r requirements.txt
```

✈️ Step 2: Prepare the datasets. If you don't have them, you can download them from Liu's repository [yueliu1999 | Google Drive | Nutstore], then unzip them into the dataset directory.

✈️ Step 3: Run main.py from its directory on the command line, or run main.py directly from your IDE.

:star: Examples

Example 1

Take the training of DAEGC as an example:

:one: pretrain GAT:

```shell
python main.py --pretrain --model pretrain_gat_for_daegc --dataset acm --t 2 --desc pretrain_the_GAT_for_DAEGC_on_acm
# or the simplified command:
python main.py -P -M pretrain_gat_for_daegc -D acm -T 2 -DS pretrain_the_GAT_for_DAEGC_on_acm
```

:two: train DAEGC:

```shell
python main.py --model DAEGC --dataset cora --t 2 --desc Train_DAEGC_1_iteration_on_the_cora_dataset
# or the simplified command:
python main.py -M DAEGC -D cora -T 2 -DS Train_DAEGC_1_iteration_on_the_cora_dataset
```

Example 2

Take the training of SDCN as an example:

:one: pretrain AE:

```shell
python main.py --pretrain --model pretrain_ae_for_sdcn --dataset acm --desc pretrain_ae_for_SDCN_on_acm
# or the simplified command:
python main.py -P -M pretrain_ae_for_sdcn -D acm -DS pretrain_ae_for_SDCN_on_acm
```

:two: train SDCN:

```shell
python main.py --model SDCN --dataset acm --norm --desc Train_SDCN_1_iteration_on_the_ACM_dataset
# or the simplified command:
python main.py -M SDCN -D acm -N -DS Train_SDCN_1_iteration_on_the_ACM_dataset
```

✈️ Step 4: If you run the code successfully, don't forget to give me a star! :wink:

🔓 Currently Supported Models

| No. | Model | Paper | Source Code |
| --- | ----- | ----- | ----------- |
| 1 | DAEGC | 《Attributed Graph Clustering: A Deep Attentional Embedding Approach》 | link |
| 2 | SDCN | 《Structural Deep Clustering Network》 | link |
| 3 | AGCN | 《Attention-driven Graph Clustering Network》 | link |
| 4 | EFR-DGC | 《Deep Graph Clustering with Enhanced Feature Representations for Community Detection》 | link |
| 5 | GCAE | :exclamation: In fact, it's GAE with GCN. | - |
| 6 | DFCN | 《Deep Fusion Clustering Network》 | link |
| 7 | HSAN | 《Hard Sample Aware Network for Contrastive Deep Graph Clustering》 | link |
| 8 | DCRN | 《Deep Graph Clustering via Dual Correlation Reduction》 | link |
| 9 | CCGC | 《Cluster-guided Contrastive Graph Clustering Network》 | link |
| 10 | AGC-DRR | 《Attributed Graph Clustering with Dual Redundancy Reduction》 | link |

:exclamation: Attention

  1. According to the paper, the training process of DFCN is divided into three stages. First, pretrain pretrain_ae_for_dfcn and pretrain_igae_for_dfcn separately for 30 epochs each. Second, pretrain the AE and IGAE simultaneously for 100 epochs; both are integrated into pretrain_both_for_dfcn. Finally, train DFCN formally for at least 200 epochs. The same applies to DCRN.
  2. The HSAN model does not require pretraining.
  3. The results reported in the DCRN paper have not been reproduced yet; updates will follow.

In the future, I plan to add the other models. If you find my framework useful, feel free to contribute to its improvement by submitting your own code.

🔓 TODO

| No. | Model | Paper | Source Code |
| --- | ----- | ----- | ----------- |
| 1 | SCGC | 《Simple Contrastive Graph Clustering》 | link |
| 2 | Dink-Net | 《Dink-Net: Neural Clustering on Large Graphs》 | link |

:robot: Commands

:alien: DAEGC

```shell
# pretrain
python main.py -P -M pretrain_gat_for_daegc -D acm -T 2 -DS balabala -LS 1
# train
python main.py -M DAEGC -D acm -T 2 -DS balabala -LS 1 -TS -H
```

:alien: SDCN

```shell
# pretrain
python main.py -P -M pretrain_ae_for_sdcn -D acm -DS balabala -LS 1
# train
python main.py -M SDCN -D acm -N -DS balabala -LS 1 -TS -H
```

:alien: AGCN

```shell
# pretrain
python main.py -P -M pretrain_ae_for_agcn -D acm -DS balabala -LS 1
# train
python main.py -M AGCN -D acm -N -SF -DS balabala -LS 1 -TS -H
```

:alien: EFR-DGC

```shell
# pretrain
python main.py -P -M pretrain_ae_for_efrdgc -D acm -DS balabala -LS 1
python main.py -P -M pretrain_gat_for_efrdgc -D acm -T 2 -DS balabala -LS 1
# train
python main.py -M EFRDGC -D acm -T 2 -DS balabala -LS 1 -TS -H
```

:alien: GCAE

```shell
# pretrain
python main.py -P -M pretrain_gae_for_gcae -D acm -N -DS balabala -LS 1
# train
python main.py -M GCAE -D acm -N -DS balabala -LS 1 -TS -H
```

:alien: DFCN

```shell
# pretrain. Execute the following commands in sequence.
python main.py -P -M pretrain_ae_for_dfcn -D acm -DS balabala -LS 1
python main.py -P -M pretrain_igae_for_dfcn -D acm -N -DS balabala -LS 1
python main.py -P -M pretrain_both_for_dfcn -D acm -N -DS balabala -LS 1
# train
python main.py -M DFCN -D acm -N -DS balabala -LS 1 -TS -H
```

:alien: HSAN

```shell
# train
python main.py -M HSAN -D cora -SLF -A npy -F npy -DS balabala -LS 1 -TS
```

:alien: DCRN

```shell
# pretrain. Execute the following commands in sequence.
python main.py -P -M pretrain_ae_for_dcrn -D acm -S 1 -DS balabala -LS 1
python main.py -P -M pretrain_igae_for_dcrn -D acm -N -SF -S 1 -DS balabala -LS 1
python main.py -P -M pretrain_both_for_dcrn -D acm -N -SF -S 1 -DS balabala -LS 1
# train
python main.py -M DCRN -D acm -SLF -A npy -S 3 -DS balabala -LS 1 -TS -H
```

:alien: CCGC

```shell
python main.py -M CCGC -D acm -SLF -SF -A npy -S 0 -LS 1 -DS balabala
```

:alien: AGC-DRR

```shell
python main.py -M AGCDRR -D acm -F npy -S 0 -LS 1 -DS balabala
```

🍊 Advanced

:exclamation: Arguments

🥤 Help

```
> python main.py --help
usage: main.py [-h] [-P] [-TS] [-H] [-N] [-SLF] [-SF] [-DS DESC]
               [-M MODEL_NAME] [-D DATASET_NAME] [-R ROOT] [-K K] [-T T]
               [-LS LOOPS] [-F {tensor,npy}] [-L {tensor,npy}]
               [-A {tensor,npy}] [-S SEED]

Scalable Unified Framework of Deep Graph Clustering

optional arguments:
  -h, --help            show this help message and exit
  -P, --pretrain        Whether to pretrain. Using '-P' to pretrain.
  -TS, --tsne           Whether to draw the clustering tsne image. Using '-TS'
                        to draw clustering TSNE.
  -H, --heatmap         Whether to draw the embedding heatmap. Using '-H' to
                        draw embedding heatmap.
  -N, --norm            Whether to normalize the adj, default is False. Using
                        '-N' to load adj with normalization.
  -SLF, --self_loop_false
                        Whether the adj has self-loop, default is True. Using
                        '-SLF' to load adj without self-loop.
  -SF, --symmetric_false
                        Whether the normalization type is symmetric. Using
                        '-SF' to load asymmetric adj.
  -DS DESC, --desc DESC
                        The description of this experiment.
  -M MODEL_NAME, --model MODEL_NAME
                        The model you want to run.
  -D DATASET_NAME, --dataset DATASET_NAME
                        The dataset you want to use.
  -R ROOT, --root ROOT  Input root path to switch relative path to absolute.
  -K K, --k K           The k of KNN.
  -T T, --t T           The order in GAT. 'None' denotes don't calculate the
                        matrix M.
  -LS LOOPS, --loops LOOPS
                        The number of training rounds.
  -F {tensor,npy}, --feature {tensor,npy}
                        The datatype of feature. 'tensor' and 'npy' are
                        available.
  -L {tensor,npy}, --label {tensor,npy}
                        The datatype of label. 'tensor' and 'npy' are
                        available.
  -A {tensor,npy}, --adj {tensor,npy}
                        The datatype of adj. 'tensor' and 'npy' are available.
  -S SEED, --seed SEED  The random seed. The default value is 0.
```

🍹 Details

Here are the details of argparse arguments you can change:

| tag | argument | short | description | type/action | default |
| --- | -------- | ----- | ----------- | ----------- | ------- |
| 🟥 | <span style="color: red">--pretrain</span> | -P | Whether this training is pretraining. | "store_true" | False |
| 🟩 | <span style="color: green">--tsne</span> | -TS | Draw the clustering result as a t-SNE scatter plot. | "store_true" | False |
| 🟩 | <span style="color: green">--heatmap</span> | -H | Draw the heatmap of the embedding representation learned by the model. | "store_true" | False |
| 🟥 | <span style="color: red">--norm</span> | -N | Whether to normalize the adj; default is False. Use '-N' to load adj with normalization. | "store_true" | False |
| 🟦 | <span style="color: blue">--self_loop_false</span> | -SLF | Whether the adj has self-loops; default is True. Use '-SLF' to load adj without self-loops. | "store_false" | True |
| 🟦 | <span style="color: blue">--symmetric_false</span> | -SF | Whether the normalization type is symmetric. Use '-SF' to load an asymmetric adj. | "store_false" | True |
| 🟥 | <span style="color: red">--model</span> | -M | The model you want to train. Should correspond to a model in the model directory. | str | "SDCN" |
| 🟥 | <span style="color: red">--dataset</span> | -D | The dataset you want to train on. Should correspond to a dataset name in the dataset directory. | str | "acm" |
| 🟦 | <span style="color: blue">--k</span> | -K | For graph datasets, leave it as None. If the dataset is not a graph, set k to construct a KNN graph. | int | None |
| 🟦 | <span style="color: blue">--t</span> | -T | If the model needs the matrix M (e.g. DAEGC), set t according to the paper. None means the model doesn't need M. | int | None |
| 🟥 | <span style="color: red">--loops</span> | -LS | The number of training runs. Set it to 10 to train the model 10 times. | int | 1 |
| 🟥 | <span style="color: red">--root</span> | -R | Set it to the root path to turn relative paths into absolute ones. | str | None |
| 🟪 | <span style="color: purple">--desc</span> | -DS | The description of this experiment. | str | "default" |
| 🟦 | <span style="color: blue">--feature</span> | -F | The datatype of feature; 'tensor' and 'npy' are available. | str | "tensor" |
| 🟦 | <span style="color: blue">--label</span> | -L | The datatype of label; 'tensor' and 'npy' are available. | str | "npy" |
| 🟦 | <span style="color: blue">--adj</span> | -A | The datatype of adj; 'tensor' and 'npy' are available. | str | "tensor" |
| 🟥 | <span style="color: red">--seed</span> | -S | The random seed; 0 if not specified. | int | 0 |
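The "store_false" flags above are easy to misread: -SLF and -SF default to True, and passing the flag turns them off. A minimal argparse sketch of a few of these flags illustrates the behavior (this is not the framework's actual parser; the flag subset and dest names are illustrative):

```python
# Sketch of a few flags from the table above, mirroring the documented
# defaults; the real parser in main.py may differ in detail.
import argparse

parser = argparse.ArgumentParser(
    description="Scalable Unified Framework of Deep Graph Clustering")
parser.add_argument("-P", "--pretrain", action="store_true")             # default False
parser.add_argument("-N", "--norm", action="store_true")                 # default False
parser.add_argument("-SLF", "--self_loop_false", action="store_false")   # default True
parser.add_argument("-SF", "--symmetric_false", action="store_false")    # default True
parser.add_argument("-M", "--model", dest="model_name", default="SDCN")
parser.add_argument("-LS", "--loops", type=int, default=1)

args = parser.parse_args(["-M", "DAEGC", "-SLF"])
# Passing -SLF flips self_loop_false to False (adj loaded without
# self-loops), while symmetric_false keeps its default of True.
```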

💡 Tips:

🧩 Scalability

Strong scalability is a prominent feature of this framework. If you want to run your own code within it, follow the steps below:

🐯 Model Extension

🚄 Step 1: Write a model file model.py using PyTorch and a training file train.py, put them into a directory named after the uppercase model name, and place that directory under the model directory. A template is provided in the template directory.

🚄 Step 2: If your model needs pretraining, write a pretraining file train.py, put it into a directory named pretrain_{module (lowercase)}_for_{model (lowercase)}, and place it under the model directory. A template is provided in the template directory.
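As a concrete illustration of the two steps above, the resulting layout for a hypothetical model "MYMODEL" whose AE module needs pretraining can be sketched like this (the directory and file names are placeholders following the naming convention described above):

```python
# Create the directory layout for a hypothetical new model "MYMODEL"
# with a pretrained AE module; names follow the convention above.
import os

for d in ("model/MYMODEL", "model/pretrain_ae_for_mymodel"):
    os.makedirs(d, exist_ok=True)

for f in ("model/MYMODEL/model.py",                   # the model itself
          "model/MYMODEL/train.py",                   # its training function
          "model/pretrain_ae_for_mymodel/train.py"):  # pretraining code
    open(f, "a").close()                              # create empty stub files
```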

🚄 Step 3: Modify pretrain_type_dict at line 38 of path_manager.py. The format is "model name (uppercase)": [items]. If your model needs no pretraining, leave the list empty; otherwise, list all the modules that must be pretrained. For example, to pretrain the AE module, add "pretrain_ae" to the list. Meanwhile, check whether the pretrain type exists in the if-else statement; if not, add it manually.
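A sketch of what such pretrain_type_dict entries might look like (the real dict in path_manager.py is authoritative; "MYMODEL" is a placeholder for your own model):

```python
# Illustrative pretrain_type_dict entries; keys are uppercase model
# names, values list the modules to pretrain ([] means no pretraining).
pretrain_type_dict = {
    "SDCN": ["pretrain_ae"],                                   # one module
    "DFCN": ["pretrain_ae", "pretrain_igae", "pretrain_both"], # three stages
    "HSAN": [],                                                # no pretraining
    "MYMODEL": ["pretrain_ae"],                                # your new entry
}
```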

🚄 Step 4: Run your code!

🐴 Dataset Extension

🚌 Step 1: Make sure your dataset is well processed and stored in 'npy' files, i.e. saved NumPy arrays. If your dataset is graph data, you need {dataset name}_feat.npy, {dataset name}_label.npy, and {dataset name}_adj.npy. If your dataset is non-graph data, there are two options. One is to use only {dataset name}_feat.npy and {dataset name}_label.npy and set the graph-construction type at line 167 of load_data.py; if the construction type does not exist, add it to the construct_graph function in data_processor.py. The other is to construct the graph manually and provide {dataset name}_feat.npy, {dataset name}_label.npy, and {dataset name}_adj.npy, but remember which value of k you used, because the dataset will then be treated as a graph dataset.
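The file layout from Step 1 can be produced with a few lines of NumPy; the dataset name "demo" and the array shapes here are purely illustrative:

```python
# Save a toy graph dataset in the {name}_feat/label/adj.npy layout
# described above. "demo" and the shapes are illustrative placeholders.
import os
import numpy as np

def save_dataset(name, feat, label, adj, root="dataset"):
    path = os.path.join(root, name)          # e.g. dataset/demo/
    os.makedirs(path, exist_ok=True)
    np.save(os.path.join(path, f"{name}_feat.npy"), feat)
    np.save(os.path.join(path, f"{name}_label.npy"), label)
    np.save(os.path.join(path, f"{name}_adj.npy"), adj)
    return path

# Toy graph: 4 nodes, 3 features, symmetric adjacency with self-loops.
feat = np.random.rand(4, 3).astype(np.float32)
label = np.array([0, 0, 1, 1])
adj = np.array([[1, 1, 0, 0],
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]], dtype=np.float32)
path = save_dataset("demo", feat, label, adj)
```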

🚌 Step 2: Put the files above into a directory named after the lowercase dataset name, then place it under the dataset directory.

🚌 Step 3: Add the dataset's information to dataset_info.py.

🚌 Step 4: Use your dataset!

🍎 Ending

Deep graph clustering is currently in a stage of rapid development, and more graph clustering methods will be proposed in the future. A unified code framework therefore saves researchers coding and experiment time, letting them devote more energy to theoretical innovation. I believe graph clustering will reach a higher level in the future.

If this repository is helpful to you, please remember to Star~😘.

Citation

If you use our code, please cite these papers:

```bibtex
@article{ding2023graph,
  title = {Graph clustering network with structure embedding enhanced},
  journal = {Pattern Recognition},
  volume = {144},
  pages = {109833},
  year = {2023},
  issn = {0031-3203},
  doi = {10.1016/j.patcog.2023.109833},
  url = {https://www.sciencedirect.com/science/article/pii/S0031320323005319},
  author = {Shifei Ding and Benyu Wu and Xiao Xu and Lili Guo and Ling Ding},
}

@article{ding2024towards,
  author = {Ding, Shifei and Wu, Benyu and Ding, Ling and Xu, Xiao and Guo, Lili and Liao, Hongmei and Wu, Xindong},
  title = {Towards Faster Deep Graph Clustering via Efficient Graph Auto-Encoder},
  year = {2024},
  issue_date = {September 2024},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  volume = {18},
  number = {8},
  issn = {1556-4681},
  url = {https://doi.org/10.1145/3674983},
  doi = {10.1145/3674983},
  journal = {ACM Trans. Knowl. Discov. Data},
  month = {aug},
  articleno = {202},
  numpages = {23},
}
```