<div align="center"> <h2><a href="https://arxiv.org/pdf/2305.18405.pdf">Dink-Net: Neural Clustering on Large Graphs</a></h2>

Yue Liu<sup>1,2</sup>, Ke Liang<sup>1</sup>, Jun Xia<sup>2</sup>, Sihang Zhou<sup>1</sup>, Xihong Yang<sup>1</sup>, Xinwang Liu<sup>1</sup>, Stan Z. Li<sup>2</sup>

<sup>1</sup>National University of Defense Technology, <sup>2</sup>Westlake University

</div> <p align="center"> <a href="https://pytorch.org/" alt="PyTorch"> <img src="https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?e&logo=PyTorch&logoColor=white" /> </a> <a href="https://icml.cc/Conferences/2023" alt="Conference"> <img src="https://img.shields.io/badge/ICML'23-brightgreen" /> </a> </p> <p align = "justify"> Deep graph clustering, which aims to group the nodes of a graph into disjoint clusters with deep neural networks, has achieved promising progress in recent years. However, existing methods fail to scale to large graphs with millions of nodes. To solve this problem, we propose a scalable deep graph clustering method (<i>Dink-Net</i>) based on the idea of <u>di</u>lation and shri<u>nk</u>. First, representations are learned in a self-supervised manner by discriminating whether nodes have been corrupted by augmentations. Meanwhile, the cluster centers are initialized as learnable neural parameters. Subsequently, the clustering distribution is optimized by minimizing the proposed cluster dilation loss and cluster shrink loss in an adversarial manner. With these designs, we unify the two steps of clustering, i.e., representation learning and clustering optimization, into an end-to-end framework, guiding the network to learn clustering-friendly features. In addition, <i>Dink-Net</i> scales well to large graphs, since the designed loss functions use mini-batch data to optimize the clustering distribution without performance drops. Both experimental results and theoretical analyses demonstrate the superiority of our method. </p>
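To make the dilation-and-shrink idea above concrete, here is a minimal PyTorch sketch of how a cluster dilation loss (pushing learnable centers apart) and a cluster shrink loss (pulling node embeddings toward their nearest center) could be combined on a mini-batch. The function name, the normalization, and the exact formulation are illustrative assumptions, not the implementation in this repository or the paper.

```python
import torch
import torch.nn.functional as F

def dink_losses(z, centers, tradeoff=1.0):
    """Illustrative dilation/shrink losses on one mini-batch (a sketch, not the paper's exact code).

    z        -- (B, d) node embeddings from the self-supervised encoder
    centers  -- (K, d) cluster centers kept as learnable parameters
    tradeoff -- weight balancing the two losses (cf. the --tradeoff flag)
    """
    z = F.normalize(z, dim=1)
    c = F.normalize(centers, dim=1)

    # Cluster dilation: push the K centers away from each other.
    center_dist = torch.cdist(c, c, p=2)                                 # (K, K) pairwise distances
    off_diag = ~torch.eye(c.shape[0], dtype=torch.bool, device=c.device)
    dilation_loss = -center_dist[off_diag].mean()

    # Cluster shrink: pull each node embedding toward its nearest center.
    shrink_loss = torch.cdist(z, c, p=2).min(dim=1).values.mean()        # (B, K) -> scalar

    return shrink_loss + tradeoff * dilation_loss
```

In such a setup, `centers` would typically be a `torch.nn.Parameter` of shape `(K, d)` optimized jointly with the encoder, so that minimizing the combined loss on mini-batches directly shapes the clustering distribution.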


<details> <summary>Table of Contents</summary> <ol> <li><a href="#Usage">Usage</a></li> <li><a href="#acknowledgement">Acknowledgement</a></li> <li><a href="#citation">Citation</a></li> </ol> </details>

Usage

Datasets

| Dataset | Type | # Nodes | # Edges | # Feature Dimensions | # Classes |
| --- | --- | --- | --- | --- | --- |
| Cora | Attribute Graph | 2,708 | 5,278 | 1,433 | 7 |
| CiteSeer | Attribute Graph | 3,327 | 4,614 | 3,703 | 6 |
| Amazon-Photo | Attribute Graph | 7,650 | 119,081 | 745 | 8 |
| ogbn-arxiv | Attribute Graph | 169,343 | 1,166,243 | 128 | 40 |
| Reddit | Attribute Graph | 232,965 | 23,213,838 | 602 | 41 |
| ogbn-products | Attribute Graph | 2,449,029 | 61,859,140 | 100 | 47 |
| ogbn-papers100M | Attribute Graph | 111,059,956 | 1,615,685,872 | 128 | 172 |
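The repository ships packaged datasets (see Quick Start below), but the OGB graphs in the table can also be fetched directly with the `ogb` package listed in the requirements. A small sketch for ogbn-arxiv follows; the `root` directory is an assumption.

```python
from ogb.nodeproppred import DglNodePropPredDataset

# Load ogbn-arxiv as a DGL graph; the root directory is an assumption.
dataset = DglNodePropPredDataset(name="ogbn-arxiv", root="./data")
graph, labels = dataset[0]                 # DGLGraph and a (num_nodes, 1) label tensor

features = graph.ndata["feat"]             # 128-dimensional node features
print(graph.num_nodes(), graph.num_edges(), features.shape, int(labels.max()) + 1)
```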

Requirements

The code is tested with Python 3.7.

dgl-cu113==0.9.1.post1
munkres==1.1.4
networkx==2.8.3
numpy==1.23.2
scikit_learn==1.3.0
scipy==1.6.0
torch==2.0.1
torch-scatter==2.0.9
torch-sparse==0.6.12
torch-spline-conv==1.2.1
torch-geometric==2.1.0.post1
tqdm==4.65.0
wandb==0.15.4
ogb==1.3.6

Configurations

--device     |  running device
--dataset    |  dataset name
--hid_units  |  hidden units
--activate   |  activation function
--tradeoff   |  tradeoff parameter
--lr         |  learning rate
--epochs     |  training epochs
--eval_inter |  evaluation interval
--wandb      |  wandb logging
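For reference, the flags above could be wired up with a standard `argparse` parser along these lines; the defaults shown are illustrative assumptions and may differ from the values used in `main.py`.

```python
import argparse

# Illustrative parser for the flags listed above; defaults are assumptions.
parser = argparse.ArgumentParser(description="Dink-Net training options (sketch)")
parser.add_argument("--device", type=str, default="cuda:0", help="running device")
parser.add_argument("--dataset", type=str, default="cora", help="dataset name")
parser.add_argument("--hid_units", type=int, default=512, help="hidden units")
parser.add_argument("--activate", type=str, default="relu", help="activation function")
parser.add_argument("--tradeoff", type=float, default=1.0, help="tradeoff parameter")
parser.add_argument("--lr", type=float, default=1e-2, help="learning rate")
parser.add_argument("--epochs", type=int, default=200, help="training epochs")
parser.add_argument("--eval_inter", type=int, default=10, help="evaluation interval")
parser.add_argument("--wandb", action="store_true", help="enable wandb logging")
args = parser.parse_args()
```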

Quick Start

clone this repository and change directory to Dink-Net

git clone https://github.com/yueliu1999/Dink-Net.git
cd ./Dink-Net

unzip the datasets and model parameters

unzip -d ./data/ ./data/datasets.zip
unzip -d ./models/ ./models/models.zip

run the code with the provided scripts

bash ./scripts/train_cora.sh

bash ./scripts/train_citeseer.sh

bash ./scripts/train_amazon_photo.sh

bash ./scripts/train_ogbn-arxiv.sh

or run the code directly with commands

python main.py --device cuda:0 --dataset cora --hid_units 512 --lr 1e-2 --epochs 200 --wandb

python main.py --device cuda:0 --dataset citeseer --hid_units 1536 --lr 5e-4 --epochs 200 --wandb

python main.py --device cuda:0 --dataset amazon_photo --hid_units 512 --lr 1e-2 --epochs 100  --eval_inter 1 --wandb

python main.py --device cuda:0 --dataset ogbn_arxiv --hid_units 1500 --encoder_layer 3 --lr 1e-4 --epochs 30 --batch_size 8192 --batch_train --eval_inter 1 --wandb

Tip: remove "--wandb" to disable wandb logging if a logging error occurs.
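The ogbn-arxiv command above enables mini-batch training via `--batch_train` and `--batch_size`. Below is a rough sketch of how neighbor-sampled mini-batches can be built with DGL; the fan-outs, loader settings, and commented training loop are assumptions, not the repository's actual sampling code.

```python
import dgl
import torch

def make_batches(graph, batch_size=8192, fanouts=(10, 10, 10)):
    """Hypothetical neighbor-sampled mini-batch loader (a sketch, not the repository's code)."""
    sampler = dgl.dataloading.NeighborSampler(list(fanouts))
    return dgl.dataloading.DataLoader(
        graph,
        torch.arange(graph.num_nodes()),   # all nodes: clustering is unsupervised
        sampler,
        batch_size=batch_size,
        shuffle=True,
        drop_last=False,
    )

# for input_nodes, output_nodes, blocks in make_batches(graph):
#     feats = blocks[0].srcdata["feat"]    # input features of the sampled subgraph
#     ...                                  # encode, then apply the clustering losses per batch
```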

Results

<img src="./assets/main_results.png" alt="main_results" style="zoom:61%;" /> <p style="text-align:justify; text-justify:inter-ideograph;"> Table 1. Clustering performance (%) of our method and fourteen state-of-the-art baselines. The bold and underlined values are the best and runner-up results. “OOM” indicates that the method runs out of memory. “-” denotes that the method does not converge. </p>


<p align="center"> Figure 1. <i>t</i>-SNE visualization of seven methods on the Cora dataset. </p>

Acknowledgements

Our code is partly based on the following GitHub repositories. Thanks for their awesome work.

Pretraining

To pretrain Dink-Net on your own dataset, refer to here; a sketch of preparing a custom graph is shown below.
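If your graph is not among the packaged datasets, it first needs to be represented as a DGL graph with node features. Below is a hypothetical sketch of wrapping a raw edge list and feature matrix; the variable names, preprocessing steps, and shapes are assumptions.

```python
import dgl
import numpy as np
import torch

# Hypothetical custom graph: an edge list plus a node feature matrix (shapes are assumptions).
src = np.array([0, 1, 2, 2])                         # edge source node ids
dst = np.array([1, 2, 0, 3])                         # edge destination node ids
feat = np.random.rand(4, 128).astype("float32")      # (num_nodes, feature_dim)

g = dgl.graph((torch.from_numpy(src), torch.from_numpy(dst)), num_nodes=feat.shape[0])
g = dgl.add_self_loop(dgl.to_bidirected(g))          # symmetrize edges, add self-loops (optional)
g.ndata["feat"] = torch.from_numpy(feat)
print(g)
```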

Citations

If you find this repository helpful, please cite our paper.

@inproceedings{Dink-Net,
  title={Dink-Net: Neural Clustering on Large Graphs},
  author={Liu, Yue and Liang, Ke and Xia, Jun and Zhou, Sihang and Yang, Xihong and Liu, Xinwang and Li, Stan Z.},
  booktitle={International Conference on Machine Learning},
  year={2023},
  organization={PMLR}
}
<p align="right">(<a href="#top">back to top</a>)</p>