[Paper Close Reading] Semisupervised Graph Neural Networks for Graph Classification

Paper link: Semisupervised Graph Neural Networks for Graph Classification | IEEE Journals & Magazine | IEEE Xplore

The English here is typed entirely by hand, summarizing and paraphrasing the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with caution.

Table of Contents

1. TL;DR

1.1. Takeaways

2. Section-by-Section Close Reading

2.1. Abstract

2.2. Introduction

2.3. Related Work and Motivation

2.3.1. Graph Neural Networks for Graph Classification

2.3.2. Semisupervised Learning in Graph Classification

2.3.3. Few-Shot Learning in Graph Classification

2.4. Methodology

2.4.1. Problem Formulation and Notations

2.4.2. Framework

2.4.3. Implementation on Typical Graph Classification

2.4.4. Implementation on Few-Shot Graph Classification

2.5. Experimental Study

2.5.1. Datasets

2.5.2. Configurations of Graph Neural Networks

2.5.3. Experiments on Typical Graph Classification

2.5.4. Experiments on Few-Shot Graph Classification

2.5.5. Ablation Study

2.5.6. Parameter Sensitivity Analysis

2.6. Conclusion

3. Reference


1. TL;DR

1.1. Takeaways

(1) Decent paper; it is not particularly hard to read, but the absence of released code makes the design fairly difficult to reproduce.

2. Section-by-Section Close Reading

2.1. Abstract

        ①GNNs have achieved SOTA results in the purely supervised setting

        ②Semi-supervised learning is usually applied to node classification

        ③They train two GNNs that serve as complementary views

2.2. Introduction

        ①Graph kernel methods are two-stage methods and are time-consuming

        ②The performance of existing methods relies heavily on labeled data

        ③They proposed a semisupervised GNN framework for graph classification based on co-training and self-training

arduous  adj. hard; laborious

2.3. Related Work and Motivation

2.3.1. Graph Neural Networks for Graph Classification

        ①The authors list PATCHY-SAN, MPNN, DGCNN, GAM, DIFFPOOL and GIN

        ②These methods do not make use of unlabeled data in their experiments

2.3.2. Semisupervised Learning in Graph Classification

        ①Semi-supervised methods: self-training, co-training, and label propagation

        ②Assumptions of semi-supervised methods: the smoothness assumption, the cluster assumption, and the manifold assumption

        ③⭐Unlike nodes, graphs have no connections between them (edges connect nodes, not graphs), so graph-level samples are independent of each other

2.3.3. Few-Shot Learning in Graph Classification

        ①They take the prototypical network as an example, which computes each class prototype as the average embedding of that class's samples:

Pr_n=\frac1{|\mathcal{S}_n|}\sum_{(x_i,y_i)\in\mathcal{S}_n}f(x_i)

        ②The probability that graph x_i belongs to class n:

P(y=n|x_i)=\frac{\exp(-d\left(f(x_i),Pr_n\right))}{\sum_{j=1}^N\exp(-d(f(x_i),Pr_j))}
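A minimal sketch (my own illustration, not code from the paper) of how these two formulas compute class prototypes and class probabilities; f(x_i) is assumed to be a precomputed graph embedding and d(.,.) is taken here to be the squared Euclidean distance:

```python
import torch

def prototypes(support_emb, support_labels, num_classes):
    # support_emb: (S, d) graph embeddings f(x_i); support_labels: (S,) values in [0, N)
    # Pr_n is the mean embedding of the support samples of class n
    return torch.stack([support_emb[support_labels == n].mean(dim=0)
                        for n in range(num_classes)])    # (N, d)

def class_probs(query_emb, protos):
    # Softmax over negative distances to each prototype
    dists = torch.cdist(query_emb, protos) ** 2          # (Q, N)
    return torch.softmax(-dists, dim=1)                  # P(y = n | x_i)
```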

2.4. Methodology

2.4.1. Problem Formulation and Notations

(1)Supervised Graph Classification:

        ①Training set: \mathcal{D}_{\mathrm{training}}=\{(x_1,y_1),\ldots,(x_l,y_l)\}

        ②The goal is to map the test data \mathcal{D}_{\mathrm{test}}=\{x_{l+1},\ldots,x_{l+u}\} to labels

(2)Semisupervised Graph Classification:

        ①Training set: \mathcal{D}_{\mathrm{training}}=\{(x_1,y_1),\ldots,(x_l,y_l),x_{l+1},\ldots,x_{l+u}\}

        ②Common notations:

2.4.2. Framework

        ①They adopt a pretraining strategy during the first num_{pre} epochs

        ②For the same graph, the disagreement between the two classifiers' predictions is measured by the Jensen–Shannon divergence:

\begin{aligned}\ell_{JS}(x;\Theta_1,\Theta_2)&=\sum_{x_{i}\in\mathcal{D}_{U}}\Big(H\Big(\frac{1}{2}\big(Z_1(x_{i})+Z_2(x_{i})\big)\Big)\\&\quad-\frac{1}{2}\big(H(Z_1(x_{i}))+H(Z_2(x_{i}))\big)\Big)\end{aligned}

where Z_1 and Z_2 denote the softmax scores output by the two classifiers
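A hedged sketch of this consistency term, assuming Z_1 and Z_2 are the softmax outputs of the two GNN classifiers on the unlabeled graphs (my own illustration, not the authors' code):

```python
import torch

def entropy(p, eps=1e-12):
    # Shannon entropy of each row of a (B, N) probability matrix
    return -(p * (p + eps).log()).sum(dim=1)

def js_consistency(z1, z2):
    # z1, z2: (B, N) softmax scores of the two views on D_U
    m = 0.5 * (z1 + z2)
    return (entropy(m) - 0.5 * (entropy(z1) + entropy(z2))).sum()
```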

        ③The supervised loss is applied to labeled data:

\mathcal{L}_s(x,y;\Theta) = \sum_{x_i\in\mathcal{D}_L}\ell_{CE}(\operatorname{argmax}(Z(x_i)),y_i)

        ④Total training loss for the two GNNs:

\mathcal{L}_{\mathrm{pre}}=\mathcal{L}_{s}(x,y;\Theta_1)+\mathcal{L}_{s}(x,y;\Theta_2)+\lambda_{JS}\,\ell_{JS}(x;\Theta_1,\Theta_2)

        ⑤The two GNNs, acting as different views, assign pseudo labels to each other

        ⑥They assign a weight to each unlabeled sample:

\omega_i=1-\frac{H(Z(x_i))}{\log(N)}

where H(\cdot) denotes the entropy function and \log(N) is the maximum possible entropy of a distribution over the N classes
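A small sketch of this confidence weight (illustrative only); low-entropy, i.e., confident, predictions receive ω_i close to 1:

```python
import math
import torch

def confidence_weight(z, eps=1e-12):
    # z: (B, N) softmax scores for unlabeled samples
    h = -(z * (z + eps).log()).sum(dim=1)        # H(Z(x_i))
    return 1.0 - h / math.log(z.size(1))         # omega_i in [0, 1]
```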

        ⑦To weaken the impact of class imbalance, they add another weight \gamma_{j}, j=1,\ldots,N, defined by:

\gamma_{j}=(|L_{j}|+|U_{j}|)^{-1}

where |L_j| denotes the number of labeled samples of class j in \mathcal{D}_L and |U_j| denotes the number of pseudo-labeled samples of class j in \mathcal{D}_U
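A sketch of the class-balance weight, under the assumption that every class has at least one (pseudo-)labeled sample; the clamp below is my own guard, not from the paper:

```python
import torch

def class_balance_weight(labeled_counts, pseudo_counts):
    # labeled_counts, pseudo_counts: (N,) tensors holding |L_j| and |U_j|
    return 1.0 / (labeled_counts + pseudo_counts).clamp(min=1).float()   # gamma_j
```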

        ⑧The loss on unlabeled data (minimizing the loss over pseudo-labeled samples):

\mathcal{L}_{\mathrm{pseudo}}(x,\widehat{y};\Theta) = \sum_{x_{i}\in\mathcal{D}_{U}}\omega_{i}\gamma_{\widehat{y}_{i}}\ell_{CE}(\mathrm{argmax}(Z(x_{i})),\widehat{y}_{i})

where \widehat{y}_{i} denotes the pseudo label assigned to the unlabeled sample by the other view
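Putting ⑥–⑧ together, a hedged sketch of the weighted pseudo-label loss; here I use a standard cross-entropy on the classifier scores rather than the argmax form written above:

```python
import torch
import torch.nn.functional as F

def pseudo_loss(logits, pseudo_labels, omega, gamma):
    # logits: (B, N) scores for unlabeled graphs; pseudo_labels: (B,) long tensor from the other view
    # omega: (B,) confidence weights; gamma: (N,) class-balance weights
    ce = F.cross_entropy(logits, pseudo_labels, reduction="none")   # (B,)
    return (omega * gamma[pseudo_labels] * ce).sum()
```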

        ⑨The overall loss function for co-training:

\begin{aligned}\mathcal{L}_{co}&=\lambda_{co}\,\mu\,\mathcal{L}_{s}(x,y;\Theta_1)+\mathcal{L}_{\mathrm{pseudo}}(x,\widehat{y}_2;\Theta_1)\\&\quad+\lambda_{co}\,\mu\,\mathcal{L}_{s}(x,y;\Theta_2)+\mathcal{L}_{\mathrm{pseudo}}(x,\widehat{y}_1;\Theta_2)\end{aligned}

where \lambda_{co} denotes the tradeoff factor between true-labeled and pseudo-labeled samples, and \mu denotes an additional weight for the true-labeled examples, \mu=|\mathcal{D}_{mb^{\prime}}|/|\mathcal{D}_{mb^{\prime}}\cap\mathcal{D}_{L}|
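A sketch of how the co-training loss for one mini-batch could be assembled from the pieces above; μ is computed from the mini-batch as described, and the function and argument names are illustrative only:

```python
def co_training_loss(sup_loss_1, pseudo_loss_2to1, sup_loss_2, pseudo_loss_1to2,
                     lambda_co, batch_size, labeled_in_batch):
    # mu = |D_mb'| / |D_mb' ∩ D_L| upweights the few true-labeled samples in the batch
    mu = batch_size / max(labeled_in_batch, 1)
    return (lambda_co * mu * sup_loss_1 + pseudo_loss_2to1
            + lambda_co * mu * sup_loss_2 + pseudo_loss_1to2)
```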

        ⑩They reset the pseudo-labeled samples every \beta epochs to mitigate the harm of accumulated errors

        ⑪The supervised loss on pseudo-labeled samples in self-training:

\mathcal{L}_{\mathrm{pseudo\_self}}(x,\widehat{y};\Theta)=\sum_{x_{i}\in\mathcal{D}_{L^{\prime}}}\ell_{CE}(\operatorname{argmax}(Z(x_{i})),\widehat{y}_{i})

where \widehat{y}_{i} denotes the pseudo label assigned to the unlabeled sample by the classifier's own view

        ⑫The overall self-training loss function:

\mathcal{L}_{\mathrm{self}}=\mathcal{L}_{\mathrm{pseudo\_self}}(x,\widehat{y}_1;\Theta_1)+\mathcal{L}_{\mathrm{pseudo\_self}}(x,\widehat{y}_2;\Theta_2)

        ⑬The overall loss in the model:

\mathcal{L}=\mathcal{L}_{co}+\mathcal{L}_{\mathrm{self}}

        ⑭The workflow of this model:

        ⑮Algorithm of this model:

ameliorate  vt. to improve; to make better

2.4.3. Implementation on Typical Graph Classification

        ①This framework can be combined with any GNN backbone

2.4.4. Implementation on Few-Shot Graph Classification

        ①They combine their framework with the prototypical network:

\widehat{P}(y=n|x_i)=\frac{\exp\bigl(-d\bigl(f(x_i),\widehat{Pr}_n\bigr)\bigr)}{\sum_{j=1}^N\exp\bigl(-d\bigl(f(x_i),\widehat{Pr}_j\bigr)\bigr)}

        ②The framework applied to few-shot classification:

        ③Pseudo label generation:

2.5. Experimental Study

2.5.1. Datasets

        ①Seven classic graph classification datasets: NCI1, NCI109, D&D, COLLAB, REDDIT-MULTI-12K, MiniGCDataset, and DBLP_v1

        ②Statistics of classic graph classification datasets:

        ③Two few-shot datasets: mini-REM12K and mini-MGCD

        ④Statistics of few shot graph classification datasets:

2.5.2. Configurations of Graph Neural Networks

        ①They chose DIFFPOOL and GIN as the two GNNs, where DIFFPOOL extracts the topological structure and GIN captures high-order neighborhood relationships

        ②Hyper-parameter optimization: grid search

2.5.3. Experiments on Typical Graph Classification

        ①Labeling rate: 0.5% and 1% on MiniGCDataset, 5% and 10% for others

        ②Evaluation: average performance over 10 runs

(1)Parameter Configurations

        ①Training epochs: 300 for the original GNNs and 200 for their semisupervised GNNs

        ②\lambda _{co}=0.001

        ③num_{pre}=30

        ④If epoch < num_{wmup}, then \lambda_{JS}=\lambda_{JS\_\max}\exp\left(-5\left(1-\mathrm{epoch}/num_{wmup}\right)^{2}\right); otherwise \lambda_{JS}=\lambda_{JS\_\max}. Here \lambda_{JS\_\max}=10 and num_{wmup}=30 (see the sketch after this list)

        ⑤top_k=5

        ⑥\beta =5

        ⑦Learning rate = 0.001, decayed by a factor of 0.5 every 80 epochs
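The warm-up schedule from item ④ above, as a small sketch; the values \lambda_{JS\_\max}=10 and num_{wmup}=30 are taken from the list:

```python
import math

def lambda_js(epoch, lambda_js_max=10.0, num_wmup=30):
    # Ramp lambda_JS up exponentially during warm-up, then keep it constant
    if epoch < num_wmup:
        return lambda_js_max * math.exp(-5.0 * (1.0 - epoch / num_wmup) ** 2)
    return lambda_js_max
```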

(2)Baseline Methods

        ①DIFFPOOL+ and GIN+: generated by SVM

        ②Strong non-GNN methods: graph2vec, Skip-Gram, RGM

(3)Results and Analysis

        ①Experimental results:

2.5.4. Experiments on Few-Shot Graph Classification

(1)Parameter Configurations

        ①\lambda_{fs}=20

        ②top_k=3

        ③\lambda_{JS} is set the same as in typical graph classification

        ④Training epoch: 40

        ⑤learning rate: 0.001

(2)Baseline Methods

        ①Similar to those for typical graph classification

(3)Results and Analysis

        ①Performance table:

2.5.5. Ablation Study

        ①Module ablation study:

2.5.6. Parameter Sensitivity Analysis

        ①\alpha_{DIFFPOOL} is the number of clusters after soft coarsening in DIFFPOOL, with \alpha_{DIFFPOOL}\in[0.05, 3] varied in increments of 0.05

        ②\alpha_{GIN} denotes the number of GNN layers and it varies from 3 to 7

        ③Observations when varying the hyperparameters:

        ④\lambda_{co}\in\{0.0001,0.0002,0.0005,0.001,0.002,0.005,0.01\} on NCI1:

        ⑤\lambda _{fs}\in\{1,2,5,10,20,50,100,200,500,1000\} on mini-REM12K and mini-MGCD:

2.6. Conclusion

        They want to further explore the problem of noisy labels

3. Reference

Xie, Y. et al. (2023) 'Semisupervised Graph Neural Networks for Graph Classification', IEEE Transactions on Cybernetics, 53(10): 6222-6235. doi:  10.1109/TCYB.2022.3164696

Semi-supervised classification with graph convolutional networks (GCNs) is a method for predicting labels for nodes in a graph. GCNs are a type of neural network that operates on graph-structured data, where each node in the graph represents an entity (such as a person, a product, or a webpage) and edges represent relationships between entities. The semi-supervised classification problem arises when we have a graph where only a small subset of nodes have labels, and we want to predict the labels of the remaining nodes. GCNs can be used to solve this problem by learning to propagate information through the graph, using the labeled nodes as anchors. The key idea behind GCNs is to use a graph convolution operation to aggregate information from a node's neighbors, and then use this aggregated information to update the node's representation. This operation is then repeated over multiple layers, allowing the network to capture increasingly complex relationships between nodes. To train a GCN for semi-supervised classification, we use a combination of labeled and unlabeled nodes as input, and optimize a loss function that encourages the network to correctly predict the labels of the labeled nodes while also encouraging the network to produce smooth predictions across the graph. Overall, semi-supervised classification with GCNs is a powerful and flexible method for predicting labels on graph-structured data, and has been successfully applied to a wide range of applications including social network analysis, drug discovery, and recommendation systems.
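As a minimal illustration of the propagation rule described above, here is a generic sketch of one graph-convolution layer using the common symmetric normalization (not tied to this particular paper):

```python
import torch

def gcn_layer(adj, h, weight):
    # adj: (n, n) float adjacency with self-loops; h: (n, d_in) node features; weight: (d_in, d_out)
    deg = adj.sum(dim=1)
    d_inv_sqrt = deg.clamp(min=1e-12).pow(-0.5)
    adj_norm = d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]   # D^{-1/2} (A + I) D^{-1/2}
    return torch.relu(adj_norm @ h @ weight)                     # aggregate neighbors, then transform
```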