1. TL;DR

1.1. Takeaways

2. Close Reading of the Paper

2.1. Abstract

2.2. Introduction

2.3. Related Work and Motivation

2.3.1. Graph Neural Networks for Graph Classification

2.3.2. Semisupervised Learning in Graph Classification

2.3.3. Few-Shot Learning in Graph Classification

2.4. Methodology

2.4.1. Problem Formulation and Notations

2.4.2. Framework

2.4.3. Implementation on Typical Graph Classification

2.4.4. Implementation on Few-Shot Graph Classification

2.5. Experimental Study

2.5.1. Datasets

2.5.2. Configurations of Graph Neural Networks

2.5.3. Experiments on Typical Graph Classification

2.5.4. Experiments on Few-Shot Graph Classification

2.5.5. Ablation Study

2.5.6. Parameter Sensitivity Analysis

2.6. Conclusion

3. Reference

2.1. Abstract

        ①GNNs have achieved SOTA results in the purely supervised setting

        ②Semisupervised learning is usually applied to node classification

        ③They train 2 GNNs as complementary views

2.2. Introduction

        ①Graph kernel methods are two-stage methods and are time-consuming

        ②The performance of existing methods relies on labeled data

        ③They proposed a semisupervised GNN framework for graph classification based on co-training and self-training

arduous  adj. difficult; strenuous

2.3. Related Work and Motivation

2.3.1. Graph Neural Networks for Graph Classification

        ①The authors list PATCHY-SAN, MPNN, DGCNN, GAM, DIFFPOOL and GIN

        ②These experiments do not contain unlabeled data

2.3.2. Semisupervised Learning in Graph Classification

        ①Semisupervised methods: self-training, co-training, label propagation

        ②Assumptions of semisupervised methods: the smoothness assumption, the cluster assumption, and the manifold assumption

        ③⭐Unlike nodes, which are connected by edges, graphs have no connections between them

2.3.3. Few-Shot Learning in Graph Classification

        ①Example: a prototype model, which takes the average of the samples in each class as the class prototype:


        ②The probability that a graph belongs to a class:
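The two formulas above are not reproduced in these notes. As a rough illustration only, the sketch below assumes the standard prototypical-network formulation: a prototype is the mean embedding of a class, and the class probability is a softmax over negative squared Euclidean distances. The distance choice, the function names, and the toy embeddings are assumptions, not taken from the paper.

```python
import math


def class_prototypes(embeddings, labels):
    """Average the embeddings of each class to get one prototype per class."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def class_probabilities(query, prototypes):
    """Softmax over negative squared Euclidean distances (assumed metric)."""
    logits = {
        c: -sum((q - p) ** 2 for q, p in zip(query, proto))
        for c, proto in prototypes.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {c: math.exp(value - top) for c, value in logits.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


# Toy 2-D graph embeddings (hypothetical values), two classes.
support = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
support_labels = [0, 0, 1, 1]
prototypes = class_prototypes(support, support_labels)
probs = class_probabilities([0.5, 1.0], prototypes)
```

In this toy run the query embedding sits next to the class 0 prototype, so class 0 receives the larger probability.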


2.4. Methodology

2.4.1. Problem Formulation and Notations

(1)Supervised Graph Classification:

        ①Training set: \mathcal{D}_{\mathrm{training}}=\{(x_1,y_1),\ldots,(x_l,y_l)\}

        ②Goal: map the test data \mathcal{D}_{test}=\{(x_{l+1}),\ldots,(x_{l+u})\} to labels

(2)Semisupervised Graph Classification:

        ①Training set: \mathcal{D}_{\mathrm{training}}=\{(x_1,y_1),\ldots,(x_l,y_l),(x_{l+1}),\ldots,(x_{l+u})\}

        ②Common notations:

2.4.2. Framework

        ①They adopt a pre-training strategy for the first num_{pre} epochs

        ②For the same graph, the disagreement between the 2 classifiers' predictions is measured by the Jensen–Shannon divergence:

\begin{aligned}\ell_{JS}(x;\Theta1,\Theta2)&=\sum_{x_{i}\in\mathcal{D}_{U}}\Big(H\left(\frac{1}{2}(Z_1(x_{i})+Z_2(x_{i}))\right)\\&- \frac{1}{2}(H(Z_1(x_{i}))+H(Z_2(x_{i})))\Big)\end{aligned}

where Z_1 and Z_2 denote the 2 softmax scores output by the 2 classifiers
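A minimal sketch of this consistency term, assuming natural-log entropy and plain Python lists for the softmax scores. The function names are illustrative and not from the paper.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def js_divergence(z1, z2):
    """H(mean of the two score vectors) minus the mean of their entropies."""
    mixture = [0.5 * (a + b) for a, b in zip(z1, z2)]
    return entropy(mixture) - 0.5 * (entropy(z1) + entropy(z2))


def js_loss(view1_scores, view2_scores):
    """Sum the divergence over all unlabeled graphs."""
    return sum(js_divergence(z1, z2) for z1, z2 in zip(view1_scores, view2_scores))


agree = js_divergence([0.7, 0.3], [0.7, 0.3])     # identical predictions
disagree = js_divergence([1.0, 0.0], [0.0, 1.0])  # opposite one-hot predictions
total = js_loss([[0.7, 0.3], [1.0, 0.0]], [[0.7, 0.3], [0.0, 1.0]])
```

The term vanishes when the two views agree and reaches log 2 for two opposite one-hot predictions, so minimizing it pushes the two GNNs toward consistent predictions on unlabeled graphs.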

        ③Supervised loss is applied on labeled data:

\mathcal{L}_s(x,y;\Theta) = \sum_{x_i\in\mathcal{D}_L}\ell_{CE}(\operatorname{argmax}(Z(x_i)),y_i)

        ④Total pre-training loss for the two GNNs:

\begin{aligned}\mathcal{L}_{\mathrm{pre}}&=\mathcal{L}_{s}(x,y;\Theta1)+\mathcal{L}_{s}(x,y;\Theta2)\\&+ \lambda_{JS}\ell_{JS}(x;\Theta1,\Theta2)\end{aligned}

        ⑤The two GNNs, acting as different views, assign pseudo labels for each other

        ⑥They assign weight to each unlabeled sample:


where H(\cdot) denotes the entropy function and \log(N) is the maximum possible entropy in \mathbb{R}^N
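The weight formula itself is missing from these notes. One form consistent with the description (entropy normalized by its maximum log(N)) is \omega_i = 1 - H(Z(x_i))/\log(N). That exact form is an assumption, and the sketch below only illustrates it.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def confidence_weight(scores):
    """ASSUMED form: 1 - H(Z) / log(N), where N is the number of classes.

    Confident (low-entropy) predictions get a weight near 1,
    near-uniform predictions get a weight near 0.
    """
    n_classes = len(scores)
    return 1.0 - entropy(scores) / math.log(n_classes)


confident = confidence_weight([1.0, 0.0, 0.0])
uncertain = confidence_weight([1 / 3, 1 / 3, 1 / 3])
middle = confidence_weight([0.8, 0.1, 0.1])
```

Under this reading, pseudo labels from uncertain predictions contribute little to the loss, which limits the damage done by wrong pseudo labels.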

        ⑦To weaken the impact of class imbalance, they add another weight \gamma_{j}, j=1,\ldots,N, defined by:


where L_j denotes the number of labeled samples of class j in \mathcal{D}_L and U_j denotes the number of pseudo labeled samples of class j in \mathcal{D}_U

        ⑧Loss for unlabeled data (by minimizing the pseudo labeled samples loss):

\mathcal{L}_{\mathrm{pseudo}}(x,\widehat{y};\Theta) = \sum_{x_{i}\in\mathcal{D}_{U}}\omega_{i}\gamma_{\widehat{y}_{i}}\ell_{CE}(\mathrm{argmax}(Z(x_{i})),\widehat{y}_{i})

where \widehat{y}_{i} denotes the pseudo label for unlabeled data assigned by the other view

        ⑨The overall loss function for co-training:

\begin{aligned}\mathcal{L}_{co}&=\lambda_{co}\mu\mathcal{L}_{s}(x,y;\Theta1)+\mathcal{L}_{\mathrm{pseudo}}(x,\widehat{y}2;\Theta1)\\&+ \lambda_{co}\mu\mathcal{L}_{s}(x,y; \Theta2)+\mathcal{L}_{\mathrm{pseudo}}(x,\widehat{y}1; \Theta2)\end{aligned}

where \lambda_{co} denotes the tradeoff factor between truly labeled and pseudo labeled samples, and \mu is an additional weight for truly labeled examples, with \mu=|\mathcal{D}_{mb^{\prime}}|/|\mathcal{D}_{mb^{\prime}}\cap\mathcal{D}_{L}|

        ⑩They reset the pseudo labeled samples every \beta epochs to limit the harm of accumulated errors

        ⑪The supervised loss on pseudo labeled samples for self-training:

\mathcal{L}_{\mathrm{pseudo\_self}}(x,\widehat{y};\Theta)=\sum_{x_{i}\in\mathcal{D}_{L^{\prime}}}\ell_{CE}(\mathrm{argmax}(Z(x_{i})),\widehat{y}_{i})

where \widehat{y}_{i} denotes the pseudo label for unlabeled data assigned by the model's own view

        ⑫The overall self-training loss function:

\mathcal{L}_{\mathrm{self}}=\mathcal{L}_{\mathrm{pseudo\_self}}(x,\widehat{y}1;\Theta1)+\mathcal{L}_{\mathrm{pseudo\_self}}(x,\widehat{y}2;\Theta2)

        ⑬The overall loss in the model:


        ⑭The workflow of this model:

        ⑮Algorithm of this model:
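As a small worked example of the co-training loss in item ⑨, the sketch below combines already-computed scalar losses. Reading \mathcal{D}_{mb^{\prime}} as the current mini-batch is an assumption (the notes do not define it), and the function names and numeric loss values are hypothetical.

```python
def labeled_weight(batch_ids, labeled_ids):
    """mu = |D_mb'| / |D_mb' ∩ D_L|, with D_mb' read as the current mini-batch."""
    batch = set(batch_ids)
    n_labeled = len(batch & set(labeled_ids))
    if n_labeled == 0:
        raise ValueError("mini-batch contains no labeled graphs")
    return len(batch) / n_labeled


def co_training_loss(sup1, sup2, pseudo1_from2, pseudo2_from1, lambda_co, mu):
    """Each view: weighted supervised loss plus loss on the other view's pseudo labels."""
    view1 = lambda_co * mu * sup1 + pseudo1_from2
    view2 = lambda_co * mu * sup2 + pseudo2_from1
    return view1 + view2


# A batch of 8 graphs, 2 of which carry true labels (ids 0 and 1).
mu = labeled_weight(batch_ids=[0, 1, 2, 3, 4, 5, 6, 7], labeled_ids=[0, 1, 100])
loss = co_training_loss(
    sup1=0.5, sup2=0.25,                   # hypothetical supervised losses
    pseudo1_from2=1.0, pseudo2_from1=2.0,  # hypothetical pseudo-label losses
    lambda_co=0.001, mu=mu,
)
```

Because \mu grows as the share of truly labeled graphs in a batch shrinks, the few true labels are up-weighted relative to the many pseudo labels.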

ameliorate  vt. to improve; to make better

2.4.3. Implementation on Typical Graph Classification

        ①This framework can be combined with any GNN

2.4.4. Implementation on Few-Shot Graph Classification

        ①They combine their framework with prototypical network:


        ②Framework applied in few shot classification:

        ③Pseudo label generation:

2.5. Experimental Study

2.5.1. Datasets

        ①7 classic graph classification datasets: NCI1, NCI109, D&D, COLLAB, REDDIT-MULTI-12K, MiniGCDataset, and DBLP_v1

        ②Statistics of classic graph classification datasets:

        ③2 few-shot datasets: mini-REM12K and mini-MGCD

        ④Statistics of few shot graph classification datasets:

2.5.2. Configurations of Graph Neural Networks

        ①They chose DIFFPOOL and GIN as the two GNNs: DIFFPOOL extracts the topological structure, while GIN keeps the high-order neighbor relationships

        ②Hyper-parameter optimization: grid search

2.5.3. Experiments on Typical Graph Classification

        ①Labeling rate: 0.5% and 1% on MiniGCDataset, 5% and 10% for others

        ②Evaluation: average performance over 10 runs

(1)Parameter Configurations

        ①Training epoch: 300 for original GNNs and 200 for their semisupervised GNNs

        ②\lambda _{co}=0.001


        ④If epoch < num_{wmup}, \lambda_{JS}=\lambda_{JS\_\max}\exp(-5(1-\mathrm{epoch}/num_{wmup})^2); otherwise \lambda_{JS}=\lambda_{JS\_\max}, where \lambda_{JS\_\max}=10 and num_{wmup}=30


        ⑥\beta =5

        ⑦Learning rate = 0.001, decayed by a factor of 0.5 every 80 epochs
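The warm-up rule for \lambda_{JS} in item ④ can be sketched as a small function. The constants are the values listed above, while the function and variable names are illustrative.

```python
import math

LAMBDA_JS_MAX = 10.0  # maximum weight of the JS term, from item ④
NUM_WMUP = 30         # number of warm-up epochs, from item ④


def lambda_js(epoch, lambda_max=LAMBDA_JS_MAX, num_wmup=NUM_WMUP):
    """Gaussian-shaped ramp-up during warm-up, constant afterwards."""
    if epoch < num_wmup:
        return lambda_max * math.exp(-5.0 * (1.0 - epoch / num_wmup) ** 2)
    return lambda_max


# Weight of the JS term at a few epochs.
schedule = {epoch: lambda_js(epoch) for epoch in (0, 15, 29, 30, 100)}
```

The weight starts near zero (10·exp(-5) at epoch 0) and rises smoothly to 10 by epoch 30, so the consistency term only becomes influential once both GNNs have had some supervised training.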

(2)Baseline Methods

        ①DIFFPOOL+ and GIN+: generated by SVM

        ②Strong non-GNN methods: graph2vec, Skip-Gram, RGM

(3)Results and Analysis

        ①Experimental results:

2.5.4. Experiments on Few-Shot Graph Classification

(1)Parameter Configurations

        ①\lambda _{fs}=20


        ③\lambda _{JS} is the same as in typical graph classification

        ④Training epoch: 40

        ⑤Learning rate: 0.001

(2)Baseline Methods

        ①Similar to those used for typical graph classification

(3)Results and Analysis

        ①Performance table:

2.5.5. Ablation Study

        ①Module ablation study:

2.5.6. Parameter Sensitivity Analysis

        ①\alpha_{DIFFPOOL} is the number of clusters after soft coarsening in DIFFPOOL; it ranges over \left [ 0.05,3 \right ] in increments of 0.05

        ②\alpha_{GIN} denotes the number of GNN layers and varies from 3 to 7

        ③Results under varying hyperparameters:

        ④\lambda _{co}\in\{0.0001,0.0002,0.0005,0.001,0.002,0.005,0.01\} on NCI1:

        ⑤\lambda _{fs}\in\{1,2,5,10,20,50,100,200,500,1000\} on mini-REM12K and mini-MGCD:

2.6. Conclusion

        They plan to further explore how to handle noisy labels

3. Reference

Xie, Y. et al. (2023) 'Semisupervised Graph Neural Networks for Graph Classification', IEEE Transactions on Cybernetics, 53(10): 6222-6235. doi: 10.1109/TCYB.2022.3164696

