[Paper Notes] AM-GCN: Adaptive Multi-channel Graph Convolutional Networks

Paper: [2007.02265] AM-GCN: Adaptive Multi-channel Graph Convolutional Networks (arxiv.org)

Code: GitHub - zhumeiqiBUPT/AM-GCN: AM-GCN: Adaptive Multi-channel Graph Convolutional Networks

The English here is typed entirely by hand, summarizing and paraphrasing the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments. This post is written as study notes, so read it with that in mind.

Contents

1. TL;DR

1.1. Takeaways

1.2. Paper Summary Diagram

2. Section-by-Section Reading

2.1. Abstract

2.2. Introduction

2.3. Fusion Capability of GCNs: an Experimental Investigation

2.3.1. Case 1: Random Topology and Correlated Node Features

2.3.2. Case 2: Correlated Topology and Random Node Features

2.4. AM-GCN: the Proposed Model

2.4.1. Specific Convolution Module

2.4.2. Common Convolution Module

2.4.3. Attention Mechanism

2.4.4. Objective Function

2.5. Experiments

2.5.1. Experimental Setup

2.5.2. Node Classification

2.5.3. Analysis of Variants

2.5.4. Visualization

2.5.5. Analysis of Attention Mechanism

2.5.6. Parameter Study

2.6. Related Work

2.7. Conclusion

2.8. Supplement

2.8.1.  Experiments Settings

2.8.2.  Baselines and Datasets

2.8.3.  Implementation Details

2.8.4. Additional Results

3. Background Knowledge

3.1. Stochastic Blockmodel

3.2. Hilbert-Schmidt Independence Criterion (HSIC) 

3.3. Gram matrix

3.4.  t-SNE

4. Reference List


1. TL;DR

1.1. Takeaways

(1) Cases 1 and 2 feel a bit odd: they ask GCN to discover rules like "the topology is useless but the node features are useful" or "the node features are useful but the topology is not" and to weight the two sources accordingly. Still, it is an interesting point

(2) It is rare to see mathematical writing this clear; there was nothing I could not follow

(3) The dataset statistics table is a nice touch

(4) ⭐Using the attention distributions to analyze how important topology and features are, and discussing performance from there — really clever

(5) Good work deserves praise; worth a careful read

1.2. Paper Summary Diagram

2. Section-by-Section Reading

2.1. Abstract

        ①Even the most state-of-the-art GCNs fail to adequately fuse node features and topological structure

        ②They propose an adaptive multi-channel graph convolutional network for semi-supervised classification (AM-GCN) to solve this problem

2.2. Introduction

        ①GCNs do not possess the ability to appropriately fuse node features and topological structure (which may be why GAT is more advocated?)

2.3. Fusion Capability of GCNs: an Experimental Investigation

2.3.1. Case 1: Random Topology and Correlated Node Features

        ①Generating a network randomly:

Number of nodes: 900
Edge probability between any two nodes: 0.03
Node feature vectors (drawn from Gaussian distributions): 50 dimensions
Labels (node classes; the class Gaussians share the same covariance matrix): 3

⭐The labels depend mainly on the node features rather than on the topological structure, so the result shows how strongly GCNs are influenced by topological information. (But I would also ask: if the node features already carry the useful information, why force connections at all? Such fictitious edges are meaningless to begin with and only interfere with the model; wiring nodes up at random is also unrealistic — people in the real world do not get acquainted at random.)

        ②GNN: GCN

        ③Training set: 20 randomly selected nodes per class

        ④Test set: another 200 randomly selected nodes per class

        ⑤Performance (accuracy) of GCN / MLP: 75.2% / 100% (a small generation sketch for this setup follows)
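A minimal numpy sketch of how such a random-topology, feature-correlated graph could be generated. This is not the authors' code; the per-class mean vectors are illustrative, since the note above only states that the three feature Gaussians share the same covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, n_classes = 900, 50, 3          # 900 nodes, 50-d features, 3 labels

# Random topology: every pair of nodes is connected with probability 0.03
A = (rng.random((n, n)) < 0.03).astype(float)
A = np.triu(A, 1); A = A + A.T        # symmetric, no self-loops

# Correlated features: one Gaussian per class, shared (identity) covariance
y = np.repeat(np.arange(n_classes), n // n_classes)   # 300 nodes per class
means = rng.normal(size=(n_classes, d))               # illustrative class means
X = means[y] + rng.normal(size=(n, d))                # labels depend only on X, not on A
```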

2.3.2. Case 2: Correlated Topology and Random Node Features

        ①Generating a network randomly:

Number of nodes: 900
Edge probability between two nodes in the same community: 0.03
Edge probability between two nodes in different communities: 0.0015
Node feature vectors (random): 50 dimensions
Labels (node classes / communities, generated by a Stochastic Blockmodel (SBM)): 3

The labels depend mainly on the topological structure rather than on the node features

        ②Performance (accuracy) of GCN / DeepWalk: 87% / 100%, because DeepWalk ignores the node features. (⭐This is actually quite strange: the authors compare GCN, which fuses node features and topology, against a features-only classifier and a topology-only classifier, and conclude that GCN cannot adaptively lean all the way toward one side. There is some truth to this, but it also seems to run against GCN's original intention of fusing multiple sources of information.)

2.4. AM-GCN: the Proposed Model

        ①Problem setting: semi-supervised node classification on a graph G = (\mathbf{A},\mathbf{X}), where

\mathbf{A}\in\mathbb{R}^{n\times n} denotes the symmetric adjacency matrix,

n denotes the number of nodes,

\mathbf{X}\in\mathbb{R}^{n\times d} denotes the node feature matrix,

d denotes the node feature dimension

        ②The schematic of AM-GCN:

2.4.1. Specific Convolution Module

(1)Feature graph

        ①Based on node feature matrix \mathbf{X}, the authors construct a k-nearest neighbor (kNN) graph G_f=\left ( \mathbf{A}_f,\mathbf{X} \right )

        ②Calculating the similarity matrix \mathbf{S} \in \mathbb{R}^{n \times n} in one of the following two ways (where \mathbf{x}_i, \mathbf{x}_j are node feature vectors):

        ③Cosine Similarity (their choice):

\mathbf{S}_{ij}=\frac{\mathbf{x}_i\cdot\mathbf{x}_j}{|\mathbf{x}_i||\mathbf{x}_j|}

        ④Heat Kernel:

\mathbf{S}_{ij}=e^{-\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}}{t}}

where t denotes the time parameter of the heat conduction equation; they manually set it to 2

        ⑤The output \mathbf{Z}_{f}^{(l)} in the l-th layer with input G_f=\left ( \mathbf{A}_f,\mathbf{X} \right ) can be:

\mathbf{Z}_{f}^{(l)}=ReLU(\tilde{\mathbf{D}}_{f}^{-\frac{1}{2}}\tilde{\mathbf{A}}_{f}\tilde{\mathbf{D}}_{f}^{-\frac{1}{2}}\mathbf{Z}_{f}^{(l-1)}\mathbf{W}_{f}^{(l)})

and, "as we all know", \mathbf{Z}_{f}^{(0)}=\mathbf{X}. The final output \mathbf{Z}_{F} is taken as the embedding learned in the feature space (a small sketch of this module follows)
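A minimal PyTorch sketch, not the authors' code, of how the kNN feature graph and one step of GCN propagation could look; `knn_graph`, `gcn_propagate`, and k=5 are my own illustrative names and choices:

```python
import torch
import torch.nn.functional as F

def knn_graph(X, k=5):
    """Build the kNN adjacency A_f from cosine similarity of node features."""
    S = F.normalize(X, dim=1) @ F.normalize(X, dim=1).T   # cosine similarity, n x n
    S.fill_diagonal_(-1.0)                                 # exclude self from neighbors
    idx = S.topk(k, dim=1).indices                         # k most similar nodes per row
    A = torch.zeros_like(S).scatter_(1, idx, 1.0)
    return ((A + A.T) > 0).float()                         # symmetrize

def gcn_propagate(A, Z, W):
    """One GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} Z W)."""
    A_hat = A + torch.eye(A.size(0))
    d_inv_sqrt = A_hat.sum(1).pow(-0.5)
    A_norm = d_inv_sqrt.unsqueeze(1) * A_hat * d_inv_sqrt.unsqueeze(0)
    return F.relu(A_norm @ Z @ W)

# Z_f^{(0)} = X; stacking two such layers yields the feature-space embedding Z_F
```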

(2)Topology graph

        ①Analogous to the feature graph module, but with \mathbf{A}_t=\mathbf{A}, i.e. the original adjacency is used directly (no kNN filtering); the final output is \mathbf{Z}_{T}

2.4.2. Common Convolution Module

        ①The information in the topology space and in the feature space is not completely independent, so it is necessary to extract the information shared by the two spaces with a Common-GCN

        ②They simply share the weights between the two channels (see the sketch after the equations):

\mathbf{Z}_{ct}^{(l)}=ReLU(\tilde{\mathbf{D}}_{t}^{-\frac{1}{2}}\tilde{\mathbf{A}}_{t}\tilde{\mathbf{D}}_{t}^{-\frac{1}{2}}\mathbf{Z}_{ct}^{(l-1)}\mathbf{W}_{c}^{(l)})

\mathbf{Z}_{cf}^{(l)}=ReLU(\tilde{\mathbf{D}}_{f}^{-\frac{1}{2}}\tilde{\mathbf{A}}_{f}\tilde{\mathbf{D}}_{f}^{-\frac{1}{2}}\mathbf{Z}_{cf}^{(l-1)}\mathbf{W}_{c}^{(l)})

        ③Then combining the two outputs \mathbf{Z}_{CT} and \mathbf{Z}_{CF} to common embedding \mathbf{Z}_{C}:

\mathbf{Z}_{C}=(\mathbf{Z}_{CT}+\mathbf{Z}_{CF})/2
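A minimal sketch of the weight sharing, reusing the hypothetical gcn_propagate above; d_in, d_out, A_t, A_f, and X are assumed to be defined elsewhere:

```python
# Shared weight W_c: the only difference between the two channels is the graph used
W_c = torch.nn.Parameter(torch.empty(d_in, d_out))
torch.nn.init.xavier_uniform_(W_c)

Z_ct = gcn_propagate(A_t, X, W_c)      # propagate over the topology graph
Z_cf = gcn_propagate(A_f, X, W_c)      # propagate over the feature graph with the SAME weights
Z_c  = (Z_ct + Z_cf) / 2               # common embedding
```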

2.4.3. Attention Mechanism

        ①The attention value:

(\boldsymbol{\alpha}_{t},\boldsymbol{\alpha}_{c},\boldsymbol{\alpha}_{f})=att(\mathbf{Z}_{T},\mathbf{Z}_{C},\mathbf{Z}_{F})

where each \boldsymbol{\alpha} \in\mathbb{R}^{n \times 1}

        ②For the i-th row of \mathbf{Z}_{T}, i.e. the embedding of the i-th node \mathbf{z}_{T}^{i}\in \mathbb{R}^{1\times h}, first transform it nonlinearly, then obtain its attention value through the shared attention vector \mathbf{q}\in\mathbb{R}^{h^{\prime}\times1}:

\omega_{T}^{i}=\mathbf{q}^{T}\cdot tanh(\mathbf{W}\cdot\left(\mathbf{z}_{T}^{i}\right)^{T}+\mathbf{b})

and likewise for \omega_{C}^{i} and \omega_{F}^{i}

        ③Normalization:

\alpha_T^i=softmax(\omega_T^i)=\frac{exp(\omega_T^i)}{exp(\omega_T^i)+exp(\omega_C^i)+exp(\omega_F^i)}

and likewise for \alpha_{C}^{i} and \alpha_{F}^{i}

        ④Denoting \boldsymbol{\alpha}_{T} = diag(\boldsymbol{\alpha}_{t}), \boldsymbol{\alpha}_{C} = diag(\boldsymbol{\alpha}_{c}), \boldsymbol{\alpha}_{F} = diag(\boldsymbol{\alpha}_{f}), the final embedding will be:

\mathbf{Z}=\boldsymbol{\alpha}_T\cdot\mathbf{Z}_T+\boldsymbol{\alpha}_C\cdot\mathbf{Z}_C+\boldsymbol{\alpha}_F\cdot\mathbf{Z}_F
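A minimal sketch of this attention module, assuming embeddings Z_T, Z_C, Z_F of shape (n, h) and a projection dimension h'; the class and variable names are mine, not the authors':

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, h, h_prime):
        super().__init__()
        self.proj = nn.Linear(h, h_prime)                # W, b
        self.q = nn.Parameter(torch.randn(h_prime, 1))   # shared attention vector

    def forward(self, Z_T, Z_C, Z_F):
        # omega_* : (n, 1) attention score per node and per channel
        scores = [torch.tanh(self.proj(Z)) @ self.q for Z in (Z_T, Z_C, Z_F)]
        alpha = torch.softmax(torch.cat(scores, dim=1), dim=1)   # (n, 3), rows sum to 1
        # node-wise weighted sum of the three embeddings
        return alpha[:, 0:1] * Z_T + alpha[:, 1:2] * Z_C + alpha[:, 2:3] * Z_F
```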

2.4.4. Objective Function

(1)Consistency Constraint

        ①First normalize \mathbf{Z}_{CT} and \mathbf{Z}_{CF} row-wise (L2 normalization) into \mathbf{Z}_{CTnor} and \mathbf{Z}_{CFnor}

        ②The similarity matrix can be:

\mathbf{S}_{T}=\mathbf{Z}_{CTnor}\cdot\mathbf{Z}_{CTnor}^{T}

\mathbf{S}_{F}=\mathbf{Z}_{CFnor}\cdot\mathbf{Z}_{CFnor}^{T}

        ③The consistency constraint then requires the two similarity matrices to be close:

\mathcal{L}_{c}=\|\mathbf{S}_{T}-\mathbf{S}_{F}\|_{F}^{2}
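A minimal sketch of this consistency loss, assuming row-wise L2 normalization (my reading of "normalize"):

```python
import torch.nn.functional as F

def consistency_loss(Z_ct, Z_cf):
    S_t = F.normalize(Z_ct, dim=1) @ F.normalize(Z_ct, dim=1).T   # similarity from the topology channel
    S_f = F.normalize(Z_cf, dim=1) @ F.normalize(Z_cf, dim=1).T   # similarity from the feature channel
    return ((S_t - S_f) ** 2).sum()                               # squared Frobenius norm
```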

(2)Disparity Constraint

        ①They apply Hilbert-Schmidt Independence Criterion (HSIC) to calculate the independent loss:

HSIC(\mathbf{Z}_T,\mathbf{Z}_{CT})=(n-1)^{-2}tr(\mathbf{R}\mathbf{K}_T\mathbf{R}\mathbf{K}_{CT})

where both \mathbf{K}_T and \mathbf{K}_{CT} are Gram matrices with k_{T,ij}=k_{T}(\mathbf{z}_{T}^{i},\mathbf{z}_{T}^{j}) and k_{CT,ij}=k_{CT}(\mathbf{z}_{CT}^{i},\mathbf{z}_{CT}^{j});

\mathbf{R}=\mathbf{I}-\frac{1}{n}ee^{T}, where \mathbf{I} denotes the identity matrix and e denotes an all-one column vector

        ②Similarly, for the feature channel:

HSIC(\mathbf{Z}_{F},\mathbf{Z}_{CF})=(n-1)^{-2}tr(\mathbf{R}\mathbf{K}_{F}\mathbf{R}\mathbf{K}_{CF})

        ③Combining two loss together:

\mathcal{L}_{d}=HSIC(\mathbf{Z}_{T},\mathbf{Z}_{CT})+HSIC(\mathbf{Z}_{F},\mathbf{Z}_{CF})
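A minimal sketch of this disparity constraint, assuming a linear (inner-product) kernel for the Gram matrices, which is a common simple choice; Z_T, Z_F, Z_ct, Z_cf are the embeddings defined above:

```python
import torch

def hsic(Z1, Z2):
    """Empirical HSIC with a linear kernel: (n-1)^{-2} tr(R K1 R K2)."""
    n = Z1.size(0)
    K1, K2 = Z1 @ Z1.T, Z2 @ Z2.T                   # Gram matrices
    R = torch.eye(n) - torch.ones(n, n) / n         # centering matrix R = I - (1/n) e e^T
    return torch.trace(R @ K1 @ R @ K2) / (n - 1) ** 2

L_d = hsic(Z_T, Z_ct) + hsic(Z_F, Z_cf)             # disparity constraint
```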

(3)Optimization Objective

        ①The predictions are: 

\hat{\mathbf{Y}}=[\hat{y}_{ic}]\in\mathbb{R}^{n\times C}=softmax(\mathbf{W}\cdot\mathbf{Z}+\mathbf{b})

        ②Cross-entropy loss over the labeled node set L between the ground truth \mathbf{Y} and the predictions \hat{\mathbf{Y}}:

\mathcal{L}_{t}=-\sum_{l\in L}\sum_{i=1}^{C}\mathbf{Y}_{li}\mathrm{ln}\hat{\mathbf{Y}}_{li}

        ③The total loss:

\mathcal{L}=\mathcal{L}_{t}+\gamma\mathcal{L}_{c}+\beta\mathcal{L}_{d}
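A minimal sketch of the overall objective, reusing the hypothetical consistency_loss and hsic above; W_out, b_out, idx_train, y, gamma, and beta are assumed to be defined (gamma and beta correspond to the coefficients searched in the experiments). Note that F.cross_entropy applies the softmax internally:

```python
import torch.nn.functional as F

logits = Z @ W_out + b_out            # (n, C) class scores; W_out, b_out are learnable
L_t = F.cross_entropy(logits[idx_train], y[idx_train])   # cross-entropy on labeled nodes only (y: class indices)
loss = L_t + gamma * consistency_loss(Z_ct, Z_cf) + beta * (hsic(Z_T, Z_ct) + hsic(Z_F, Z_cf))
loss.backward()                       # one optimization step with Adam follows
```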

2.5. Experiments

2.5.1. Experimental Setup

(1)Datasets

        ①The statistics of their datasets:

(2)Baselines

        ①Network embedding algorithms: DeepWalk and LINE

        ②Graph neural network: Chebyshev, GCN, kNN-GCN, GAT, DEMO-Net, and MixHop

(3)Parameters Setting

        ①Label rate for semi-supervised learning: 20, 40, 60 per class

        ②Three 2-layer GCNs are trained with the same hidden-layer dimension (nhid1 \in \left \{ 512,768 \right \}) and the same output dimension (nhid2 \in \left \{ 32, 128,256 \right \})

        ③Learning rate: 0.0001-0.0005

        ④Optimizer: Adam

        ⑤Dropout rate: 0.5

        ⑥Weight decay: 5e-3 or 5e-4

        ⑦k of kNN: 2,3,...,10

        ⑧Searching the coefficient of \mathcal{L}_{c} in \left \{ 0.01,0.001,0.0001 \right \} and the coefficient of \mathcal{L}_{d} in \left \{ 1e-10,5e-9,1e-9,5e-8,1e-8 \right \}

        ⑨Results averaged over 5 runs

        ⑩Metrics: ACC and F1

2.5.2. Node Classification

        ①Performance comparison:

where L/C denotes the number of labeled nodes per class

2.5.3. Analysis of Variants

        ①The ablation study:

AM-GCN-w/o: AM-GCN without \mathcal{L}_{c} and \mathcal{L}_{d}
AM-GCN-c: AM-GCN with \mathcal{L}_{c}
AM-GCN-d: AM-GCN with \mathcal{L}_{d}

        ②The results table:

2.5.4. Visualization

        ①They visualize the last layer of AM-GCN/GCN/GAT/DeepWalk on BlogCatalog by t-SNE:
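A minimal sketch of such a visualization, assuming scikit-learn and matplotlib; Z is the last-layer embedding tensor and y the array of node labels:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

Z_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(Z.detach().cpu().numpy())
plt.scatter(Z_2d[:, 0], Z_2d[:, 1], c=y, s=5, cmap='tab10')
plt.title('t-SNE of the last-layer embedding')
plt.show()
```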

2.5.5. Analysis of Attention Mechanism

(1)Analysis of attention distributions

        ①Attention distributions with 20 labeled nodes per class:

(2)Analysis of attention trends

        ①Attention trends with 20 labeled nodes per class, where the x-axis denotes the training epoch and the y-axis denotes the average attention value

2.5.6. Parameter Study

(1)Analysis of consistency coefficient γ

        ①Varying test of γ:

(2)Analysis of disparity constraint coefficient β

        ①Varying test of β:

(3)Analysis of k-nearest neighbor graph k

        ①Varying test of k:

2.6. Related Work

        ①? I almost thought I had misread — why is the related-work section placed all the way back here?

        ②Listing some related works one by one

2.7. Conclusion

        Fine

2.8. Supplement

2.8.1.  Experiments Settings

        =。= This part just describes the hardware configuration used

2.8.2.  Baselines and Datasets

        ①Code:

DeepWalk, LINE: https://github.com/thunlp/OpenNE
Chebyshev: https://github.com/tkipf/gcn
GCN (PyTorch): https://github.com/tkipf/pygcn
GAT (PyTorch): https://github.com/Diego999/pyGAT/
DEMO-Net: https://github.com/jwu4sml/DEMO-Net
MixHop: https://github.com/samihaija/mixhop

        ②Dataset:

Citeseer: https://github.com/tkipf/pygcn
UAI2010: http://linqs.umiacs.umd.edu/projects//projects/lbc/index.html
ACM: https://github.com/Jhy1993/HAN
BlogCatalog: https://github.com/mengzaiqiao/CAN
Flickr: https://github.com/mengzaiqiao/CAN
CoraFull: https://github.com/abojchevski/graph2gauss/

2.8.3.  Implementation Details

        Nothing much is said here

2.8.4. Additional Results

(1)Analysis of attention trends

        ①Attention trends:

        ②Varying test of γ:

        ③Varying test of β:

        ④Varying test of k:

(2)Parameters Study

3. Background Knowledge

3.1. Stochastic Blockmodel

Reference 1: 图生成模型之随机块模型(stochastic block model)学习笔记 - 知乎 (zhihu.com)

Reference 2: 数据包络分析(超效率-SBM模型)附python代码_超效率sbm模型-CSDN博客

A sampling sketch is given below.
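A minimal numpy sketch, illustrative rather than the authors' code, of sampling a graph from a stochastic blockmodel using the intra-/inter-community edge probabilities from Case 2 above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_blocks = 900, 3
y = np.repeat(np.arange(n_blocks), n // n_blocks)       # community (block) assignment

# Edge probability depends only on the blocks of the two endpoints
P = np.where(y[:, None] == y[None, :], 0.03, 0.0015)    # within: 0.03, between: 0.0015
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                          # symmetric, no self-loops
```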

3.2. Hilbert-Schmidt Independence Criterion (HSIC) 

(1) Overview: The Hilbert-Schmidt Independence Criterion (HSIC) is a criterion for measuring the independence of two variables. Similar to mutual information, it evaluates independence by measuring the discrepancy between the variables' distributions. HSIC is built on covariance and can be viewed as an extension of it, describing relationships between variables in a more general way.

(2) Reference: HSIC简介:一个有意思的判断相关性的思路-CSDN博客

3.3. Gram matrix

(1) Overview: A Gram matrix is a matrix formed by the pairwise inner products of a set of vectors in an inner-product space. Specifically, for vectors [v1, v2, …, vn], the Gram matrix is a Hermitian matrix whose entry Gij is the inner product of vi and vj, i.e. Gij = ⟨vi, vj⟩. In image processing, for a given matrix A, the Gram matrix of its column vectors is AᵀA and that of its row vectors is AAᵀ.

(2) Reference: 格拉姆矩阵(Gram matrix)详细解读 - 知乎 (zhihu.com)

3.4.  t-SNE

Reference: t-SNE 原理及Python实例 - 知乎 (zhihu.com)

4. Reference List

Wang, X. et al. (2020) 'AM-GCN: Adaptive Multi-channel Graph Convolutional Networks', KDD 2020. doi: https://doi.org/10.48550/arXiv.2007.02265
