[Paper Close Reading] AM-GCN: Adaptive Multi-channel Graph Convolutional Networks

Paper: [2007.02265] AM-GCN: Adaptive Multi-channel Graph Convolutional Networks (arxiv.org)

Code: GitHub - zhumeiqiBUPT/AM-GCN: AM-GCN: Adaptive Multi-channel Graph Convolutional Networks

The English here is all typed by hand, summarizing and paraphrasing the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments. This post leans toward personal notes, so read with discretion.

Table of Contents

1. TL;DR

1.1. Takeaways

1.2. Paper Summary Figure

2. Section-by-Section Close Reading

2.1. Abstract

2.2. Introduction

2.3. Fusion Capability of GCNs: an Experimental Investigation

2.3.1. Case 1: Random Topology and Correlated Node Features

2.3.2. Case 2: Correlated Topology and Random Node Features

2.4. AM-GCN: the Proposed Model

2.4.1. Specific Convolution Module

2.4.2. Common Convolution Module

2.4.3. Attention Mechanism

2.4.4. Objective Function

2.5. Experiments

2.5.1. Experimental Setup

2.5.2. Node Classification

2.5.3. Analysis of Variants

2.5.4. Visualization

2.5.5. Analysis of Attention Mechanism

2.5.6. Parameter Study

2.6. Related Work

2.7. Conclusion

2.8. Supplement

2.8.1.  Experiments Settings

2.8.2.  Baselines and Datasets

2.8.3.  Implementation Details

2.8.4. Additional Results

3. Background Knowledge

3.1. Stochastic Blockmodel

3.2. Hilbert-Schmidt Independence Criterion (HSIC) 

3.3. Gram matrix

3.4.  t-SNE

4. Reference List


1. TL;DR

1.1. Takeaways

(1) Cases 1 and 2 feel a bit odd: the GCN is asked to discover rules like "topology is useless but features are useful" (and the reverse) and assign weights accordingly. Still, it is an interesting point.

(2) It is rare to see a paper whose mathematical writing is this clear; there was nothing I could not follow.

(3) The dataset statistics table is much appreciated.

(4) ⭐Analyzing the attention distributions to see how important topology and features are, and then discussing performance from there. Brilliant.

(5) Good work deserves its reward; worth savoring.

1.2. Paper Summary Figure

2. Section-by-Section Close Reading

2.1. Abstract

        ①Even state-of-the-art GCNs fail to adaptively fuse node features with the topological structure

        ②They propose Adaptive Multi-channel Graph Convolutional Networks for semi-supervised classification (AM-GCN) to address this problem

2.2. Introduction

        ①GCNs do not possess the ability to appropriately fuse node features and topological structure (perhaps this is why GAT is advocated more?)

2.3. Fusion Capability of GCNs: an Experimental Investigation

2.3.1. Case 1: Random Topology and Correlated Node Features

        ①Generating a network randomly:

Number of nodes: 900
Edge probability between any two nodes: 0.03
Node feature vectors (drawn from Gaussian distributions): 50 dimensions
Labels (node classes; the class-wise Gaussians share the same covariance matrix): 3

⭐The labels are determined mainly by the node features rather than by the topological structure, so the experiment shows how strongly GCNs are disturbed by (useless) topological information. (But I would ask: if the nodes already carry the useful information, why force connections at all? Such fictitious edges are meaningless by construction and deliberately interfere with the model. Building edges at random is also unrealistic; people in the real world do not form acquaintances at random.)

        ②Model: GCN

        ③Training set: 20 randomly chosen nodes per class

        ④Test set: another 200 randomly chosen nodes per class

        ⑤Performance of GCN/MLP: 75.2%/100%

2.3.2. Case 2: Correlated Topology and Random Node Features

        ①Generating a network randomly: 

Number of nodes: 900
Edge probability between two nodes in the same community: 0.03
Edge probability between two nodes in different communities: 0.0015
Node feature vectors (random): 50 dimensions
Labels (node classes/communities, generated by a Stochastic Blockmodel (SBM)): 3

Here the labels are determined mainly by the topological structure rather than by the node features.

        ②Performance of GCN/DeepWalk: 87%/100%; DeepWalk wins because it ignores the (random) node features. (⭐This is where it gets really odd: the authors compare GCN, which fuses node features and topology, against a features-only classifier and a topology-only classifier, and conclude that GCN cannot adaptively lean all the way toward one side. There is some truth in that, but it also seems to go against GCN's original intent of fusing multiple kinds of information.)
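To make the two cases concrete, here is a minimal numpy sketch of how such synthetic graphs can be generated (the function and parameter names are mine, not from the released code; Case 1 uses a uniform edge probability with class-dependent Gaussian features, Case 2 an SBM topology with pure-noise features):

```python
import numpy as np

def make_case(n=900, k=3, d=50, p_in=0.03, p_out=0.03,
              correlated_features=True, seed=0):
    """Generate a synthetic graph in the spirit of Case 1 / Case 2."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(k), n // k)               # 3 balanced classes

    # Topology: edge probability depends on whether two nodes share a class
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_in, p_out)
    upper = np.triu(rng.random((n, n)) < probs, 1)
    A = (upper | upper.T).astype(float)                    # symmetric, no self-loops

    # Features: class-dependent Gaussian means (Case 1) or pure noise (Case 2)
    if correlated_features:
        means = rng.normal(0, 1, size=(k, d))
        X = means[labels] + rng.normal(0, 1, size=(n, d))  # same covariance per class
    else:
        X = rng.normal(0, 1, size=(n, d))
    return A, X, labels

A1, X1, y1 = make_case(p_in=0.03, p_out=0.03)                               # Case 1
A2, X2, y2 = make_case(p_in=0.03, p_out=0.0015, correlated_features=False)  # Case 2
```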

2.4. AM-GCN: the Proposed Model

        ①Problem settings: semi-supervised node classification on a graph G = (\mathbf{A},\mathbf{X}), where

\mathbf{A}\in\mathbb{R}^{n\times n} denotes the symmetric adjacency matrix,

n denotes the number of nodes,

\mathbf{X}\in\mathbb{R}^{n\times d} denotes the node feature matrix,

d denotes the node feature dimension

        ②The schematic of AM-GCN:

2.4.1. Specific Convolution Module

(1)Feature graph

        ①Based on node feature matrix \mathbf{X}, the authors construct a k-nearest neighbor (kNN) graph G_f=\left ( \mathbf{A}_f,\mathbf{X} \right )

        ②The similarity matrix \mathbf{S} \in \mathbb{R}^{n \times n} is computed in one of the following two ways (\mathbf{x}_i denotes the feature vector of node i):

        ③Cosine Similarity (their choice):

\mathbf{S}_{ij}=\frac{\mathbf{x}_i\cdot\mathbf{x}_j}{|\mathbf{x}_i||\mathbf{x}_j|}

        ④Heat Kernel:

\mathbf{S}_{ij}=e^{-\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}}{t}}

where t denotes the time parameter in the heat conduction equation; they manually set it to 2
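A minimal sketch of the feature-graph construction described above, assuming numpy (`knn_graph` and its parameters are illustrative names, not from the authors' repository):

```python
import numpy as np

def knn_graph(X, k=5, metric="cosine", t=2.0):
    """Build the kNN adjacency A_f from the node feature matrix X (n x d)."""
    if metric == "cosine":
        Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
        S = Xn @ Xn.T                                   # cosine similarity
    else:
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        S = np.exp(-sq / t)                             # heat kernel
    np.fill_diagonal(S, -np.inf)                        # no self-edges
    A_f = np.zeros(S.shape)
    rows = np.arange(X.shape[0])[:, None]
    A_f[rows, np.argsort(-S, axis=1)[:, :k]] = 1.0      # keep k most similar nodes
    return np.maximum(A_f, A_f.T)                       # symmetrize
```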

        ⑤The output \mathbf{Z}_{f}^{(l)} in the l-th layer with input G_f=\left ( \mathbf{A}_f,\mathbf{X} \right ) can be:

\mathbf{Z}_{f}^{(l)}=ReLU(\tilde{\mathbf{D}}_{f}^{-\frac{1}{2}}\tilde{\mathbf{A}}_{f}\tilde{\mathbf{D}}_{f}^{-\frac{1}{2}}\mathbf{Z}_{f}^{(l-1)}\mathbf{W}_{f}^{(l)})

and "as we all know", \mathbf{Z}_{f}^{(0)}=\mathbf{X}. Moreover, the final output is \mathbf{Z}_{F} which is regarded the feature graph

(2)Topology graph

        ①The same propagation as for the feature graph, but with \mathbf{A}_t=\mathbf{A}, i.e. the original input topology without kNN filtering; the final output is \mathbf{Z}_{T}

2.4.2. Common Convolution Module

        ①The information in the topology space and in the feature space is not completely independent. Accordingly, a Common-GCN is needed to extract the information the two spaces share

        ②They simply share the weights \mathbf{W}_{c}^{(l)} across the two graphs:

\mathbf{Z}_{ct}^{(l)}=ReLU(\tilde{\mathbf{D}}_{t}^{-\frac{1}{2}}\tilde{\mathbf{A}}_{t}\tilde{\mathbf{D}}_{t}^{-\frac{1}{2}}\mathbf{Z}_{ct}^{(l-1)}\mathbf{W}_{c}^{(l)})

\mathbf{Z}_{cf}^{(l)}=ReLU(\tilde{\mathbf{D}}_{f}^{-\frac{1}{2}}\tilde{\mathbf{A}}_{f}\tilde{\mathbf{D}}_{f}^{-\frac{1}{2}}\mathbf{Z}_{cf}^{(l-1)}\mathbf{W}_{c}^{(l)})

        ③Then combining the two outputs \mathbf{Z}_{CT} and \mathbf{Z}_{CF} to common embedding \mathbf{Z}_{C}:

\mathbf{Z}_{C}=(\mathbf{Z}_{CT}+\mathbf{Z}_{CF})/2
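Reusing the hypothetical `gcn_layer` helper sketched above (and assuming A, A_f, X and the shared weights Wc1, Wc2 are already defined), the weight sharing amounts to:

```python
# Common-GCN: one set of weights, applied to both graphs, outputs averaged
Z_ct = gcn_layer(A,   gcn_layer(A,   X, Wc1), Wc2)   # topology graph, shared weights
Z_cf = gcn_layer(A_f, gcn_layer(A_f, X, Wc1), Wc2)   # feature graph, same weights
Z_c  = (Z_ct + Z_cf) / 2                             # common embedding Z_C
```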

2.4.3. Attention Mechanism

        ①The attention value:

(\boldsymbol{\alpha}_{t},\boldsymbol{\alpha}_{c},\boldsymbol{\alpha}_{f})=att(\mathbf{Z}_{T},\mathbf{Z}_{C},\mathbf{Z}_{F})

where each \boldsymbol{\alpha} \in\mathbb{R}^{n \times 1}

        ②For the i-th row of \mathbf{Z}_{T}, i.e. the embedding \mathbf{z}_{T}^{i}\in \mathbb{R}^{1\times h} of node i, first apply a nonlinear transformation, then obtain the attention value via a shared attention vector \mathbf{q}\in\mathbb{R}^{h^{\prime}\times1}:

\omega_{T}^{i}=\mathbf{q}^{T}\cdot tanh(\mathbf{W}\cdot\left(\mathbf{z}_{T}^{i}\right)^{T}+\mathbf{b})

and likewise for \omega_{C}^{i} and \omega_{F}^{i}

        ③Normalization:

\alpha_T^i=softmax(\omega_T^i)=\frac{exp(\omega_T^i)}{exp(\omega_T^i)+exp(\omega_C^i)+exp(\omega_F^i)}

and likewise for \alpha_{C}^{i} and \alpha_{F}^{i}

        ④Denoting \boldsymbol{\alpha}_{T} = diag(\boldsymbol{\alpha}_{t}), \boldsymbol{\alpha}_{C} = diag(\boldsymbol{\alpha}_{c}), \boldsymbol{\alpha}_{F} = diag(\boldsymbol{\alpha}_{f}), the final embedding will be:

\mathbf{Z}=\boldsymbol{\alpha}_T\cdot\mathbf{Z}_T+\boldsymbol{\alpha}_C\cdot\mathbf{Z}_C+\boldsymbol{\alpha}_F\cdot\mathbf{Z}_F
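Steps ② to ④ condense into one sketch (numpy; `channel_attention` and the parameter names W, b, q are illustrative):

```python
import numpy as np

def channel_attention(Z_T, Z_C, Z_F, W, b, q):
    """Per-node attention over the three channel embeddings.

    Z_*: (n, h) embeddings; W: (h', h); b: (h',); q: (h', 1).
    """
    def score(Z):
        return np.tanh(Z @ W.T + b) @ q                # omega, shape (n, 1)
    w = np.concatenate([score(Z_T), score(Z_C), score(Z_F)], axis=1)  # (n, 3)
    e = np.exp(w - w.max(1, keepdims=True))
    a = e / e.sum(1, keepdims=True)                    # softmax over the 3 channels
    Z = a[:, :1] * Z_T + a[:, 1:2] * Z_C + a[:, 2:] * Z_F
    return Z, a                                        # a[:, 0] = alpha_t, etc.
```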

2.4.4. Objective Function

(1)Consistency Constraint

        ①First normalize \mathbf{Z}_{CT} and \mathbf{Z}_{CF} row-wise (L2 normalization) to obtain \mathbf{Z}_{CTnor} and \mathbf{Z}_{CFnor}

        ②The similarity matrix can be:

\mathbf{S}_{T}=\mathbf{Z}_{CTnor}\cdot\mathbf{Z}_{CTnor}^{T}

\mathbf{S}_{F}=\mathbf{Z}_{CFnor}\cdot\mathbf{Z}_{CFnor}^{T}

        ③The consistency constraint then pushes the two similarity matrices toward each other:

\mathcal{L}_{c}=\|\mathbf{S}_{T}-\mathbf{S}_{F}\|_{F}^{2}
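A sketch of the consistency constraint, assuming row-wise L2 normalization as in step ① (numpy; the function name is mine):

```python
import numpy as np

def consistency_loss(Z_ct, Z_cf):
    """L_c = || S_T - S_F ||_F^2 on row-normalized common embeddings."""
    Zt = Z_ct / (np.linalg.norm(Z_ct, axis=1, keepdims=True) + 1e-12)
    Zf = Z_cf / (np.linalg.norm(Z_cf, axis=1, keepdims=True) + 1e-12)
    S_t, S_f = Zt @ Zt.T, Zf @ Zf.T                    # n x n similarity matrices
    return ((S_t - S_f) ** 2).sum()                    # squared Frobenius norm
```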

(2)Disparity Constraint

        ①They apply Hilbert-Schmidt Independence Criterion (HSIC) to calculate the independent loss:

HSIC(\mathbf{Z}_T,\mathbf{Z}_{CT})=(n-1)^{-2}tr(\mathbf{R}\mathbf{K}_T\mathbf{R}\mathbf{K}_{CT})

where \mathbf{K}_T and \mathbf{K}_{CT} are Gram matrices with entries k_{T,ij}=k_{T}(\mathbf{z}_{T}^{i},\mathbf{z}_{T}^{j}) and k_{CT,ij}=k_{CT}(\mathbf{z}_{CT}^{i},\mathbf{z}_{CT}^{j});

\mathbf{R}=\mathbf{I}-\frac{1}{n}ee^{T}, where \mathbf{I} denotes the identity matrix and e denotes the all-ones column vector

        ②The same as:

HSIC(\mathbf{Z}_{F},\mathbf{Z}_{CF})=(n-1)^{-2}tr(\mathbf{R}\mathbf{K}_{F}\mathbf{R}\mathbf{K}_{CF})

        ③Combining two loss together:

\mathcal{L}_{d}=HSIC(\mathbf{Z}_{T},\mathbf{Z}_{CT})+HSIC(\mathbf{Z}_{F},\mathbf{Z}_{CF})
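A sketch of the disparity constraint (numpy). Using a linear (inner-product) kernel for the Gram matrices is my assumption for illustration; the formula itself only requires some kernel k(·,·):

```python
import numpy as np

def hsic(Za, Zb):
    """Empirical HSIC(Za, Zb) with linear kernels K = Z Z^T."""
    n = Za.shape[0]
    Ka, Kb = Za @ Za.T, Zb @ Zb.T                      # Gram matrices
    R = np.eye(n) - np.ones((n, n)) / n                # R = I - (1/n) e e^T
    return np.trace(R @ Ka @ R @ Kb) / (n - 1) ** 2

def disparity_loss(Z_t, Z_ct, Z_f, Z_cf):
    return hsic(Z_t, Z_ct) + hsic(Z_f, Z_cf)           # L_d
```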

(3)Optimization Objective

        ①The predictions are: 

\hat{\mathbf{Y}}=[\hat{y}_{ic}]\in\mathbb{R}^{n\times C}=softmax(\mathbf{W}\cdot\mathbf{Z}+\mathbf{b})

        ②Cross entropy loss between ground truth \mathbf{Y}_{l} and predicted \hat{\mathbf{Y}}:

\mathcal{L}_{t}=-\sum_{l\in L}\sum_{i=1}^{C}\mathbf{Y}_{li}\mathrm{ln}\hat{\mathbf{Y}}_{li}

        ③The total loss:

\mathcal{L}=\mathcal{L}_{t}+\gamma\mathcal{L}_{c}+\beta\mathcal{L}_{d}
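Putting the three terms together (a sketch reusing the hypothetical `consistency_loss` and `disparity_loss` helpers above; `labeled_idx` marks the labeled training nodes):

```python
import numpy as np

def total_loss(Y_hat, Y_onehot, labeled_idx, Z_t, Z_ct, Z_f, Z_cf, gamma, beta):
    ce = -(Y_onehot[labeled_idx] * np.log(Y_hat[labeled_idx] + 1e-12)).sum()  # L_t
    return (ce
            + gamma * consistency_loss(Z_ct, Z_cf)                            # L_c
            + beta * disparity_loss(Z_t, Z_ct, Z_f, Z_cf))                    # L_d
```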

2.5. Experiments

2.5.1. Experimental Setup

(1)Datasets

        ①The statistics of their datasets:

(2)Baselines

        ①Network embedding algorithms: DeepWalk and LINE

        ②Graph neural network: Chebyshev, GCN, kNN-GCN, GAT, DEMO-Net, and MixHop

(3)Parameters Setting

        ①Label rates for semi-supervised learning: 20, 40, or 60 labeled nodes per class

        ②Three 2-layer GCNs are trained with the same hidden-layer dimension (nhid1 \in \left \{ 512,768 \right \}) and the same output dimension (nhid2 \in \left \{ 32, 128,256 \right \})

        ③Learning rate: 0.0001-0.0005

        ④Optimizer: Adam

        ⑤Dropout rate: 0.5

        ⑥Weight decay: 5e-3 or 5e-4

        ⑦k of kNN: 2,3,...,10

        ⑧Searching the coefficient of \mathcal{L}_{c} in \left \{ 0.01,0.001,0.0001 \right \} and the coefficient of \mathcal{L}_{d} in \left \{ 1e-10,5e-9,1e-9,5e-8,1e-8 \right \}

        ⑨Runs: results averaged over 5 runs

        ⑩Metrics: ACC and F1

2.5.2. Node Classification

        ①Performance comparison:

where L/C denotes the number of labeled nodes per class

2.5.3. Analysis of Variants

        ①The ablation study:

AM-GCN-w/o: AM-GCN without \mathcal{L}_{c} and \mathcal{L}_{d}
AM-GCN-c: AM-GCN with \mathcal{L}_{c} only
AM-GCN-d: AM-GCN with \mathcal{L}_{d} only

        ②The results table:

2.5.4. Visualization

        ①They visualize the last-layer embeddings of AM-GCN/GCN/GAT/DeepWalk on BlogCatalog with t-SNE:
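Such a plot can be produced roughly as follows (assuming scikit-learn and matplotlib; `Z` and `labels` stand for the learned last-layer embeddings and the ground-truth classes):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

Z_2d = TSNE(n_components=2, init="pca", random_state=0).fit_transform(Z)
plt.scatter(Z_2d[:, 0], Z_2d[:, 1], c=labels, s=5, cmap="tab10")
plt.title("t-SNE of last-layer embeddings on BlogCatalog")
plt.show()
```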

2.5.5. Analysis of Attention Mechanism

(1)Analysis of attention distributions

        ①Attention distributions with 20 labeled nodes per class:

(2)Analysis of attention trends

        ①Attention trends (x-axis: epoch; y-axis: average attention value) with 20 labeled nodes per class

2.5.6. Parameter Study

(1)Analysis of consistency coefficient γ

        ①Varying test of γ:

(2)Analysis of disparity constraint coefficient β

        ①Varying test of β:

(3)Analysis of k-nearest neighbor graph k

        ①Varying test of k:

2.6. Related Work

        ①? I almost thought I had misread; why is Related Work placed here, after the experiments?

        ②Listing some related works one by one

2.7. Conclusion

        Fine

2.8. Supplement

2.8.1.  Experiments Settings

        =。= This subsection just describes the machine configuration

2.8.2.  Baselines and Datasets

        ①Code:

DeepWalk, LINE: https://github.com/thunlp/OpenNE
Chebyshev: https://github.com/tkipf/gcn
GCN (PyTorch): https://github.com/tkipf/pygcn
GAT (PyTorch): https://github.com/Diego999/pyGAT/
DEMO-Net: https://github.com/jwu4sml/DEMO-Net
MixHop: https://github.com/samihaija/mixhop

        ②Dataset:

Citeseer: https://github.com/tkipf/pygcn
UAI2010: http://linqs.umiacs.umd.edu/projects//projects/lbc/index.html
ACM: https://github.com/Jhy1993/HAN
BlogCatalog: https://github.com/mengzaiqiao/CAN
Flickr: https://github.com/mengzaiqiao/CAN
CoraFull: https://github.com/abojchevski/graph2gauss/

2.8.3.  Implementation Details

        It says essentially nothing

2.8.4. Additional Results

(1)Analysis of attention trends

        ①Attention trends:

        ②Varying test of γ:

        ③Varying test of β:

        ④Varying test of k:

(2)Parameters Study

3. Background Knowledge

3.1. Stochastic Blockmodel

Reference: 图生成模型之随机块模型(stochastic block model)学习笔记 - 知乎 (zhihu.com)

3.2. Hilbert-Schmidt Independence Criterion (HSIC) 

(1) Overview: the Hilbert-Schmidt Independence Criterion (HSIC) is a criterion for measuring the independence of two variables. Similar to mutual information, it assesses independence by measuring the discrepancy between the variables' distributions. HSIC is built on covariance and can be viewed as an extension of covariance that describes relationships between variables more broadly.

(2) Reference: HSIC简介:一个有意思的判断相关性的思路-CSDN博客

3.3. Gram matrix

(1) Overview: a Gram matrix is a matrix formed by the pairwise inner products of a set of vectors in an inner-product space. Specifically, for vectors [v1, v2, …, vn], the Gram matrix is a Hermitian matrix whose entries are given by Gij = 〈vi, vj〉. In image processing, for a given matrix A, the Gram matrix of its column vectors is A^T A, while that of its row vectors is A A^T.

(2) Reference: 格拉姆矩阵(Gram matrix)详细解读 - 知乎 (zhihu.com)
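A tiny numeric example of the definition above (numpy):

```python
import numpy as np

V = np.array([[1., 0.], [1., 1.], [0., 2.]])  # rows are v1, v2, v3
G_rows = V @ V.T   # G[i, j] = <v_i, v_j>  (the A A^T case)
G_cols = V.T @ V   # Gram matrix of the columns (the A^T A case)
print(G_rows)      # [[1. 1. 0.], [1. 2. 2.], [0. 2. 4.]]
```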

3.4.  t-SNE

Reference: t-SNE 原理及Python实例 - 知乎 (zhihu.com)

4. Reference List

Wang, X. et al. (2020) 'AM-GCN: Adaptive Multi-channel Graph Convolutional Networks', KDD 2020. doi: https://doi.org/10.48550/arXiv.2007.02265
