[Paper Close Reading] Hi-GCN: A hierarchical graph convolution network for graph embedding learning of brain network

Paper: Hi-GCN: A hierarchical graph convolution network for graph embedding learning of brain network and brain disorders prediction - ScienceDirect

Code: https://github.com/haojiang1/hi-GCN

Table of Contents

1. TL;DR

1.1. Takeaways

1.2. Paper framework diagram

2. Section-by-section close reading

2.1. Abstract

2.1.1. Purpose

2.1.2. Method

2.1.3. Results

2.1.4. Conclusion

2.2. Introduction 

2.3. Preliminaries

2.3.1. Problem setup

2.3.2. Functional connectivity network

2.3.3. Graph convolutional networks

2.4. Hierarchical GCN

2.4.1. The network architecture of hierarchical GCN

2.4.2. F-GCN

2.4.3. p-GCN

2.4.4. Training scheme

2.5. Experiment

2.5.1. Databases and preprocessing

2.5.2. Performance on hierarchical GCN

2.5.3. Performance on different construction in population network

2.5.4. The influence of the hyperparameters of Hi-GCN

2.5.5. Comparisons with prior works

2.5.6. Ablation study and discussion

2.6. Conclusion

3. Background knowledge

3.1. Graph Kernel

3.2. Gram Matrix

3.3. Ridge Classifier

4. Reference List


1. TL;DR

1.1. Takeaways

(1) The graph kernel they use is quite intriguing; worth a closer look later

1.2. Paper framework diagram

2. Section-by-section close reading

2.1. Abstract

2.1.1. Purpose

        Low-dimensional representations of brain connectivity networks are widely used for detecting and predicting diseases from brain structure.

2.1.2. Method

        They propose an end-to-end hierarchical GCN framework (hi-GCN) that takes both the topological structure of individual brain networks and the relationships between subjects into consideration.

2.1.3. Results

        ①They use the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset and the Autism Brain Imaging Data Exchange (ABIDE) dataset.

        ②They achieve an accuracy of 73.1% on ABIDE and 78.5% on ADNI.

        ③Their AUC is 82.3% on ABIDE and 86.5% on ADNI.

2.1.4. Conclusion

        ①They consider the correlation between each individual brain network and the global population network

        ②The authors find that the joint optimization strategy converges faster and more easily than the pre-trained or two-step supervised hi-GCN.

2.2. Introduction 

        ①Resting-state functional MRI (rs-fMRI) captures the brain's default state, and cortical connectivity changes in subjects with autism spectrum disorder (ASD) or other brain disorders such as Alzheimer's disease (AD)

        ②Appropriate feature extraction is essential; methods such as clustering coefficients, local clustering coefficients and deep convolutional neural networks are all useful. (The authors claim here that hand-crafted features are only mid- or low-level and inferior to the high-level features of DNNs. Hmm, what criterion defines these levels?)

        ③Network embedding (functional connectivity) only considers each individual, without modeling the relationships between subjects

        ④They build another graph that connects subjects, where nodes represent subjects with their features and edges represent associations (computed as pairwise similarities between nodes)

        ⑤They regard the main challenge as how to combine individual and population-level information, and accordingly design f-GCN for the local (individual) network and p-GCN for the population network

2.3. Preliminaries

2.3.1. Problem setup

        ①They build an undirected graph N_{i}=\left \{ R_{i},A_{i} \right \} for each subject, where R_{i}=\left \{ r^{1}_{i},...,r^{M}_{i} \right \} is the set of nodes, A_{i}\in \mathbb{R}^{M\times M} is the adjacency matrix, and M is the number of ROIs (they use 116)

        ②⭐The authors state that "the embedding R of each vertex is learned during GCN training, so R_i is set to 1". What does that mean? Isn't R_i a vector?

        ③Finally they obtain a group of graphs \left \{ N_1,...,N_D \right \}

        ④For the global population network, they define \widehat{N}=\left \{ \widehat{R},\widehat{A} \right \}, where \widehat{A} is the adjacency matrix of pairwise similarities between subjects

2.3.2. Functional connectivity network

        ①⭐They extract the time series and normalize them to zero mean and unit variance

        ②Calculating Pearson's correlation (PC) between ROIs:

Q\left(r_i,r_j\right)=\frac{Cov(v_i,v_j)}{\sigma_{v_i}\sigma_{v_j}}

where Cov(v_i,v_j) denotes the cross covariance between v_i and v_j;

\sigma_{v} represents the standard deviation of node v's time series.
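A minimal sketch of building the functional connectivity matrix from ROI time series (assuming a NumPy array ts of shape (timepoints, M), one column per ROI; np.corrcoef computes exactly the Pearson correlation above):

```python
import numpy as np

def functional_connectivity(ts):
    """ts: array of shape (n_timepoints, M), one ROI time series per column.
    Returns the M x M Pearson correlation matrix used as the FC network."""
    # normalize each ROI time series to zero mean and unit variance
    ts = (ts - ts.mean(axis=0)) / ts.std(axis=0)
    # Pearson correlation between every pair of ROIs
    return np.corrcoef(ts, rowvar=False)
```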

        ③Network construction:

2.3.3. Graph convolutional networks

        ①Standard convolutions cannot be directly generalized to irregular graphs, so they adopt a spectral approach on their graphs

        ②Each layer in GCN is:

E^{(l+1)}=\mathrm{ReLU}\left ( \widetilde{D}^{-\frac{1}{2}} \widehat{A} \widetilde{D}^{-\frac{1}{2}}E^{(l)}W^{(l)} \right )

where \widehat{A}=A+\mathbf{I}_{n} and \widetilde{D}_{ii}=\sum_{j}\widehat{A}_{ij};

W^{(l)} denotes the learnable weight matrix of the l-th layer;

E^{(l+1)} denotes the node embeddings produced by the (l+1)-th layer from the preceding layer's embeddings E^{(l)}.

        ③GCN can be seen as a Laplacian smoothing operator; each convolutional layer is followed by a ReLU activation and all layers share the same adjacency matrix

        ④To reduce computational complexity, they approximate the convolutional kernels with Chebyshev polynomials

        ⑤They restrict the convolution to the K-order neighborhood of each node
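A minimal sketch of the propagation rule in ② using plain NumPy (illustrative only, not the authors' implementation):

```python
import numpy as np

def gcn_layer(A, E, W):
    """One GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} E W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)                     # degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))    # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_norm @ E @ W, 0.0)    # ReLU activation
```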

2.4. Hierarchical GCN

2.4.1. The network architecture of hierarchical GCN

(1)The components of hi-GCN

        ①f-GCN: learns the representation of each individual subject

        ②Network similarity estimation in \widehat{N}: they use a graph kernel to calculate the network similarity between different subjects

        ③p-GCN: combines the f-GCN embeddings with the graph kernel similarities

        ④Backpropagation is used to train the whole framework

(2)The overall framework:

(3)Feature embedding

        ①Figure of the embedding process:

        ②The overall mapping:

\mathbf{hi}-\mathbf{GCN}:\boldsymbol{N}\to\widehat{\boldsymbol{e}}

which is composed of \mathbf{f}-\mathbf{GCN}\left(\boldsymbol{N}\right)=\boldsymbol{e} and \mathbf{p}-\mathbf{GCN}\left(\boldsymbol{e},\widehat{A}\right)=\left[\widehat{\boldsymbol{e}},\hat{y}\right]
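A schematic sketch of this two-level composition (f_gcn and p_gcn are hypothetical callables standing in for the authors' networks, and the graph-kernel adjacency \widehat{A} is assumed to be precomputed):

```python
import numpy as np

def hi_gcn_forward(subject_graphs, A_hat, f_gcn, p_gcn):
    """Two-level forward pass: f-GCN embeds each subject's brain graph
    (N_i -> e_i); p-GCN then propagates the stacked embeddings over the
    population graph A_hat and returns (e_hat, y_hat)."""
    # individual-level embeddings, one row per subject
    E = np.stack([f_gcn(R_i, A_i) for R_i, A_i in subject_graphs])
    # population-level embedding and prediction
    e_hat, y_hat = p_gcn(E, A_hat)
    return e_hat, y_hat
```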

2.4.2. F-GCN

        ①Coarsened graphs are obtained by average pooling after the convolutions

        ②Nodes are assigned to subnetworks by spectral clustering

        ③N_{coar} represents the coarsened graph; C\in \mathbb{R}^{M\times N} denotes the assignment matrix encoding which subnetwork each node belongs to, where M is the number of ROIs

        ④A_{coar} represents the adjacency matrix of coarsened subnetworks and A_{ext} represents the adjacency matrix of edges in subnetworks. And, 

A_{coar}=\sum_{h=1}^{H}C^{(h)}A_{ext}\left(C^{(h)}\right)^T

        ⑤To aggregate the nodes' features and obtain the representations of the supernodes, they apply:

E_c=\Theta_c^TE

where \Theta_c is the pooling operator, which consists of "all the c-th up-sampled eigenvectors from all the subnetworks";

E_c\in\mathbb{R}^{H\times P_l} denotes the pooling results;

P_l denotes the l-th embedding dimensionality.

        ⑥The authors only keep the first Z pooled results as the final output, i.e. E_{coar}=[E_0,...,E_Z]
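A minimal sketch of the graph coarsening step, assuming scikit-learn's SpectralClustering to assign ROIs to H subnetworks; the simple C^T A C aggregation below is a stand-in illustration, not the paper's exact A_{ext}-based construction or eigenvector pooling:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def coarsen_graph(A, n_subnetworks=7):
    """Assign ROIs to subnetworks by spectral clustering and build a
    coarsened adjacency between the resulting supernodes."""
    M = A.shape[0]
    labels = SpectralClustering(n_clusters=n_subnetworks,
                                affinity="precomputed").fit_predict(np.abs(A))
    # assignment matrix C: C[i, h] = 1 if ROI i belongs to subnetwork h
    C = np.zeros((M, n_subnetworks))
    C[np.arange(M), labels] = 1.0
    # simplified coarsened adjacency between subnetworks
    A_coar = C.T @ A @ C
    return C, A_coar
```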

2.4.3. p-GCN

        ①They evaluate topological similarities between functional connectivity networks with a graph kernel

        ②The graph kernel between networks N_i and N_j is defined as:

S(N_i,N_j)=\left \langle \varphi (N_i),\varphi (N_j) \right \rangle

where \varphi (\cdot ) denotes the feature mapping of the kernel

        ③They calculate the distance between instances with the q-th kernel function, denoted K_q(r^i_a,r^i_b), where r^i_a=\sum_{u=1}^{M}A_i(a,u). If an RBF kernel is adopted, the distance function becomes K_q(r^i_a,r^i_b)=\exp\left(-\frac{\left \| r^i_a-r^i_b \right \|}{2\sigma }\right), where \sigma is the kernel parameter.

        ④Moreover, they set a threshold T on the distance: K=1 when the distance is smaller than T, otherwise K=0

        ⑤Similarities between networks are:

S_I(N_i,N_j)=\frac{\sum_{a=1}^{M}\sum_{b=1}^{M}w^i_a w^j_b K(x^i_a,x^j_b)}{\sum_{a=1}^{M}w^i_a\sum_{b=1}^{M}w^j_b}

where w^i_a=\frac{1}{\sum_{u=1}^{M}K(x^i_a,x^i_u)}
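A minimal sketch of this network similarity S_I, assuming the weighted-degree node descriptor r^i_a from ③ and an RBF node kernel (the binarization in ④ is omitted here; this is an illustration, not the authors' exact implementation):

```python
import numpy as np

def network_similarity(A_i, A_j, sigma=1.0):
    """Node-kernel-based similarity S_I between two FC networks A_i and A_j."""
    r_i = A_i.sum(axis=1)   # node descriptor r^i_a = sum_u A_i(a, u)
    r_j = A_j.sum(axis=1)
    rbf = lambda a, b: np.exp(-np.abs(a[:, None] - b[None, :]) / (2.0 * sigma))
    K_ij = rbf(r_i, r_j)                    # node kernel across the two networks
    w_i = 1.0 / rbf(r_i, r_i).sum(axis=1)   # node weights w^i_a
    w_j = 1.0 / rbf(r_j, r_j).sum(axis=1)
    num = (w_i[:, None] * w_j[None, :] * K_ij).sum()
    return num / (w_i.sum() * w_j.sum())
```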

2.4.4. Training scheme

(1)They describe 3 training strategies for optimizing the framework

        ①Two-step training: f-GCN and p-GCN are trained with independent loss functions

        ②Joint training: f-GCN and p-GCN share the same loss function (see the sketch after this list):

        ③Pre-training: also two steps, but slightly more complicated:
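A plausible form of the shared objective in the joint training scheme, assuming a standard cross-entropy over the p-GCN output (the paper's exact loss terms may differ):

\mathcal{L}_{joint}=-\sum_{i=1}^{D}\sum_{c}y_{i,c}\log\hat{y}_{i,c},\qquad \hat{y}_{i}=\mathbf{p}\text{-}\mathbf{GCN}\left(\mathbf{f}\text{-}\mathbf{GCN}(N_i),\widehat{A}\right)

Because a single loss is used, gradients from the population-level prediction flow back through p-GCN into f-GCN, consistent with the faster convergence reported for the joint scheme.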

2.5. Experiment

2.5.1. Databases and preprocessing

        ①Task: binary classification on two large datasets, ABIDE and ADNI

(1)ABIDE I

        ①Samples: they select 866 subjects, of which 402 are ASD and 464 are healthy controls

        ②Reason: same as in Benchmarking functional connectome-based predictive models for resting-state fMRI - ScienceDirect

(2)ADNI

        ①Samples: 133 in total, 99 with Mild Cognitive Impairment (MCI) and 34 with Alzheimer's Disease (AD)

        ②Reason: see 2.5.1. (1) ②, the same article

2.5.2. Performance on hierarchical GCN

(1)Parameter setting

        ①Common parameter settings:

        ②Hyper-parameter settings:

T\in\left \{ 0.3,0.45,0.6,0.47,0.9 \right \};

\gamma \in\left \{ 1,2,3,4,5 \right \};

H\in\left [ 0,1 \right ]

        ③Validation: 10-fold cross-validation for evaluating hi-GCN, 10-fold nested cross-validation for selecting the best parameters, and nested 5-fold CV for tuning the hyper-parameters

        ④They apply Student's t-test (significance level = 0.05) to test the difference between hi-GCN and the other models

(2)Comparison 

        ①They compare hi-GCN with a connectivity-feature-based method, Eigenpooling GCN and Population GCN

        ②A ridge classifier can select network-based features, but these are low-level features

        ③Eigenpooling GCN is "an end-to-end trainable GCN with a pooling operator EigenPooling"

        ④Node features in Population GCN are learned automatically, and its edges encode similarities between nodes

        ⑤BrainNetCNN combines edge-to-edge, edge-to-node and node-to-node filters

        ⑥Additionally, they compare hi-GCN with 4 SOTA models: the topology-based Clustering Coefficient (CC) and t-BNE, and the subgraph-based Graph Boosting and Ordinal Pattern, each of which they then introduce

        ⑦Performance on the ABIDE I dataset (mean over 10 experiments):

        ⑧Performance on the ADNI dataset:

        ⑨From these two sets of results, the authors indicate that subgraph-based methods are better than topology-based methods to some extent.

        ⑩They compare the convergence speed of the 3 training strategies:

2.5.3. Performance on different construction in population network

(1)The effect of the similarity estimation scheme with auxiliary information

        ①They add non-imaging data: gender and acquisition site for ABIDE, age and gender for ADNI

        ②The similarity of the non-imaging data is calculated by:

S_{NI}\left(M_i,M_j\right)=\begin{cases}1 & if\left|M_i-M_j\right|<\overline{T}\\0 & otherwise\end{cases}

where \overline{T} is a threshold value, set to 2 in their model

        ③The integrated similarity degree will be:

S=\alpha S_I+(1-\alpha )S_{NI}

where \alpha controls the weighting between the two similarities. In their model, they simply set \alpha to 0.5.
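A minimal sketch of this fused similarity for a single pair of subjects (s_image is assumed to come from the graph kernel S_I above; the threshold \overline{T}=2 and \alpha=0.5 follow the values quoted here):

```python
def non_image_similarity(m_i, m_j, t_bar=2.0):
    """S_NI: 1 if the non-imaging measures (e.g. age) are close, else 0."""
    return 1.0 if abs(m_i - m_j) < t_bar else 0.0

def combined_similarity(s_image, m_i, m_j, alpha=0.5, t_bar=2.0):
    """Fused similarity S = alpha * S_I + (1 - alpha) * S_NI."""
    return alpha * s_image + (1 - alpha) * non_image_similarity(m_i, m_j, t_bar)
```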

(2)The effect of different similarity estimation schemes

        ①They run another experiment that adds non-imaging information:

In this table, imaging and non-imaging data together form multiple modalities. Performance with multiple modalities is better than with a single modality.

        ②The graph similarity metric is learned by a siamese graph convolutional neural network (s-GCN):

K_{pdist}\left(\boldsymbol{u},\boldsymbol{v}\right)=\frac{\left(\boldsymbol{u}-\overline{\boldsymbol{u}}\right)\cdot\left(\boldsymbol{v}-\overline{\boldsymbol{v}}\right)}{\left\|\boldsymbol{u}-\overline{\boldsymbol{u}}\right\|_2\left\|\boldsymbol{v}-\overline{\boldsymbol{v}}\right\|_2}
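This K_{pdist} is a centered cosine (Pearson-style) similarity between two embedding vectors; a minimal sketch:

```python
import numpy as np

def pdist_kernel(u, v):
    """Centered cosine similarity between two embedding vectors u and v."""
    u_c, v_c = u - u.mean(), v - v.mean()
    return float(u_c @ v_c / (np.linalg.norm(u_c) * np.linalg.norm(v_c)))
```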

        ③Comparison of network similarity:

2.5.4. The influence of the hyperparameters of Hi-GCN

        ①The number of clusters (H), the threshold (T) and the kernel parameter (\gamma) are three important hyperparameters

        ②In their experiments, they find that H=7 is the best cluster setting, whereas no single best T and \gamma stand out.

        ③They find that increasing T makes the network sparser

        ④Moreover, they observe that T has more influence on performance than \gamma

        ⑤Performance on different hyperparameter values:

2.5.5. Comparisons with prior works

        ①Traditional machine learning separates feature extraction from model learning

        ②Comparison of different classifiers in ABIDE:

        ③Comparison of different classifiers in ADNI:

2.5.6. Ablation study and discussion

(1)Varying the threshold of the population network

        ①Population networks for the ABIDE and ADNI datasets:

        ②The threshold selection table:

(2)Exchanging the strategies for subjects' initial features and similarity estimation

        They provide a comparison of three feature embedding methods (the first attribute is the initial feature and the second is the similarity estimation):

(3)Fusing the embedding with the graph properties as node features

        The authors find that when Ordinal Pattern or t-BNE features are added, the embedding can learn a more appropriate feature representation:

(4)Evaluating the embedding with the various traditional classifiers

        They compare SVM, Random Forest and Logistic Regression classifiers on the embeddings of f-GCN and hi-GCN respectively:

(5)Evaluating the hierarchical embedding learning with various GCN models

        They compare different models:

2.6. Conclusion

        The authors conclude that the functional connectivity matrix derived from fMRI is effective and has broad prospects for disease classification

3. Background knowledge

3.1. Graph Kernel

(1) Definition: a graph kernel is an effective way to measure the similarity of graph structures, and can be used to compare different kinds of graphs (weighted, directed, undirected, labeled, etc.). Graph kernels build on kernel methods from graph theory: a graph is recursively decomposed into atomic substructures, all pairs of these substructures are compared, and each graph is thereby represented as a vector. The similarity of two graphs is then measured by the inner product of their vectors.

(2) Three types of atomic substructures are commonly defined in graph kernels:

        ①Graphlets: the original graph is decomposed into a set of non-isomorphic subgraphs (graphlets) of size k, each represented as a set of nodes; the similarity of two graphs is measured by comparing their collections of graphlets.

        ②Subtree patterns: the graph is decomposed into a series of subgraphs of different sizes, each represented as a tree structure; the similarity of two graphs is measured by comparing these collections.

        ③Random walks: a random walk algorithm generates a series of paths in the graph, each represented as a vector; the similarity of two graphs is measured by comparing these collections of vectors.

(3) Methods and applications: depending on the task and the data, different graph kernel methods can be chosen, e.g. kernels based on spectral graph theory, on random walks, or on neural networks. Graph kernels are widely used in computer vision, natural language processing and bioinformatics, for tasks such as measuring the similarity of data points, classification and clustering.

(4) Since I could not find a good summary blog or encyclopedia entry, all of the above comes from ERNIE Bot (文心一言); use with caution

(5) Notes on Weisfeiler-Lehman graph kernels: 《Weisfeiler-Lehman Graph Kernels》论文阅读 - 知乎 (zhihu.com)

(6) Graph kernel survey: [1903.11835] A Survey on Graph Kernels (arxiv.org)
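A toy illustration of the decompose-count-inner-product idea behind graph kernels (purely hypothetical graphlet counts, not any specific kernel from the literature):

```python
import numpy as np

# hypothetical counts of size-3 substructures (triangle, path, independent set)
phi_g1 = np.array([4, 10, 2])   # "feature vector" of graph 1
phi_g2 = np.array([3, 12, 1])   # "feature vector" of graph 2

# the kernel value is the inner product of the two count vectors
k_value = float(phi_g1 @ phi_g2)   # 4*3 + 10*12 + 2*1 = 134
print(k_value)
```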

3.2. Gram Matrix
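Given samples x_1, ..., x_n and a kernel k with feature map \varphi, the Gram matrix is the n\times n matrix G with G_{ij}=k(x_i,x_j)=\left \langle \varphi (x_i),\varphi (x_j) \right \rangle; it is symmetric and positive semi-definite. In this paper, the subject-by-subject similarity matrix built from the graph kernel S(N_i,N_j) is exactly such a Gram matrix over the D subjects.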

3.3. Ridge Classifier

RidgeClassifier is a machine learning model from sklearn.linear_model that builds a classifier on top of the Ridge regression model. The classifier treats the problem as a regression task and fits the classification model with a least-squares loss. Unlike regularized Logistic Regression, the Ridge classifier's loss function is a squared error plus an l2 penalty. For binary classification, predictions greater than 0 are taken as the positive class and predictions less than 0 as the negative class. For multi-class problems, a One-vs-Rest model makes a prediction for each class and the results are combined into a multi-class prediction. In practice, the penalized least-squares loss used by the Ridge classifier allows a choice among numerical solvers with different computational performance profiles, and it is much faster than Logistic Regression because it only needs to compute the projection matrix once.
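A minimal usage sketch with scikit-learn (synthetic features and labels, for illustration only):

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))            # e.g. flattened connectivity features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary labels

clf = RidgeClassifier(alpha=1.0)          # l2-penalized least-squares classifier
clf.fit(X, y)
print(clf.score(X, y))                    # mean accuracy on the training data
```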

4. Reference List

Jiang, H. et al. (2020) 'Hi-GCN: A hierarchical graph convolution network for graph embedding learning of brain network and brain disorders prediction', Computers in Biology and Medicine, 127.
