[Paper Notes] GATE: Graph CCA for Temporal Self-Supervised Learning for Label-Efficient fMRI Analysis

Paper: GATE: Graph CCA for Temporal Self-Supervised Learning for Label-Efficient fMRI Analysis (IEEE Xplore)

Code: https://github.com/LarryUESTC/GATE

The English was typed entirely by hand! It is my summarizing and paraphrasing of the original paper. Unavoidable spelling and grammar mistakes may appear; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with caution!

Contents

1. TL;DR

1.1. Thoughts

1.2. Framework Figure

2. Section-by-Section Reading

2.1. Abstract

2.2. Introduction

2.3. Related Work

2.3.1. Disease Prediction on fMRI Data

2.3.2. GCNs for Disease Prediction on fMRI Data

2.3.3. Self-Supervised Learning

2.4. Method

2.4.1. Multi-View fMRI Dynamic Functional Connectivity Generation

2.4.2. Graph Embedding

2.4.3. Objective Function

2.5. Theoretical Motivation and Analysis on CCA Loss

2.6. Experiments

2.6.1. Experimental Setup

2.6.2. Results and Analysis

2.6.3. Ablation Study

2.7. Discussion

2.7.1. The Needs of Label Efficiency for fMRI

2.7.2. Graph Learning for Neuroimaging

2.7.3. Technical Contributions of Our Work

2.8. Conclusion

3. Supplementary Knowledge

3.1. Transductive learning

3.2. AdamW optimizer

3.3. Unknown

4. Reference List


1. TL;DR

1.1. Thoughts

(1) The authors say it is a population graph, so why does the experiments part (Section V) say each node is an ROI?

(2) Why is every paper involving encoders and decoders all math? Hardly any model design, just a big math show.

(3) Why couldn't I find the FTD dataset?

1.2. Framework Figure

2. Section-by-Section Reading

2.1. Abstract

        ①They designed a self-supervised learning (SSL) framework to optimize GCNs, called Graph CCA for Temporal sElf-supervised learning on fMRI analysis (GATE)

        ②Traditional models always rely on plenty of labeled data and may be misled by mislabeled samples

        ③Their training is based on fMRI dynamic functional connectivity (FC)

        ④They first pre-train with SSL on an unlabeled fMRI population graph, then fine-tune the results

spurious  adj. false; fake; fallacious; built on mistaken ideas or reasoning

2.2. Introduction

        ①The sliding-window method is widely used to capture dynamic FC

        ②Previous works rely on time-consuming labeling

        ③Contrastive-based SSL, reconstruction-based SSL, and similarity-based SSL are the three main SSL strategy categories. Similarity-based SSL is chosen for their approach.

        ④Challenge 1 for similarity-based SSL: the data augmentations. The augmented data need a low coupling between labels and spurious features.

        ⑤Challenge 2: designing the corresponding consistency loss function, which needs to maximize the consistency of correlated signals.

        ⑥The authors first augment the fMRI data and generate two views from the BOLD signals, building population graphs in which each node denotes a subject; SSL is used to capture information without labels. A GCN encoder is then adopted to obtain the embedding matrices of the two views. Finally, a Canonical Correlation Analysis (CCA) based objective is applied

        ⑦Contributions: 1) high label efficiency, 2) tackling spurious correlations in dynamic FC via a self-designed GCN-based CCA regularization, 3) theoretical discussion, 4) ablation studies

2.3. Related Work

2.3.1. Disease Prediction on fMRI Data

(1)Medical imaging approach examples:

        ①Magnetic Resonance Imaging (MRI)

        ② Computed Tomography (CT)

        ③Positron Emission Tomography (PET)

        ④X-ray

(2)Structural MRI and functional MRI

        ①sMRI: nodes are anatomical brain regions, and edges are the anatomical connections (topology) between them

        ②fMRI: nodes are functional regions of the brain, edges are correlations between nodes. Additionally, fMRI captures dynamic changes over short time scales

2.3.2. GCNs for Disease Prediction on fMRI Data

(1)Population graph-based models

        ①Classification based on population

        ②Nodes are subjects and edges are similarities between subjects

(2)Brain region graph based models

        ①Classification based on brain region

        ②Nodes are brain regions and edges are functional or structural connections among brain regions

2.3.3. Self-Supervised Learning

        ①Contrastive-based SSL: increases the similarity between local and global representations by constructing positive and negative sample pairs. It mostly relies on negative samples, so it is not suitable when the number of samples or classes is small.

        ②Reconstruction-based SSL: encodes the input into low-dimensional features and reconstructs the high-dimensional input from them

        ③Similarity-based SSL: benefits from the coupling between multiple views of the same data.

2.4. Method

        ①Key components of GATE: 1) Dynamic FC augmentation, 2) GCN encoder, 3) Objective function 

        ②Training procedure: 1) unsupervised pre-training, 2) fine-tuning of the pre-trained model with labels

        ③The whole framework

2.4.1. Multi-View fMRI Dynamic Functional Connectivity Generation

        The main characteristics are kept, while the views may vary in their spurious features, on which the predictions should not depend.

(1)Dynamic Functional Connectivity:

        ①Sliding window method is used for capturing temporal information

        ②BOLD signals S_{i}\in \mathbb{R}^{R\times T}, where R denotes the number of brain Regions-Of-Interest (ROIs) in the fMRI of the i-th subject and T denotes the length of the segment

        ③FC matrix F_{i}\in \mathbb{R}^{R\times R} is calculated by Pearson’s correlation between the matched BOLD segments of the paired ROIs

        ④Then they flatten the upper triangle of the matrix into a vector x_{i}

        ⑤Population graph G=\left \{ X,A \right \}, where X denotes the node features of the individuals (each node feature comes from the subject's flattened FC matrix) and A denotes the similarities between subjects

        ⑥Size of sliding window: L

        ⑦The step of the sliding window: s
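Below is a minimal numpy sketch of this windowing-plus-FC pipeline. The helper name sliding_window_fc and the toy data are my own illustration, not the authors' code; L=30 and s=15 follow the implementation details in Section 2.6.1.

```python
import numpy as np

def sliding_window_fc(S, L=30, s=15):
    """Split BOLD signals S (R x T) into M overlapping windows and compute
    one FC matrix per window via Pearson's correlation (a sketch)."""
    R, T = S.shape
    M = (T - L) // s + 1                   # number of sub-segments
    fc_list, feat_list = [], []
    for m in range(M):
        seg = S[:, m * s : m * s + L]      # R x L BOLD sub-segment
        F = np.corrcoef(seg)               # R x R Pearson FC matrix
        iu = np.triu_indices(R)            # upper triangle, incl. diagonal
        fc_list.append(F)
        feat_list.append(F[iu])            # flattened node feature x_i
    return fc_list, feat_list

# Toy usage: 122 ROIs (BASC-122 atlas) and 150 time points.
S = np.random.randn(122, 150)
fcs, feats = sliding_window_fc(S)
print(len(fcs), feats[0].shape)            # M windows; feature dim 7503
```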

(2)Step Window Augmentation (S-A):

        ①There are M sub-segments, where M=\left \lfloor \frac{T-L}{s} \right \rfloor+1; hence \left \{ S_{i}^{1},...,S_{i}^{M} \right \} is the set of sub-segments for one subject.

        ②S-A randomly selects G^a=\{\mathbf{X}^m,\mathbf{A}^m\} as the first view and takes a neighboring window G^b=\{\mathbf{X}^{m\pm1},\mathbf{A}^{m\pm1}\} as the other view (see the sketch after the M-A description below)

(3)Multi-Scale Window Augmentation (M-A):

        ①Choose two different window sizes: l_{a},l_{b}

        ②Then get two views: G^a=\{\mathbf{X}^{m,l_a},\mathbf{A}^{m,l_a}\} and G^b=\{\mathbf{X}^{m,l_{b}},\mathbf{A}^{m,l_{b}}\}
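A hedged sketch of both augmentations, assuming the per-window segments or graphs have already been built as above; the function names and the boundary handling at the first and last windows are my assumptions.

```python
import random

def step_window_views(items):
    """S-A: pick a random window index m, then a neighbor m±1 as the
    second view. `items` is the list of per-window segments/graphs."""
    M = len(items)
    m = random.randrange(M)
    if m == 0:
        nb = 1                             # no left neighbor at the start
    elif m == M - 1:
        nb = M - 2                         # no right neighbor at the end
    else:
        nb = m + random.choice([-1, 1])
    return items[m], items[nb]

def multi_scale_views(S, start, sizes=(10, 20, 30, 40, 50)):
    """M-A: same start position, two different window lengths l_a != l_b
    drawn from the candidate set listed in Section 2.6.1."""
    la, lb = random.sample(sizes, 2)
    return S[:, start:start + la], S[:, start:start + lb]
```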

2.4.2. Graph Embedding

        ①They adopt a GCN as their encoder; the function of the l-th layer:

\mathbf{H}^{(l+1)}=\sigma\left(\mathbf{D}^{-\frac12}\mathbf{A}\mathbf{D}^{-\frac12}\mathbf{H}^{(l)}\boldsymbol{\Theta}^{(l)}\right)

where D denotes the diagonal degree matrix of A, {\Theta}^{(l)} is the trainable weight matrix, and \mathbf{H}^{(l)} is the feature matrix of all subjects at layer l

        ②Normalized embeddings of the two views: \mathbf{Z}^a=f(\mathbf{X}^a,\mathbf{A}^a) and \mathbf{Z}^b=f(\mathbf{X}^b,\mathbf{A}^b)
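A minimal PyTorch sketch of this propagation rule (my own implementation of the standard GCN layer, not the authors' code; the ELU activation follows Section 2.4.3).

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN layer: H^(l+1) = sigma(D^-1/2 A D^-1/2 H^(l) Theta^(l))."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim, bias=False)  # Theta^(l)
        self.act = nn.ELU()                # the paper's psi(.) activation

    def forward(self, H, A):
        deg = A.sum(dim=1)                              # node degrees of A
        d_inv_sqrt = deg.clamp(min=1e-12).pow(-0.5)     # D^-1/2
        A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
        return self.act(A_hat @ self.theta(H))
```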

2.4.3. Objective Function

        ①Reconstruction-based SSL may overfit to scattered noise

        ②They use GATE, which ignores negative samples and avoids reconstruction

        ③Maximize the correlation between the paired embedding matrices

        ④Input-consistency regularization loss:

\mathcal{L}=-\frac1N\sum_{i=1}^N\frac{\left\langle\mathbf{z}_i^a,\mathbf{z}_i^b\right\rangle}{\|\mathbf{z}_i^a\|\left\|\mathbf{z}_i^b\right\|}+\gamma\sum_{v=a,b}\|(\mathbf{Z}^v)^\top\mathbf{Z}^v-\mathbf{I}\|_F^2

where \left \langle \cdot ,\cdot \right \rangle is the dot product operator, \gamma denotes the trade-off coefficient, v denotes one of the views (a or b), and Z^{v} denotes the embedding matrix of view v. The first part of this function is an invariance term that keeps the paired low-dimensional features consistent across the two views, and the second part is a decorrelation term that ensures the irrelevance of each embedding dimension (a PyTorch sketch of this loss is given at the end of this subsection).

        ⑤A is replaced with the identity matrix I in order to fine-tune

        ⑥Activation: ELU, denoted \psi(\cdot)
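As promised above, here is a PyTorch sketch of the objective; gamma=0.2 follows the implementation details in Section 2.6.1, and the function name is mine.

```python
import torch
import torch.nn.functional as F

def gate_cca_loss(Za, Zb, gamma=0.2):
    """Input-consistency regularization loss of Section 2.4.3 (a sketch).
    Term 1: maximize per-subject cosine similarity across the two views.
    Term 2: push each view's Gram matrix Z^T Z toward the identity,
    decorrelating the embedding dimensions."""
    invariance = -F.cosine_similarity(Za, Zb, dim=1).mean()
    eye = torch.eye(Za.shape[1], device=Za.device)
    decorrelation = sum(torch.norm(Z.T @ Z - eye, p="fro") ** 2
                        for Z in (Za, Zb))
    return invariance + gamma * decorrelation
```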

2.5. Theoretical Motivation and Analysis on CCA Loss

        ①Adding the input-consistency regularization decreases the correlation between spurious features and the true label, and increases performance

        ②The CCA function:

\begin{aligned}&\max_f\mathcal{L}_{\mathrm{CCA}}:=\mathbb{E}_{G^a,G^b}[f(G^a)^\top f(G^b)]\\&\text{s.t. }\Sigma_{f(G^a),f(G^a)}=\Sigma_{f(G^b),f(G^b)}=\mathbf{I}\end{aligned}

where f is a normalized non-linear embedding and \Sigma is the covariance matrix (\Sigma_{f,f}=\mathbb{E}_{G^a}[f(G^a)f(G^a)^\top])

        ③Connection between CCA and generalization error of downstream tasks:

\begin{aligned}\mathcal{L}_{U,V,S}:=\max_{\left\|h\right\|_{L^2(G^b)}=1}\left\|\mathcal{T}h-\mathcal{T}_kh\right\|_{L^2(G^a)}\\\mathcal{T}_kh:=\sum_{i=1}^k\sigma_i\left<v_i,h\right>_{L^2(G^b)}u_i=f^\top\mathbf{w}\end{aligned}

where \mathcal{T} denotes the representation operator and \mathcal{T}_k its rank-k (low-rank) approximation; the Singular Value Decomposition (SVD) of \mathcal{T} gives U=[u_1,\ldots,u_k] and V=[v_1,\ldots,v_k], and \mathbf{w} has entries w_i=\langle v_i,h\rangle_{L^2(G^b)}

        ④General theorem for non-linear CCA, which presents the approximation error: 

\begin{aligned}e_{apx}(f)&:=\min_\mathbf{W}{\mathbb{E}_{G^a}[\|h^*(G^a)-\mathbf{W}^\top f(G^a)\|^2]}\\&\leq2\mathbb{E}_y[\|h^*-\mathcal{R}w_{b,y}\|^2+\|(\mathcal{R}-\mathcal{T}_k)w_{b,y}\|^2]\end{aligned}

where h^{\ast} is the optimal function that can predict Y, and \sigma_i:=\mathbb{E}_{G^a,G^b}[f_i(G^a)f_i(G^b)]

        ⑤Denote:

(\mathcal{T}_k\circ h_y)(g_a):=\sum_{i=1}^k\sigma_i\mathbb{E}[f_i(G^b)h_y(G^b)]f_i(g_a)

(\mathcal{R}\circ h_y)(g_a):=\mathbb{E}_Y[\mathbb{E}_{G^b}[h_y(G^b)|Y]|G^a=g_a]

\mathbb{E}[h_y(G^b)|Y=y]=\mathbf{1}(Y=y)

        ⑥Upper bound of excess risk of downstream task:

(I honestly cannot get through a single word of this math, help. And the formatting is genuinely strange.)

2.6. Experiments

2.6.1. Experimental Setup

(1)ABIDE dataset:

        ①Datasets: Autism brain imaging data exchange (ABIDE) I/II

        ②Task: healthy controls (HC) vs. autism patients (classification task 1)

        ③Samples: 485 ASD and 544 HCs in ABIDE

        ④Atlas: Bootstrap Analysis of Stable Clusters parcellation with 122 ROIs (BASC-122)

        ⑤Node: ROI

        ⑥Edge: Pearson’s correlation between the time series of BOLD signals of their ROIs

        ⑦Feature dimension: 7503 (the flattened upper triangle of the 122×122 FC matrix, diagonal included: 122×123/2 = 7503)

(2)FTD

        ①Dataset: Frontotemporal dementia (FTD) 

        ②Task: HC vs. dementia (classification task 2)

        ③Samples: 86 HC and 95 FTD patients in the FTD dataset

        ④Pre-process: DPARSF

        ⑤Number of ROI: 116

(3)Graph Construction:

        ①Construct a similarity graph S\in \mathbb{R}^{n\times n} from low-dimensional, discriminative features extracted from the raw images, where n denotes the number of nodes in the population graph. This mitigates the influence of noise, redundant features, and the curse of dimensionality brought by high-dimensional features.

        ②Then construct the phenotypic graph matrix \tilde{S} from gender, age, gene information, etc.

        ③Get the initial graph A=S\odot \tilde{S} (Hadamard product)

        ④Only keep the top-k edges of each node

        ⑤Add the identity matrix I to A: A\leftarrow I+A (a sketch of this construction follows below)
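A numpy sketch of this construction under stated assumptions: the symmetrization step and k=10 are my choices; items ③-⑤ above fix the rest.

```python
import numpy as np

def build_population_graph(S_sim, S_pheno, k=10):
    """Initial population graph: A = S ⊙ S~ (Hadamard product), keep each
    node's top-k edges, then add self-loops (A <- I + A)."""
    A = S_sim * S_pheno                   # Hadamard product of the two graphs
    n = A.shape[0]
    keep = np.zeros_like(A)
    for i in range(n):
        topk = np.argsort(A[i])[-k:]      # indices of the k largest entries
        keep[i, topk] = A[i, topk]
    A = np.maximum(keep, keep.T)          # symmetrize (my assumption)
    return A + np.eye(n)                  # add the identity matrix I
```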

(4)Comparison Methods (adopt the same window):

        ①Methods without SSL: vanilla GCN, GAT, SAC-GCN

        ②Contrastive-based SSL: DGI, MVGRL

        ③Similarity-based SSL: BGRL, CCA-SSG

(5) Implementation Details:

        ①Optimizer: AdamW

        ②Learning rate: 0.001

        ③\gamma =0.2

        ④In S-A, L is 30 and s is 15

        ⑤In M-A, l_{a} and l_{b} are randomly chosen from \left \{ 10,20,30,40,50 \right \}

        ⑥Labeled data: 20% (206 subjects in ABIDE, 36 in FTD)

        ⑦Validation: 5-fold cross-validation
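Putting the pieces together, a hedged two-stage training sketch that reuses the GCNLayer and gate_cca_loss sketches above. The toy data, epoch counts, and the linear head are my assumptions; lr=0.001, AdamW, \gamma=0.2, 256-dimensional embeddings, and the 20% label ratio follow this section.

```python
import torch

# Toy stand-ins for the real population graph (names hypothetical).
N, D, H = 200, 7503, 256
Xa, Xb = torch.randn(N, D), torch.randn(N, D)      # two augmented views
Aa = Ab = torch.eye(N)                             # placeholder adjacency
y = torch.randint(0, 2, (N,))                      # HC vs. patient labels
labeled_idx = torch.randperm(N)[: int(0.2 * N)]    # 20% labeled subset

encoder = GCNLayer(D, H)                           # from the earlier sketch
opt = torch.optim.AdamW(encoder.parameters(), lr=0.001)

# Stage 1: self-supervised pre-training with the CCA loss.
for _ in range(200):                               # epoch count is a guess
    loss = gate_cca_loss(encoder(Xa, Aa), encoder(Xb, Ab), gamma=0.2)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: fine-tune the encoder plus a linear head on the labeled subset.
clf = torch.nn.Linear(H, 2)
opt2 = torch.optim.AdamW([*encoder.parameters(), *clf.parameters()], lr=0.001)
for _ in range(100):
    logits = clf(encoder(Xa, Aa))[labeled_idx]     # Xa/Aa stand in for the full graph
    loss = torch.nn.functional.cross_entropy(logits, y[labeled_idx])
    opt2.zero_grad(); loss.backward(); opt2.step()
```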

(6)Performance Evaluation:

        ①Evaluation metrics: accuracy, area under the ROC curve (AUC), precision, recall, F1 score

        ②The higher these metrics, the better the performance

2.6.2. Results and Analysis

        ①Comparison table:

        ②Then they vary the proportion of labeled data from 10% to 80%:

2.6.3. Ablation Study

(1)Effectiveness of Dynamic FC Augmentation:

        M-A and S-A significantly enhance the performance:

(2)Effectiveness of Different SSL Strategies:

        They compare contrastive-based SSL (CL), reconstruction-based SSL (Re), and their model, changing the objective function to an InfoNCE loss with randomly selected negative samples for CL, and to an MSE loss with an extra decoder for Re:

(3)Different Dimensional Embedding:

        Choosing the embedding dimension from {16, 32, 64, 128, 256, 512, 1024}, performance roughly peaks at 256 for ABIDE and at 128 for FTD:

Hence they chose 256 in all the experiments, since low dimensionality lacks representation ability while high dimensionality consumes more computation time

(4)Effectiveness of γ in the Objective Function:

        γ in the objective function yields stable performance across the range 0.1-0.8:

(5)Effectiveness of Fine-Tuning and Graph:

        GATE without fine-tuning, or without the graph (replacing the original A with I) in SSL:

Fine-tuning helps exploit the correctly labeled data, and the graph structure provides common biomarkers

(6)Low-Rank Representation:

        With low-rank representations that provide common biomarkers, GATE is able to reduce the upper bound of the excess risk for downstream tasks. Here is the comparison of GATE and vanilla GCN:

(7)Parameter Sensitivity Analysis:

        They investigate whether GATE is sensitive to sliding-window parameters such as the window length, the step size, and the gap between multiple windows:

2.7. Discussion

2.7.1. The Needs of Label Efficiency for fMRI

        Indeed, the more labeled data, the higher the accuracy. However, obtaining plenty of labeled images is obviously a big challenge. The designed GATE achieves excellent performance with only 20% of the labels: its accuracy is similar to that of a vanilla GCN trained with 50%.

2.7.2. Graph Learning for Neuroimaging

        GATE better extracts the associations between subjects and then maximizes the correlation between views.

2.7.3. Technical Contributions of Our Work

        ①The SSL strategy produces multiple coupled views of an fMRI BOLD signal

        ②The pre-training and fine-tuning scheme

2.8. Conclusion

        GATE, applied to population graphs, achieves high accuracy with small amounts of labeled data and in noisy environments

3. Supplementary Knowledge

3.1. Transductive learning

Related link: 转导学习 transductive learning_TBYourHero的博客-CSDN博客

3.2. AdamW optimizer

(1) AdamW adds decoupled weight decay regularization on top of the Adam optimizer, which amounts to decaying the original weights

(2) Related link: 【优化器】(六) AdamW原理 & pytorch代码解析_Lcm_Tech的博客-CSDN博客

3.3. Unknown

(1)corruption function

(2)feature collapse

4. Reference List

Peng, L. et al. (2022) 'GATE: Graph CCA for Temporal Self-Supervised Learning for Label-Efficient fMRI Analysis', IEEE Transactions on Medical Imaging, vol. 42, no. 2, pp. 391-402. doi: 10.1109/TMI.2022.3201974
