[Paper Close Reading] Graph Attention Networks

Paper: [1710.10903] Graph Attention Networks (arxiv.org)

Code: https://github.com/PetarV-/GAT

The English is typed entirely by hand and is a summarizing and paraphrasing of the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments! This post is more of a set of notes, so read with that in mind!

Table of Contents

1. TL;DR

1.1. Thoughts

1.2. Paper framework diagram

2. Close reading of the paper

2.1. Abstract

2.2. Introduction

2.3. GAT architecture

2.3.1. Graph attention layer

2.3.2. Comparisons to related work

2.4. Evaluation

2.4.1. Datasets

2.4.2. State-of-the-art method

2.4.3. Experimental setup

2.4.4. Results

2.5. Conclusions

3. Supplementary knowledge

3.1. Spectral and non-spectral approaches for GNN

3.2. Spectral domain and frequency domain

3.3. t-SNE

4. Reference List


1. TL;DR

1.1. Thoughts

(1)The Introduction seems to already include the related work?

(2)Big praise for the Datasets table; I barely need to summarize it myself

1.2. Paper framework diagram

2. Close reading of the paper

2.1. Abstract

        ①They proposed graph attention networks (GATs), which are suitable for both inductive and transductive problems

        ②No special or costly matrix operations are required, and the full graph structure does not need to be known upfront

        ③They tested their model on the Cora, Citeseer and Pubmed citation network datasets and on a protein-protein interaction (PPI) dataset

upfront  adj. paid in advance; frank; candid; honest  adv. in advance, paid beforehand

2.2. Introduction

        ①CNNs have been widely used in machine translation, image classification and semantic segmentation. However, they cannot be applied to non-grid, i.e. irregularly structured, data such as social/telecommunication/biological networks, 3D meshes and brain connectomes; graphs describe such structures more accurately

        ②Early work used recursive neural networks to process data represented as directed acyclic graphs

        ③The introduction then reviews spectral and non-spectral approaches to processing graphs (see Section 3.1)

        ④Since it allows inputs of varying size, the attention mechanism has been successfully used in NLP

        ⑤The attention mechanism can be parallelized across node-neighbor pairs, assigns different weights to different neighbors, and is directly applicable to inductive learning

acyclic  adj. containing no cycles; non-cyclic; non-periodic; (chemistry) not ring-shaped

reminiscent  adj. nostalgic; reminding one of (a person or thing); recalling the past  n. a person who reminisces

2.3. GAT architecture

2.3.1. Graph attention layer

        ①Input: the set of node features:

\mathbf{h}=\{\vec{h}_{1},\vec{h}_{2},\ldots,\vec{h}_{N}\},\vec{h}_{i}\in\mathbb{R}^{F},\mathbf{h}\in \mathbb{R}^{F\times N}

where N denotes the number of nodes and F the number of features of each node

        ②A shared weight matrix \mathbf{W}\in\mathbb{R}^{F'\times F} first transforms the node features into higher-level representations; attention coefficients are then computed as:

e_{ij}=a(\mathbf{W}\vec{h_i},\mathbf{W}\vec{h_j})

where a:\mathbb{R}^{F^{'}}\times \mathbb{R}^{F^{'}}\rightarrow \mathbb{R} is an attention mechanism;

j is a node in the neighborhood of node i;

and e_{ij} indicates the importance of node j's features to node i.

        ③Normalize the coefficients over each node's neighbors with a softmax:

\alpha_{ij}=\text{softmax}_j(e_{ij})=\frac{\exp(e_{ij})}{\sum_{k\in\mathcal{N}_i}\exp(e_{ik})}

where \mathcal{N}_i denotes the neighborhood of node i; in the experiments it consists of the first-order neighbors of i (including i itself).

        ④The attention mechanism a is instantiated as a single-layer feedforward neural network, giving the expanded form:

\alpha_{ij}=\frac{\exp\left(\text{LeakyReLU}\left(\vec{\mathbf{a}}^T[\mathbf{W}\vec{h}_i\|\mathbf{W}\vec{h}_j]\right)\right)}{\sum_{k\in\mathcal{N}_i}\exp\left(\text{LeakyReLU}\left(\vec{\mathbf{a}}^T[\mathbf{W}\vec{h}_i\|\mathbf{W}\vec{h}_k]\right)\right)}

where \vec{\mathbf{a}}\in\mathbb{R}^{2F^{\prime}} is a learnable weight vector;

the LeakyReLU uses a negative input slope of 0.2;

and || denotes concatenation.
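
To make steps ② to ④ concrete, below is a minimal NumPy sketch (my own illustration, not the authors' code) that computes the normalized attention coefficients \alpha_{ij} for a single head; the toy graph, array names and sizes are all assumptions.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.2):
    return np.where(x > 0, x, negative_slope * x)

def attention_coefficients(h, W, a, adj):
    """Normalized attention coefficients for one head.
    h: (N, F) node features, W: (F, F') shared weight matrix,
    a: (2*F',) attention weight vector, adj: (N, N) adjacency with self-loops.
    Returns alpha (N, N), where row i is a softmax over node i's neighborhood,
    and the transformed features Wh (N, F')."""
    Wh = h @ W                                    # step ②: shared linear transform
    Fp = Wh.shape[1]
    # step ④: e_ij = LeakyReLU(a^T [Wh_i || Wh_j]); the dot product splits into
    # a[:F']^T Wh_i + a[F':]^T Wh_j, so no explicit concatenation is needed
    src = Wh @ a[:Fp]                             # (N,) contribution of node i
    dst = Wh @ a[Fp:]                             # (N,) contribution of node j
    e = leaky_relu(src[:, None] + dst[None, :])   # (N, N) raw coefficients
    # step ③: softmax restricted to the neighborhood N_i (masked softmax);
    # non-edges get -inf so exp(.) = 0, which also covers directed graphs
    e = np.where(adj > 0, e, -np.inf)
    e = e - e.max(axis=1, keepdims=True)          # numerical stability
    exp_e = np.exp(e)
    alpha = exp_e / exp_e.sum(axis=1, keepdims=True)
    return alpha, Wh

# toy example with assumed sizes
rng = np.random.default_rng(0)
N, F, F_out = 5, 3, 4
h = rng.normal(size=(N, F))
W = rng.normal(size=(F, F_out))
a = rng.normal(size=(2 * F_out,))
adj = (rng.random((N, N)) > 0.5).astype(float) + np.eye(N)   # random graph + self-loops
alpha, Wh = attention_coefficients(h, W, a, adj)
print(alpha.sum(axis=1))   # each row sums to 1 (up to floating point)
```

Splitting \vec{\mathbf{a}}^T[\mathbf{W}\vec{h}_i\|\mathbf{W}\vec{h}_j] into two dot products lets all coefficients be computed with dense matrix operations, which is just a convenience of this sketch.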

        ⑤Apply a nonlinearity \sigma to obtain the final output features:

\vec{h}_i'=\sigma\left(\sum_{j\in\mathcal{N}_i}\alpha_{ij}\mathbf{W}\vec{h}_j\right)

        ⑥They further introduce multi-head attention with concatenation:

\vec h'_i=\parallel _{k=1}^{K}\sigma\left(\sum\limits_{j\in\mathcal{N}_i}\alpha_{ij}^k\mathbf{W}^k\vec h_j\right)

where \alpha_{ij}^k denotes the normalized attention coefficients calculated by the k-th attention mechanism a^{k}, and \mathbf{W}^k is the corresponding weight matrix

        ⑦In the final (prediction) layer, averaging over the heads is more sensible than concatenation:

\vec h'_i=\sigma\left(\frac1K\sum_{k=1}^K\sum_{j\in\mathcal{N}_i}\alpha_{ij}^k\mathbf{W}^k\vec h_j\right)
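
Continuing that sketch, the following illustrates steps ⑤ to ⑦: each head aggregates its neighbors' transformed features with its attention weights, hidden layers concatenate the heads, and the prediction layer averages them. It reuses attention_coefficients and the toy variables from the previous block; using ELU for \sigma follows the experimental setup in Section 2.4.3.

```python
import numpy as np

def elu(x):
    # ELU nonlinearity; min() avoids overflow warnings for large positive inputs
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def gat_layer(h, Ws, As, adj, merge="concat"):
    """One multi-head GAT layer (steps ⑤-⑦).
    Ws: list of K weight matrices (F, F'); As: list of K attention vectors (2*F',).
    merge='concat' for hidden layers (step ⑥), 'average' for the final layer (step ⑦)."""
    head_outputs = []
    for W, a in zip(Ws, As):
        # attention_coefficients() is the sketch from steps ②-④ above
        alpha, Wh = attention_coefficients(h, W, a, adj)
        head_outputs.append(alpha @ Wh)            # step ⑤: sum_j alpha_ij * W h_j
    if merge == "concat":
        # step ⑥: apply the nonlinearity per head, then concatenate -> (N, K*F')
        return np.concatenate([elu(o) for o in head_outputs], axis=1)
    # step ⑦: average the heads; the final nonlinearity (softmax / logistic sigmoid)
    # would be applied by the prediction step that follows
    return np.mean(head_outputs, axis=0)

# reuse rng, h, adj, F, F_out from the previous sketch
K = 3
Ws = [rng.normal(size=(F, F_out)) for _ in range(K)]
As = [rng.normal(size=(2 * F_out,)) for _ in range(K)]
hidden = gat_layer(h, Ws, As, adj, merge="concat")
print(hidden.shape)   # (N, K * F_out)
```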

        ⑧The figure of this model:

where the left panel shows the attention mechanism and the right panel shows multi-head attention (K=3) aggregating a node's neighborhood

2.3.2. Comparisons to related work

(1)Their improvements:

        ①There is no need for eigendecomposition or other time-consuming matrix computations; furthermore, the K attention heads can be parallelized

        ②GAT allows assigning different importances (weights) to different neighbors

        ③It applies to directed graphs by simply omitting \alpha_{ij} whenever the edge j\rightarrow i is absent

        ④It is directly applicable to inductive learning

        ⑤GraphSAGE samples a fixed-size neighborhood, whereas GAT can attend over the entire neighborhood

        ⑥Compared with MoNet, which uses the nodes' structural properties to compute weights, GAT uses similarity computations on the node features

on par with  equal to, at the same level as

par  n. (of shares) face value, par value; <golf> the standard number of strokes; an average or normal amount or level; the usual standard (of someone's work or health)  adj. at face value; average, normal  vt. <golf> to score par on (a hole)

2.4. Evaluation

        Dataset information:

2.4.1. Datasets

(1)Transductive learning

        ①In the three citation datasets, nodes represent documents, undirected edges represent citations, and node features are the elements of a bag-of-words representation of the document

        ②Training set: 20 nodes per class

(2)Inductive learning

        ①Pre-processing: provided by Hamilton et al. (GraphSAGE)

2.4.2. State-of-the-art method

(1)Transductive learning

        Comparison table:

(2)Inductive learning

        Comparison table:

where Const-GAT uses a constant attention mechanism, i.e. it assigns the same weight to every neighbor

(3)Summary

        They also report an MLP baseline that classifies each node on its own, without using the graph structure

2.4.3. Experimental setup

(1)Transductive learning

        ①They used a two-layer model. The first layer has K=8 attention heads with F'=8 features each, followed by an ELU nonlinearity. The second layer is a single attention head with C output features (C being the number of classes), followed by a softmax.

        ②L2 regularization with \lambda =0.0005 is applied

        ③Dropout rate: 0.6
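
For quick reference, the transductive hyperparameters above can be collected into a small config sketch (the dictionary layout and key names are my own, not from the official code; Pubmed uses slightly different output-layer settings according to the paper):

```python
# Transductive setup (Cora / Citeseer as described above)
transductive_config = {
    "num_layers": 2,
    "layer_1": {"heads": 8, "features_per_head": 8, "activation": "elu"},
    "layer_2": {"heads": 1, "features_per_head": "C (number of classes)",
                "activation": "softmax"},
    "l2_regularization": 0.0005,
    "dropout": 0.6,
}
```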

(2)Inductive learning

        ①They used a three-layer model. The first two layers have K=4 heads with F'=256 features each, each followed by an ELU. The third layer has K=6 heads whose outputs are averaged, followed by a logistic sigmoid.

        ②The training sets are large enough that neither L2 regularization nor dropout was needed

        ③Skip connections are applied across the intermediate attention layer
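
Similarly, a hedged sketch of the inductive (PPI) setup; again the key names are my own:

```python
# Inductive setup (PPI dataset as described above)
inductive_config = {
    "num_layers": 3,
    "layer_1": {"heads": 4, "features_per_head": 256, "activation": "elu"},
    "layer_2": {"heads": 4, "features_per_head": 256, "activation": "elu"},
    "layer_3": {"heads": 6, "features_per_head": "C (number of classes)",
                "merge": "average", "activation": "logistic sigmoid"},
    "skip_connections": "across the intermediate attention layer",
    "l2_regularization": 0.0,
    "dropout": 0.0,
}
```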

(3)Summary

        ①Initialization: Glorot

        ②Optimizer: Adam SGD

        ③Learning rate: 0.01 for Pubmed, and 0.005 for others

        ④Early stopping strategy: patience of 100 epochs
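
The shared training setup can be sketched as a simple early-stopping loop; build_gat, train_one_epoch and evaluate below are hypothetical placeholders, and only the hyperparameters (Glorot initialization, Adam, the learning rates and the patience of 100) come from the paper.

```python
def train_with_early_stopping(dataset_name, build_gat, train_one_epoch, evaluate,
                              max_epochs=100000, patience=100):
    """Hypothetical training driver reflecting the setup described above."""
    lr = 0.01 if dataset_name == "Pubmed" else 0.005                      # ③ learning rates
    model = build_gat(init="glorot", optimizer="adam", learning_rate=lr)  # ① and ②
    best_val, bad_epochs = float("-inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_score = evaluate(model, split="val")
        if val_score > best_val:
            best_val, bad_epochs = val_score, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:        # ④ stop after 100 epochs w/o improvement
                break
    return model
```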

2.4.4. Results

        ①They tuned the competing models' architectures to be similar to GAT's, for a fair comparison

        ②t-SNE visualization of the transformed feature representations, colored by the 7 classes:

where the features come from the first layer of a GAT model trained on the Cora dataset
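
The plot in the paper is a t-SNE projection (see Section 3.3). Here is a minimal sketch of how a similar figure could be produced with scikit-learn; the features and labels below are random placeholders, whereas in practice they would be the first-layer GAT outputs and the 7 Cora classes.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))   # placeholder for first-layer GAT features
labels = rng.integers(0, 7, size=200)   # placeholder for the 7 Cora classes

# project the high-dimensional representations to 2D and color points by class
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=10)
plt.title("t-SNE of first-layer GAT features (placeholder data)")
plt.show()
```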

2.5. Conclusions

        ①They reiterate that the model is computationally efficient: the attention computation is parallelizable and no costly matrix operations are needed

        ②"A particularly interesting research direction would be a thorough analysis of the model's interpretability using the attention mechanism"? Huh??? People have been saying this from 2018 all the way to 2023 and interpretability still has no conclusive results

        ③Extending the model to take edge features into account is considered feasible

3. Supplementary knowledge

3.1. Spectral and non-spectral approaches for GNN

3.2. Spectral domain and frequency domain

(1)Spectral domain: mainly used in GNNs; it applies the Fourier transform along the graph's spatial/structural dimension

(2)Frequency domain: mainly used in signal and image processing; it applies the Fourier transform along the temporal dimension
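
As a small illustration of what "Fourier transform on the graph structure" means for spectral methods, here is a NumPy sketch of the graph Fourier transform (my own example): the eigenvectors of the graph Laplacian act as the Fourier basis, and a node signal is moved into the spectral domain by projecting onto them.

```python
import numpy as np

# toy undirected graph: 4 nodes connected in a ring
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A                            # combinatorial graph Laplacian

eigvals, U = np.linalg.eigh(L)       # columns of U form the graph Fourier basis
x = np.array([1.0, 2.0, 3.0, 4.0])   # a signal defined on the nodes

x_hat = U.T @ x                      # graph Fourier transform (spectral domain)
x_rec = U @ x_hat                    # inverse transform recovers the signal
print(np.allclose(x, x_rec))         # True
```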

3.3. t-SNE

4. Reference List

Velickovic, P. et al. (2018) 'Graph Attention Networks', ICLR 2018. doi: https://doi.org/10.48550/arXiv.1710.10903
