[Paper Close Reading] Graph Attention Networks

Paper: [1710.10903] Graph Attention Networks (arxiv.org)

Code: https://github.com/PetarV-/GAT

The English is typed entirely by hand and is a summarizing and paraphrasing of the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments! This post is more of a set of notes, so read with that in mind!

Table of Contents

1. TL;DR

1.1. Thoughts

1.2. Paper framework diagram

2. Close reading of the paper

2.1. Abstract

2.2. Introduction

2.3. GAT architecture

2.3.1. Graph attention layer

2.3.2. Comparisons to related work

2.4. Evaluation

2.4.1. Datasets

2.4.2. State-of-the-art method

2.4.3. Experimental setup

2.4.4. Results

2.5. Conclusions

3. Supplementary knowledge

3.1. Spectral and non-spectral approaches for GNN

3.2. Spectral domain and frequency domain

3.3. t-SNE

4. Reference List


1. TL;DR

1.1. Thoughts

(1)The Introduction seems to already include the related work?

(2)Big praise for the Datasets table; I barely need to summarize it myself

1.2. Paper framework diagram

2. Close reading of the paper

2.1. Abstract

        ①They proposed graph attention networks (GATs), which are suitable for both inductive and transductive problems

        ②No special or costly matrix operations are required, and the full graph structure does not need to be known upfront

        ③They tested their model on the Cora, Citeseer and Pubmed citation network datasets and on a protein-protein interaction (PPI) dataset

upfront  adj. paid in advance; frank; candid; honest  adv. in advance, paid beforehand

2.2. Introduction

        ①CNNs have been widely used in machine translation, image classification and semantic segmentation. However, they cannot be applied to non-grid, i.e. irregularly structured, data such as social/telecommunication/biological networks, 3D meshes and brain connectomes; graphs describe such structures more accurately

        ②Early work used recursive neural networks to process data represented as directed acyclic graphs

        ③The introduction then reviews spectral and non-spectral approaches to processing graphs (see Section 3.1)

        ④Since it allows inputs of varying size, the attention mechanism has been successfully used in NLP

        ⑤The attention mechanism can be parallelized across node-neighbor pairs, assigns different weights to different neighbors, and is directly applicable to inductive learning

acyclic  adj. containing no cycles; non-cyclic; non-periodic; (chemistry) not ring-shaped

reminiscent  adj. nostalgic; reminding one of (a person or thing); recalling the past  n. a person who reminisces

2.3. GAT architecture

2.3.1. Graph attention layer

        ①Input: the set of node features:

\mathbf{h}=\{\vec{h}_{1},\vec{h}_{2},\ldots,\vec{h}_{N}\},\vec{h}_{i}\in\mathbb{R}^{F},\mathbf{h}\in \mathbb{R}^{F\times N}

where N denotes the number of nodes and F the number of features of each node

        ②A shared weight matrix \mathbf{W}\in\mathbb{R}^{F'\times F} first transforms the node features into higher-level representations; attention coefficients are then computed as:

e_{ij}=a(\mathbf{W}\vec{h_i},\mathbf{W}\vec{h_j})

where a:\mathbb{R}^{F^{'}}\times \mathbb{R}^{F^{'}}\rightarrow \mathbb{R} is an attention mechanism;

j is a node in the neighborhood of node i;

and e_{ij} indicates the importance of node j's features to node i.

        ③Normalize the coefficients over each node's neighbors with a softmax:

\alpha_{ij}=\text{softmax}_j(e_{ij})=\frac{\exp(e_{ij})}{\sum_{k\in\mathcal{N}_i}\exp(e_{ik})}

where \mathcal{N}_i denotes the neighborhood of node i; in the experiments it consists of the first-order neighbors of i (including i itself).

        ④The attention mechanism a is instantiated as a single-layer feedforward neural network, giving the expanded form:

\alpha_{ij}=\frac{\exp\left(\text{LeakyReLU}\left(\vec{\mathbf{a}}^T[\mathbf{W}\vec{h}_i\|\mathbf{W}\vec{h}_j]\right)\right)}{\sum_{k\in\mathcal{N}_i}\exp\left(\text{LeakyReLU}\left(\vec{\mathbf{a}}^T[\mathbf{W}\vec{h}_i\|\mathbf{W}\vec{h}_k]\right)\right)}

where \vec{\mathbf{a}}\in\mathbb{R}^{2F^{\prime}} is a learnable weight vector;

the LeakyReLU uses a negative input slope of 0.2;

and || denotes concatenation.
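
To make steps ② to ④ concrete, below is a minimal NumPy sketch (my own illustration, not the authors' code) that computes the normalized attention coefficients \alpha_{ij} for a single head; the toy graph, array names and sizes are all assumptions.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.2):
    return np.where(x > 0, x, negative_slope * x)

def attention_coefficients(h, W, a, adj):
    """Normalized attention coefficients for one head.
    h: (N, F) node features, W: (F, F') shared weight matrix,
    a: (2*F',) attention weight vector, adj: (N, N) adjacency with self-loops.
    Returns alpha (N, N), where row i is a softmax over node i's neighborhood,
    and the transformed features Wh (N, F')."""
    Wh = h @ W                                    # step ②: shared linear transform
    Fp = Wh.shape[1]
    # step ④: e_ij = LeakyReLU(a^T [Wh_i || Wh_j]); the dot product splits into
    # a[:F']^T Wh_i + a[F':]^T Wh_j, so no explicit concatenation is needed
    src = Wh @ a[:Fp]                             # (N,) contribution of node i
    dst = Wh @ a[Fp:]                             # (N,) contribution of node j
    e = leaky_relu(src[:, None] + dst[None, :])   # (N, N) raw coefficients
    # step ③: softmax restricted to the neighborhood N_i (masked softmax);
    # non-edges get -inf so exp(.) = 0, which also covers directed graphs
    e = np.where(adj > 0, e, -np.inf)
    e = e - e.max(axis=1, keepdims=True)          # numerical stability
    exp_e = np.exp(e)
    alpha = exp_e / exp_e.sum(axis=1, keepdims=True)
    return alpha, Wh

# toy example with assumed sizes
rng = np.random.default_rng(0)
N, F, F_out = 5, 3, 4
h = rng.normal(size=(N, F))
W = rng.normal(size=(F, F_out))
a = rng.normal(size=(2 * F_out,))
adj = (rng.random((N, N)) > 0.5).astype(float) + np.eye(N)   # random graph + self-loops
alpha, Wh = attention_coefficients(h, W, a, adj)
print(alpha.sum(axis=1))   # each row sums to 1 (up to floating point)
```

Splitting \vec{\mathbf{a}}^T[\mathbf{W}\vec{h}_i\|\mathbf{W}\vec{h}_j] into two dot products lets all coefficients be computed with dense matrix operations, which is just a convenience of this sketch.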

        ⑤Apply a nonlinearity \sigma to obtain the final output features:

\vec{h}_i'=\sigma\left(\sum_{j\in\mathcal{N}_i}\alpha_{ij}\mathbf{W}\vec{h}_j\right)

        ⑥They further introduce multi-head attention with concatenation:

\vec h'_i=\parallel _{k=1}^{K}\sigma\left(\sum\limits_{j\in\mathcal{N}_i}\alpha_{ij}^k\mathbf{W}^k\vec h_j\right)

where \alpha_{ij}^k denotes the normalized attention coefficients calculated by the k-th attention mechanism a^{k}, and \mathbf{W}^k is the corresponding weight matrix

        ⑦In the final (prediction) layer, averaging over the heads is more sensible than concatenation:

\vec h'_i=\sigma\left(\frac1K\sum_{k=1}^K\sum_{j\in\mathcal{N}_i}\alpha_{ij}^k\mathbf{W}^k\vec h_j\right)
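
Continuing that sketch, the following illustrates steps ⑤ to ⑦: each head aggregates its neighbors' transformed features with its attention weights, hidden layers concatenate the heads, and the prediction layer averages them. It reuses attention_coefficients and the toy variables from the previous block; using ELU for \sigma follows the experimental setup in Section 2.4.3.

```python
import numpy as np

def elu(x):
    # ELU nonlinearity; min() avoids overflow warnings for large positive inputs
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def gat_layer(h, Ws, As, adj, merge="concat"):
    """One multi-head GAT layer (steps ⑤-⑦).
    Ws: list of K weight matrices (F, F'); As: list of K attention vectors (2*F',).
    merge='concat' for hidden layers (step ⑥), 'average' for the final layer (step ⑦)."""
    head_outputs = []
    for W, a in zip(Ws, As):
        # attention_coefficients() is the sketch from steps ②-④ above
        alpha, Wh = attention_coefficients(h, W, a, adj)
        head_outputs.append(alpha @ Wh)            # step ⑤: sum_j alpha_ij * W h_j
    if merge == "concat":
        # step ⑥: apply the nonlinearity per head, then concatenate -> (N, K*F')
        return np.concatenate([elu(o) for o in head_outputs], axis=1)
    # step ⑦: average the heads; the final nonlinearity (softmax / logistic sigmoid)
    # would be applied by the prediction step that follows
    return np.mean(head_outputs, axis=0)

# reuse rng, h, adj, F, F_out from the previous sketch
K = 3
Ws = [rng.normal(size=(F, F_out)) for _ in range(K)]
As = [rng.normal(size=(2 * F_out,)) for _ in range(K)]
hidden = gat_layer(h, Ws, As, adj, merge="concat")
print(hidden.shape)   # (N, K * F_out)
```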

        ⑧The figure of this model:

where the left panel shows the attention mechanism and the right panel shows multi-head attention (K=3) aggregating a node's neighborhood

2.3.2. Comparisons to related work

(1)Their improvements:

        ①There is no need for eigendecomposition or other time-consuming matrix computations; furthermore, the K attention heads can be parallelized

        ②GAT allows assigning different importances (weights) to different neighbors

        ③It applies to directed graphs by simply omitting \alpha_{ij} whenever the edge j\rightarrow i is absent

        ④It is directly applicable to inductive learning

        ⑤GraphSAGE samples a fixed-size neighborhood, whereas GAT can attend over the entire neighborhood

        ⑥Compared with MoNet, which uses the nodes' structural properties to compute weights, GAT uses similarity computations on the node features

on par with  equal to, at the same level as

par  n. (of shares) face value, par value; <golf> the standard number of strokes; an average or normal amount or level; the usual standard (of someone's work or health)  adj. at face value; average, normal  vt. <golf> to score par on (a hole)

2.4. Evaluation

        Dataset information:

2.4.1. Datasets

(1)Transductive learning

        ①In the three citation datasets, nodes represent documents, undirected edges represent citations, and node features are the elements of a bag-of-words representation of the document

        ②Training set: 20 nodes per class

(2)Inductive learning

        ①Pre-processing: provided by Hamilton et al. (GraphSAGE)

2.4.2. State-of-the-art method

(1)Transductive learning

        Comparison table:

(2)Inductive learning

        Comparison table:

where Const-GAT uses a constant attention mechanism, i.e. it assigns the same weight to every neighbor

(3)Summary

        They also report an MLP baseline that classifies each node on its own, without using the graph structure

2.4.3. Experimental setup

(1)Transductive learning

        ①They used a two-layer model. The first layer has K=8 attention heads with F'=8 features each, followed by an ELU nonlinearity. The second layer is a single attention head with C output features (C being the number of classes), followed by a softmax.

        ②L2 regularization with \lambda =0.0005 is applied

        ③Dropout rate: 0.6
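
For quick reference, the transductive hyperparameters above can be collected into a small config sketch (the dictionary layout and key names are my own, not from the official code; Pubmed uses slightly different output-layer settings according to the paper):

```python
# Transductive setup (Cora / Citeseer as described above)
transductive_config = {
    "num_layers": 2,
    "layer_1": {"heads": 8, "features_per_head": 8, "activation": "elu"},
    "layer_2": {"heads": 1, "features_per_head": "C (number of classes)",
                "activation": "softmax"},
    "l2_regularization": 0.0005,
    "dropout": 0.6,
}
```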

(2)Inductive learning

        ①They used a three-layer model. The first two layers have K=4 heads with F'=256 features each, each followed by an ELU. The third layer has K=6 heads whose outputs are averaged, followed by a logistic sigmoid.

        ②The training sets are large enough that neither L2 regularization nor dropout was needed

        ③Skip connections are applied across the intermediate attention layer
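
Similarly, a hedged sketch of the inductive (PPI) setup; again the key names are my own:

```python
# Inductive setup (PPI dataset as described above)
inductive_config = {
    "num_layers": 3,
    "layer_1": {"heads": 4, "features_per_head": 256, "activation": "elu"},
    "layer_2": {"heads": 4, "features_per_head": 256, "activation": "elu"},
    "layer_3": {"heads": 6, "features_per_head": "C (number of classes)",
                "merge": "average", "activation": "logistic sigmoid"},
    "skip_connections": "across the intermediate attention layer",
    "l2_regularization": 0.0,
    "dropout": 0.0,
}
```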

(3)Summary

        ①Initialization: Glorot

        ②Optimizer: Adam SGD

        ③Learning rate: 0.01 for Pubmed, and 0.005 for others

        ④Early stopping strategy: patience of 100 epochs
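
The shared training setup can be sketched as a simple early-stopping loop; build_gat, train_one_epoch and evaluate below are hypothetical placeholders, and only the hyperparameters (Glorot initialization, Adam, the learning rates and the patience of 100) come from the paper.

```python
def train_with_early_stopping(dataset_name, build_gat, train_one_epoch, evaluate,
                              max_epochs=100000, patience=100):
    """Hypothetical training driver reflecting the setup described above."""
    lr = 0.01 if dataset_name == "Pubmed" else 0.005                      # ③ learning rates
    model = build_gat(init="glorot", optimizer="adam", learning_rate=lr)  # ① and ②
    best_val, bad_epochs = float("-inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_score = evaluate(model, split="val")
        if val_score > best_val:
            best_val, bad_epochs = val_score, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:        # ④ stop after 100 epochs w/o improvement
                break
    return model
```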

2.4.4. Results

        ①They tuned the competing models' architectures to be similar to GAT's, for a fair comparison

        ②t-SNE visualization of the transformed feature representations, colored by the 7 classes:

where the features come from the first layer of a GAT model trained on the Cora dataset
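
The plot in the paper is a t-SNE projection (see Section 3.3). Here is a minimal sketch of how a similar figure could be produced with scikit-learn; the features and labels below are random placeholders, whereas in practice they would be the first-layer GAT outputs and the 7 Cora classes.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))   # placeholder for first-layer GAT features
labels = rng.integers(0, 7, size=200)   # placeholder for the 7 Cora classes

# project the high-dimensional representations to 2D and color points by class
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=10)
plt.title("t-SNE of first-layer GAT features (placeholder data)")
plt.show()
```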

2.5. Conclusions

        ①They reiterate that the model is computationally efficient: the attention computation is parallelizable and no costly matrix operations are needed

        ②"A particularly interesting research direction would be a thorough analysis of the model's interpretability using the attention mechanism"? Huh??? People have been saying this from 2018 all the way to 2023 and interpretability still has no conclusive results

        ③Extending the model to take edge features into account is considered feasible

3. Supplementary knowledge

3.1. Spectral and non-spectral approaches for GNN

3.2. Spectral domain and frequency domain

(1)Spectral domain: mainly used in GNNs; it applies the Fourier transform along the graph's spatial/structural dimension

(2)Frequency domain: mainly used in signal and image processing; it applies the Fourier transform along the temporal dimension
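
As a small illustration of what "Fourier transform on the graph structure" means for spectral methods, here is a NumPy sketch of the graph Fourier transform (my own example): the eigenvectors of the graph Laplacian act as the Fourier basis, and a node signal is moved into the spectral domain by projecting onto them.

```python
import numpy as np

# toy undirected graph: 4 nodes connected in a ring
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
L = D - A                            # combinatorial graph Laplacian

eigvals, U = np.linalg.eigh(L)       # columns of U form the graph Fourier basis
x = np.array([1.0, 2.0, 3.0, 4.0])   # a signal defined on the nodes

x_hat = U.T @ x                      # graph Fourier transform (spectral domain)
x_rec = U @ x_hat                    # inverse transform recovers the signal
print(np.allclose(x, x_rec))         # True
```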

3.3. t-SNE

4. Reference List

Velickovic, P. et al. (2018) 'Graph Attention Networks', ICLR 2018. doi: https://doi.org/10.48550/arXiv.1710.10903
