GNN-CS224W: 6-7 Graph Neural Networks

当客

Article link: https://blog.csdn.net/u012492756/article/details/118153715

GNN

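Overview

A GNN defines a deep learning method that can encode graph-structured data in batches.

It can exploit both the local graph structure and the node features.

computational graph

A node's local neighborhood structure defines its computational graph.
Every node's computational graph is evaluated to the same depth.

The input at each node is its feature vector; the links show up in the shape of the node's computational graph.

The same node has a different embedding at each layer.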

a simple method

Computation

A simple approach: sum or average the neighbors' vectors, then pass the result through one MLP layer.

$h_v^0=x_v$

$x_v$ is the input feature vector of node $v$, and $h_v^0$ is the representation of node $v$ at layer 0

For every $l\in \{0,\dots,L-1\}$ (where $L$ is the number of layers), compute:
$h_v^{(l+1)}=\sigma(W_l \sum\limits_{u \in N(v)} \frac{h_u^{(l)}}{|N(v)|} + B_l h_v^{(l)})$

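Neighborhood aggregation: $W_l \sum\limits_{u \in N(v)} \frac{h_u^{(l)}}{|N(v)|}$ averages the neighbors' previous-layer embeddings and multiplies the result by a matrix.
Self transformation: $B_l h_v^{(l)}$ multiplies node $v$'s own previous-layer embedding by a matrix.
Together they give node $v$'s embedding at layer $l+1$, which aggregates its own layer-$l$ embedding and its neighbors' layer-$l$ embeddings.
$W_l$ and $B_l$ are shared by all nodes within a layer and differ from layer to layer.
At layer 0 each node's embedding contains only its own information (depth 0); at layer 1 it contains itself and its direct neighbors (depth 1); at layer 2 it also covers the neighbors of its neighbors (depth 2); layer 3 gives depth 3; and so on. Each additional layer increases the depth of the information in a node's embedding by 1.
The computation runs to a fixed depth.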

$z_v=h_v^{(L)}$

The layer-$L$ representation is the final node embedding
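The update rule above can be sketched per node in NumPy. This is only a minimal illustration, not the course's reference code: the toy path graph, the random weights, the choice of ReLU for $\sigma$, and the helper name `simple_gnn_layer` are all assumptions made for the example, and every node is assumed to have at least one neighbor.

```python
import numpy as np

def simple_gnn_layer(h, neighbors, W, B):
    """One layer: h_v' = sigma(W @ mean_{u in N(v)} h_u + B @ h_v).

    sigma is taken to be ReLU here (an assumption); every node is
    assumed to have at least one neighbor.
    """
    h_next = np.zeros((h.shape[0], W.shape[0]))
    for v, nbrs in neighbors.items():
        mean_nbr = h[nbrs].mean(axis=0)            # average of neighbor embeddings
        h_next[v] = np.maximum(0.0, W @ mean_nbr + B @ h[v])
    return h_next

# Hypothetical toy data: path graph 0 - 1 - 2 with 2-dimensional features.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

rng = np.random.default_rng(0)
# L = 2 layers, each with its own (W_l, B_l), shared by all nodes in that layer.
layers = [(rng.normal(size=(2, 2)), rng.normal(size=(2, 2))) for _ in range(2)]

h = x                      # h_v^0 = x_v
for W, B in layers:
    h = simple_gnn_layer(h, neighbors, W, B)
z = h                      # z_v = h_v^{(L)}
```

Note that the number of parameters depends only on the feature and embedding dimensions, not on the number of nodes in `neighbors`.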

matrix form

Stack the layer-$l$ embeddings of all nodes as:
$H^{(l)}=[h_{1}^{(l)},\dots,h_{|V|}^{(l)}]^T$

Each row of $H^{(l)}$ is the representation of one node.

$A H^{(l)}$ is a matrix of size (node_num, embed_dim), where $A$ is the adjacency matrix; the row for each node is the sum of its neighbors' previous-layer representations.

Let $D$ be the diagonal matrix (only the diagonal entries are nonzero) where $D_{v,v}=Degree(v)=|N(v)|$

Then $D^{-1}$ (the inverse of $D$) is also a diagonal matrix, with $D_{v,v}^{-1}=\frac{1}{|N(v)|}$. (The inverse as such is not the point here; it is enough to know that the diagonal entries are $\frac{1}{|N(v)|}$.)

Left-multiplying $A H^{(l)}$ by $D^{-1}$ divides each node's row by that node's number of neighbors, so neighborhood aggregation can be written as $D^{-1}AH^{(l)}$

$H^{(l+1)}=\sigma(D^{-1}A H^{(l)} W_l^T + H^{(l)} B_l^T)$

Not all GNNs can be expressed in matrix form, for example when the aggregation function is complex
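As a quick sanity check on the matrix form, the following sketch computes $H^{(l+1)}$ with dense NumPy matrices and compares one row against the per-node formula. The adjacency matrix, features, and weight values are made-up toy numbers, and ReLU stands in for $\sigma$ as an assumption.

```python
import numpy as np

# Hypothetical toy graph: path 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # one row per node
W = np.array([[1.0, 2.0], [0.0, 1.0]])               # illustrative weights
B = np.array([[0.5, 0.0], [0.0, 0.5]])

D_inv = np.diag(1.0 / A.sum(axis=1))                 # D^{-1}, entries 1/|N(v)|
H_next = np.maximum(0.0, D_inv @ A @ H @ W.T + H @ B.T)

# Per-node formula for node v = 1, for comparison.
v = 1
nbrs = np.nonzero(A[v])[0]
h_v = np.maximum(0.0, W @ H[nbrs].mean(axis=0) + B @ H[v])

print(np.allclose(H_next[v], h_v))
```

In practice $A$ is sparse, so a sparse matrix product would normally replace the dense `@` used here.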

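learning objective

supervised

If the nodes have labels, the node labels can be used for supervision.
unsupervised

Alternatively, use the graph structure as the supervision. The approach is similar to DeepWalk and node2vec: the product of two node embeddings should represent their similarity.

Cross entropy can be used: define the similarity of similar nodes as 1, then minimize

$\mathcal{L}=\sum\limits_{z_u, z_v} CE(y_{u,v}, Decoder(z_u, z_v))$

where $y_{u,v}$ is the label (the similarity) and $Decoder(z_u, z_v)$ computes the similarity of two node embeddings, for example as a vector dot product.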
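A minimal sketch of this unsupervised loss, assuming a dot-product decoder followed by a sigmoid so that the output can be fed to a binary cross entropy. The sigmoid, the function names, and the toy embeddings and pairs are assumptions for illustration; the notes only say the decoder can be a dot product.

```python
import numpy as np

def decoder(z_u, z_v):
    """Dot-product similarity squashed to (0, 1) by a sigmoid (assumed choice)."""
    return 1.0 / (1.0 + np.exp(-np.dot(z_u, z_v)))

def cross_entropy(y, p, eps=1e-12):
    """Binary cross entropy between label y and predicted similarity p."""
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def unsupervised_loss(z, pairs):
    """pairs: iterable of (u, v, y_uv), with y_uv = 1 for similar nodes, else 0."""
    return sum(cross_entropy(y, decoder(z[u], z[v])) for u, v, y in pairs)

# Hypothetical embeddings: nodes 0 and 1 point the same way, node 2 the opposite way.
z = np.array([[2.0, 0.0], [2.0, 0.1], [-2.0, 0.0]])
pairs = [(0, 1, 1), (0, 2, 0)]                 # (u, v, y_uv)
loss = unsupervised_loss(z, pairs)
flipped = unsupervised_loss(z, [(0, 1, 0), (0, 2, 1)])
print(loss < flipped)
```

Embeddings that agree with the labels give a small loss, and swapping the labels makes the same embeddings score much worse.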

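Advantages

Applies to new nodes
Applies to graphs that change dynamically
Few parameters
1. All nodes share $W_l$ and $B_l$
2. The sizes of $W_l$ and $B_l$ depend only on the embedding dimension and the number of node features, not on the number of nodes in the graph, so the parameter count stays small.
Order/permutation invariant: the method guarantees that the same information is obtained no matter in which order the neighbors are processed, which matters because a node's neighbors have no order.
It is a deep model
It can exploit both the local graph structure and the node features

Disadvantages

This method only uses each node's local information; information about the graph as a whole is still out of reach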

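A General GNN Framework

Defining a GNN mainly means defining the following aspects:

message
What message the child nodes pass on to the next layer, and how it is defined
aggregation

How the messages received from the previous layer are aggregated

Message and aggregation together define one GNN layer; this is where different GNNs mainly differ
Layer connectivity

How the layers are connected to each other
Graph augmentation

Idea: Raw input graph ≠ computational graph
Graph feature augmentation
Graph structure augmentation
Learning objective
1. Supervised/Unsupervised objectives
2. Node/Edge/Graph level objectives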

A single layer of GNN

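The messages arriving from the neighbors and the node's own message from the previous layer form a set with no order, so the processing has to be order invariant

Message computation

Process the previous-layer representations of the neighbor nodes and of the node itself to obtain the messages

message function:
$m_u^{(l)}=MSG^{(l)}(h_u^{(l-1)})$

This is how the layer-$l$ message of node $u$ is obtained

For example: $m_u^{(l)}=W^{(l)} h_u^{(l-1)}$

The node's own message from the previous layer can be computed separately from the neighbors', for example $m_v^{(l)} = B^{(l)} h_v^{(l-1)}$
Aggregation
$h_v^{(l)}= AGG^{(l)}(\{ m_u^{(l)}, u \in N(v) \} \cup \{ m_v^{(l)}\})$

$m_v^{(l)}$ is node $v$'s own message

The aggregation can be sum, average, max, and so on; the neighbor messages and the node's own message can be combined by concatenation, addition, and so on
Activation function
Note that the activation is applied only after the aggregation is complete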
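The three steps (message, aggregation, activation) can be put together in a small sketch. The function name `gnn_layer`, the weights, the tanh activation, and the toy graph are assumptions for the example; the point is only that an order-invariant reduction gives the same result for any neighbor ordering.

```python
import numpy as np

def gnn_layer(h, neighbors, W, B, agg=np.sum, act=np.tanh):
    """Message -> aggregation -> activation for every node.

    agg must be an order-invariant reduction that accepts axis=0
    (np.sum, np.mean, np.max); W and B are illustrative weights.
    """
    h_next = np.empty((h.shape[0], W.shape[0]))
    for v, nbrs in neighbors.items():
        msgs = [W @ h[u] for u in nbrs]        # neighbor messages m_u = W h_u
        msgs.append(B @ h[v])                  # self message m_v = B h_v
        h_next[v] = act(agg(np.stack(msgs), axis=0))   # activation comes last
    return h_next

# Hypothetical toy data.
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.array([[0.5, -1.0], [1.0, 0.5]])
B = np.eye(2)
neighbors = {0: [1, 2], 1: [0], 2: [0]}
shuffled = {0: [2, 1], 1: [0], 2: [0]}         # same graph, different neighbor order

out = gnn_layer(h, neighbors, W, B, agg=np.max)
out_shuffled = gnn_layer(h, shuffled, W, B, agg=np.max)
print(np.allclose(out, out_shuffled))
```

Here the self message is simply added to the set being reduced; concatenating it with the aggregated neighbor message is the other option mentioned above.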

Examples follow

Graph Convolutional Networks (GCN)

This appears to be the simple method described above; this still needs to be confirmed

GraphSage

Extends GCN in several ways
$h_v^{(l)}= \sigma(W^{(l)} \cdot CONCAT(h_v^{(l-1)}, AGG^{(l)}(\{ h_u^{(l-1)}, u \in N(v) \} )))$

AGG here can be sum, average, or another method, and it includes the message computation

There are two aggregation stages:

aggregate the neighbor messages
aggregate the node's own previous-layer information with the neighbor message
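The two stages can be sketched as follows, using the mean as AGG and ReLU as $\sigma$. Both choices, the helper name `graphsage_layer`, and the toy data are assumptions for illustration and not the official GraphSAGE implementation.

```python
import numpy as np

def graphsage_layer(h, neighbors, W):
    """h_v = ReLU(W @ CONCAT(h_v, mean_{u in N(v)} h_u)).

    W has shape (d_out, 2 * d_in) because the self embedding and the
    aggregated neighbor message are concatenated.
    """
    h_next = np.empty((h.shape[0], W.shape[0]))
    for v, nbrs in neighbors.items():
        agg = h[nbrs].mean(axis=0)              # stage 1: aggregate the neighbors
        concat = np.concatenate([h[v], agg])    # stage 2: combine with the node itself
        h_next[v] = np.maximum(0.0, W @ concat)
    return h_next

# Hypothetical toy data: path graph 0 - 1 - 2.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.hstack([np.eye(2), np.eye(2)])           # illustrative weights, shape (2, 4)

h_next = graphsage_layer(h, neighbors, W)
```

With this particular `W` the layer reduces to "self embedding plus neighbor mean", which makes the output easy to check by hand.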

Several aggregation choices

Mean
Pool

Transform neighbor vectors(MLP) and apply symmetric vector function Mean(⋅) or Max(⋅)

For example $AGG=Mean(\{ MLP( h_u^{(l-1)} ), \forall u \in N(v) \})$

Compared with taking the mean directly, this adds one MLP layer
LSTM
$AGG=LSTM([ h_u^{(l-1)} , \forall u \in N(v) ] )$

The neighbors need to be reshuffled for every computation, because an LSTM is order-sensitive while the aggregation has to be order invariant

L2 Normalization

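Apply L2 normalization to $h_v^{(l)}$ at every layer

$h_v^{(l)} \leftarrow \frac{h_v^{(l)}}{||h_v^{(l)}||_2}, \forall v \in V$

The L2 norm of a vector $u$ is $||u||_2=\sqrt{\sum\nolimits_i u_i^2}$, so after normalization the Euclidean length of every vector is always 1. This amounts to placing all vectors on the sphere of radius 1 centered at the origin in n-dimensional space.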
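A short sketch of row-wise L2 normalization in NumPy. The `eps` guard against all-zero rows and the sample matrix are additions for the example, not part of the formula above.

```python
import numpy as np

def l2_normalize(H, eps=1e-12):
    """Divide every row of H by its L2 norm (eps guards against zero rows)."""
    norms = np.linalg.norm(H, axis=1, keepdims=True)
    return H / np.maximum(norms, eps)

# Hypothetical embeddings with very different scales.
H = np.array([[3.0, 4.0], [0.5, 0.0]])
H_unit = l2_normalize(H)
print(np.linalg.norm(H_unit, axis=1))
```

After the call, both rows have the same length even though the inputs had norms 5 and 0.5.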

Question: why would this lead to better results?

Effect:

Without L2 normalization, the embedding vectors have different scales (L2-norms)
In some cases (not always), normalization of embedding results in performance improvement
After L2 normalization, all vectors will have the same L2-norm, namely 1

Question: is this applied before or after the activation?

Graph Attention Networks (GAT)

$h_v^{(l)}= \sigma( \sum\limits_{u \in N(v)} \alpha_{vu} W^{(l)} h_u^{(l-1)})$

$\alpha_{vu}$ is the attention weight; it expresses how important the message from $v$'s neighbor $u$ is

Why use attention?

In GCN and GraphSage, the messages from a node's neighbors are all treated equally. For example, GCN uses $W_l \sum\limits_{u \in N(v)} \frac{h_u^{(l)}}{|N(v)|}$, which sums the messages of all neighbors of node $v$ and divides by the number of neighbors.

But not all neighbors are equally important; the NN should devote more computing power to the small but important part of the data.

How is the attention obtained?

Which part of the data is more important depends on the context and is learned through training.

Compute the attention coefficient $e_{vu}$
$e_{vu}=a(W^{(l)} h_u^{(l-1)}, W^{(l)} h_v^{(l-1)})$

The function $a$ can be chosen in different ways. Some examples:
1. It can be a simple single-layer neural network, in which case
  $e_{vu}=a(W^{(l)} h_u^{(l-1)}, W^{(l)} h_v^{(l-1)})=Linear(Concat(W^{(l)} h_u^{(l-1)}, W^{(l)} h_v^{(l-1)}))$
  
  This approach is finicky and sometimes hard to get to converge, so the multi-head attention below is worth trying
2. Multi-head attention
  
  It stabilizes the learning process of the attention mechanism
  
  Attention is still computed as above, but several different attention heads are computed at the same time and aggregated at the end
Normalize $e_{vu}$ into the final attention weight $\alpha_{vu}$
Use softmax for the normalization, so that $\sum_{k \in N(v)}\alpha_{vk}=1$
$\alpha_{vu}=\frac{\exp(e_{vu})}{\sum_{k \in N(v)}\exp(e_{vk})}$

Attention can be asymmetric: when node v's weight on u is large, node u's weight on v may be either small or large.
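The attention computation can be sketched for a single head and a single node as below. The scoring function follows the Linear(Concat(...)) form from these notes, with a bias-free linear layer represented by a hypothetical parameter vector `a_vec`; the tanh activation, the function names, and the toy values are likewise assumptions, and this is not the reference GAT implementation.

```python
import numpy as np

def attention_weights(h, v, nbrs, W, a_vec):
    """alpha_{vu} for u in nbrs: softmax over e_{vu} = a_vec . CONCAT(W h_u, W h_v)."""
    Wh_v = W @ h[v]
    e = np.array([a_vec @ np.concatenate([W @ h[u], Wh_v]) for u in nbrs])
    e = e - e.max()                    # numerical stability; the softmax is unchanged
    w = np.exp(e)
    return w / w.sum()

def gat_update(h, v, nbrs, W, a_vec):
    """h_v = sigma(sum_u alpha_{vu} W h_u), with sigma = tanh (assumed)."""
    alpha = attention_weights(h, v, nbrs, W, a_vec)
    return np.tanh(sum(al * (W @ h[u]) for al, u in zip(alpha, nbrs)))

# Hypothetical toy data: node 0 attends over its neighbors 1 and 2.
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.eye(2)
a_vec = np.array([1.0, 0.0, 0.0, 0.0])   # scores a neighbor by its first coordinate

alpha = attention_weights(h, 0, [1, 2], W, a_vec)
h0_new = gat_update(h, 0, [1, 2], W, a_vec)
```

With `a_vec` set to all zeros every neighbor gets the same weight, which recovers the plain mean aggregation. For multi-head attention, one would run this with several independent `(W, a_vec)` pairs and aggregate the resulting vectors.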

GNN Layer in Practice

Modern deep learning modules can be included into a GNN layer for better performance

In GNN, Dropout is applied to the linear layer in the message function

Parametric ReLU (PReLU) empirically performs better than ReLU

GraphGym is a tool for quickly testing the performance of graph network designs

Stacking GNN Layers

Stack GNN layers sequentially

the over-smoothing problem

all the node embeddings converge to the same value

We need to encode information that differs from node to node; if all embeddings are identical, they carry very little information

Why does it happen?

Receptive field: the set of nodes that determine the embedding of a node of interest (question: what does "of interest" mean here?)
In a K-layer GNN, each node has a receptive field of K-hop neighborhood

As the number of hops grows, the receptive field overlap for two nodes (the shared neighbors) grows very quickly

The embedding of a node is determined by its receptive field. If two nodes have highly overlapping receptive fields, then their embeddings are highly similar
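To make the overlap concrete, the sketch below computes K-hop receptive fields on a small hypothetical path graph and counts the nodes shared by two of them. The graph, the chosen node pair, and the helper name `receptive_field` are assumptions for the example; on denser real graphs the overlap typically grows much faster than on a path.

```python
def receptive_field(adj, v, k):
    """Set of nodes within k hops of v, including v itself."""
    field = {v}
    frontier = {v}
    for _ in range(k):
        # Expand one hop and keep only nodes not seen before.
        frontier = {u for w in frontier for u in adj[w]} - field
        field |= frontier
    return field

# Hypothetical path graph 0 - 1 - 2 - 3 - 4 - 5.
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

overlaps = []
for k in (1, 2, 3):
    shared = receptive_field(adj, 1, k) & receptive_field(adj, 4, k)
    overlaps.append(len(shared))
print(overlaps)
```

Nodes 1 and 4 share nothing after one hop, but after three hops four of the six nodes in the graph feed into both embeddings.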