Graph Machine Learning Fundamentals: CS224W (06 & 07 - GNN)

Latest recommended article published on 2024-06-18 12:00:24

ZreviaX

Article link: https://blog.csdn.net/WindGrin_/article/details/137870818

CS224W: Machine Learning with Graphs

Stanford / Winter 2021

06-GNN

Graph Convolutional Networks (GCN)

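Key points
- The convolution in a GCN is implemented as message passing.
- When a node receives messages from its neighbors, it first aggregates them (mean, sum, ...) and then applies a linear transformation with a weight matrix to obtain the node's representation for the next layer. Within a layer, this weight matrix is shared by all nodes.
- Equation form (including the node's own features)
  
  $\mathrm{h}_{v}^{(l+1)}=\sigma\left(\mathrm{W}_{l} \sum_{u \in \mathrm{N}(v)} \frac{\mathrm{h}_{u}^{(l)}}{|\mathrm{N}(v)|}+\mathrm{B}_{l} \mathrm{h}_{v}^{(l)}\right), \forall l \in\{0, \ldots, L-1\}$
- Matrix form (including the node's own features)
  
  $H^{(l+1)}=\sigma\left(\tilde{A} H^{(l)} W_{l}^{\mathrm{T}}+H^{(l)} B_{l}^{\mathrm{T}}\right), \quad \tilde{A}=D^{-1} A$
- Not every GNN has a matrix form; some more complex architectures cannot be written this way.
- The only learnable parameters in this basic GCN are the linear transformation of the aggregated neighbor information ($W_l$) and the linear transformation of the node's own features ($B_l$). Both are shared by all nodes within a layer.
- The number of GCN layers equals the number of hops over which information propagates: a $K$-layer GCN uses information from within each node's $K$-hop neighborhood.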
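
As a minimal sketch of how the two forms relate, the pure-Python snippet below evaluates the equation form node by node and the matrix form for all nodes at once, then checks that they agree. The 4-node graph, the feature values, the weights and the choice of ReLU for $\sigma$ are made-up assumptions for illustration and do not come from the lecture.

```python
# Hypothetical toy graph (undirected): edges 0-1, 0-2, 1-2, 2-3.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]  # one feature row per node
W = [[0.5, -1.0], [1.0, 0.5]]  # W_l: transforms the averaged neighbor features
B = [[1.0, 0.0], [0.0, 1.0]]   # B_l: transforms the node's own features


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def gcn_equation_form(A, H, W, B):
    """h_v' = ReLU(W * mean_{u in N(v)} h_u + B * h_v), one node at a time.

    Assumes every node has at least one neighbor.
    """
    out = []
    for v, row in enumerate(A):
        nbrs = [u for u, a in enumerate(row) if a]
        mean = [sum(H[u][k] for u in nbrs) / len(nbrs) for k in range(len(H[v]))]
        z = [sum(w * m for w, m in zip(W[i], mean)) +
             sum(b * h for b, h in zip(B[i], H[v])) for i in range(len(W))]
        out.append([max(0.0, x) for x in z])
    return out


def gcn_matrix_form(A, H, W, B):
    """H' = ReLU(D^{-1} A H W^T + H B^T), all nodes in one computation."""
    A_tilde = [[a / sum(row) for a in row] for row in A]  # D^{-1} A (row-normalized)
    left = matmul(matmul(A_tilde, H), transpose(W))
    right = matmul(H, transpose(B))
    return [[max(0.0, l + r) for l, r in zip(lrow, rrow)]
            for lrow, rrow in zip(left, right)]


eq = gcn_equation_form(A, H, W, B)
mat = gcn_matrix_form(A, H, W, B)
same = all(abs(a - b) < 1e-9 for ra, rb in zip(eq, mat) for a, b in zip(ra, rb))
print(same)
```

The comparison uses a small tolerance because the two versions add the same floating-point terms in a different order.
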
Matrix Form Analysis
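- Take the formula above as a first example
  
  $H^{(l+1)}=\sigma\left(\tilde{A} H^{(l)} W_{l}^{\mathrm{T}}+H^{(l)} B_{l}^{\mathrm{T}}\right), \quad \tilde{A}=D^{-1} A$
  - $AH^{(l)}$ sums the feature vectors of each node's neighbors and leaves out the node's own feature vector. Because the diagonal of the adjacency matrix is all zeros (no self-loops), row $v$ of $AH^{(l)}$ only adds up the rows of $H^{(l)}$ that belong to $v$'s neighbors. To include the node itself, the diagonal entries have to be made nonzero; that is all there is to it mathematically.
  - $D^{-1}AH^{(l)}$ averages the neighbor information (still without the node's own feature vector).
  - Key fact: left-multiplying a matrix by a diagonal matrix scales each row of that matrix separately.
    
    Note that this holds for the diagonal matrix times the other matrix. Multiplying in the other order scales the columns instead.
    
    The inverse of a diagonal matrix (with nonzero diagonal entries) is still a diagonal matrix, with each diagonal entry replaced by its reciprocal.
    - If all diagonal entries are the same value $n$, the matrix is $nI$ ($n$ times the identity matrix), and multiplying by it scales every entry of the other matrix by $n$.
    - If the diagonal entries differ, say $d_1, d_2, d_3, ..., d_n$, left-multiplying by the diagonal matrix scales each row by its own factor: the first row by $d_1$, the second row by $d_2$, and so on. (Writing out the diagonal matrix and applying the multiplication rule gives this result directly.)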
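- Take the following formula as an example
  
  $H^{l+1}=\sigma\left(A H^{l} W^{l}\right)$
  - It clearly uses only the neighbors and ignores the node's own feature vector.
- Take the following formula as an example
  
  $H^{l+1}=\sigma\left(L H^{l} W^{l}\right), \quad L = D - A$
  - $LH^l = (D-A)H^l = DH^l - AH^l$, where $AH^l$ is the neighbor aggregation and $DH^l$ scales each node's own features by its degree.
  - From another angle, replacing the adjacency matrix $A$ with $L = D - A$ makes the diagonal entries nonzero, so the node's own features enter the computation.
  - So this formula does take the node's own feature vector into account.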
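- Take the following formula as an example
  
  $H^{l+1}=\sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{l} W^{l}\right), \quad \hat{A} = A + I, \quad \hat{D}_{ii} = \sum_{j} \hat{A}_{ij}$
  - $\hat{A} = A + I$ makes the diagonal entries of the adjacency matrix nonzero (it adds a self-loop to every node), so the node's own features are included. $\hat{D}$ is the degree matrix of $\hat{A}$.
  - This is the mainstream GCN formulation, and it includes the node's own features.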
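
The four propagation matrices discussed in this list can be traced by hand on a tiny graph. The sketch below is only an illustration under stated assumptions: the 3-node star graph and the scalar features (1, 10, 100) are made up so that each output entry shows at a glance which nodes contributed to it.

```python
import math


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


# Hypothetical star graph: node 0 is connected to nodes 1 and 2.
A = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
H = [[1.0], [10.0], [100.0]]  # one scalar feature per node, easy to trace by eye
n = len(A)
deg = [sum(row) for row in A]  # node degrees: [2, 1, 1]

# A H: each row is the SUM of the neighbors' features (own feature excluded).
AH = matmul(A, H)

# D^{-1} A H: left-multiplying by the diagonal D^{-1} rescales row i by 1/deg[i],
# which turns the neighbor sum into a neighbor MEAN.
D_inv = [[1.0 / deg[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
mean_AH = matmul(D_inv, AH)
row_scaled = [[x / d for x in row] for row, d in zip(AH, deg)]

# L H with L = D - A: own feature scaled by the degree, minus the neighbor sum.
L = [[(deg[i] if i == j else 0) - A[i][j] for j in range(n)] for i in range(n)]
LH = matmul(L, H)

# Self-loops plus symmetric normalization, with degrees recomputed on A + I.
A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
deg_hat = [sum(row) for row in A_hat]
S = [[A_hat[i][j] / math.sqrt(deg_hat[i] * deg_hat[j]) for j in range(n)]
     for i in range(n)]
SH = matmul(S, H)

print(AH)       # neighbor sums only
print(mean_AH)  # neighbor means only
print(LH)       # own feature enters through the nonzero diagonal of L
print(SH)       # own feature enters through the self-loop
```

Here `mean_AH` and `row_scaled` hold the same numbers, which is the row-scaling property of diagonal matrices in action.
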
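The equation form of a GNN is the update rule for a single node, whereas the matrix form is the update rule for all nodes at once: one matrix computation produces the next-layer feature vectors of every node. A paper does not have to present its GNN architecture in matrix form (although it is preferable when possible); the per-node equation form is acceptable, and the DGL API documentation also uses the equation form.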

07-GNN

Paper : Design Space for Graph Neural Networks

Paper : SEMI-SUPERVISED CLASSIFICATION WITH GRAPH CONVOLUTIONAL NETWORKS

Paper : Inductive Representation Learning on Large Graphs

This lecture mainly discusses the design principles behind today's popular GNN architectures (mainly GCN).

Computation Graph
- A node's neighbors define a computation graph, which represents how messages are aggregated into that node

A Single Layer of a GNN

GNN Layer = Message + Aggregation
Message Computation

$\mathbf{m}_{u}^{(l)}=\mathrm{MSG}^{(l)}\left(\mathbf{h}_{u}^{(l-1)}\right)$

$\mathbf{m}_{u}^{(l)}=\mathbf{W}^{(l)} \mathbf{h}_{u}^{(l-1)}$
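- In the transformation step, the weights can be applied to each neighbor's message separately, or once to the aggregated message (for a linear transformation with Sum or Mean aggregation, both give the same result)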
Message Aggregation

$\mathbf{h}_{v}^{(l)}=\mathrm{AGG}^{(l)}\left(\left\{\mathbf{m}_{u}^{(l)}, u \in N(v)\right\}\right)$

$\mathbf{h}_{v}^{(l)}=\operatorname{Sum}\left(\left\{\mathbf{m}_{u}^{(l)}, u \in N(v)\right\}\right)$
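- This does not use the features of node $v$ itself, so information from $v$ can be lost
  - Solution: include $h_v^{(l-1)}$ when computing $h_v^{(l)}$
    
    $\mathbf{m}_{u}^{(l)}=\mathbf{W}^{(l)} \mathbf{h}_{u}^{(l-1)}$
    
    $\mathbf{m}_{v}^{(l)}=\mathbf{B}^{(l)} \mathbf{h}_{v}^{(l-1)}$
    
    $\mathbf{h}_{v}^{(l)}=\operatorname{CONCAT}\left(\operatorname{AGG}\left(\left\{\mathbf{m}_{u}^{(l)}, u \in N(v)\right\}\right), \mathbf{m}_{v}^{(l)}\right)$
    - The neighbor messages and the node's own feature vector are transformed separately (here with the linear maps $\mathbf{W}^{(l)}$ and $\mathbf{B}^{(l)}$); the neighbor messages are aggregated, and the two results are then combined by concatenation or summation
- Common aggregation operators include $\mathrm{Sum}(\cdot)$, $\mathrm{Mean}(\cdot)$ and $\mathrm{Max}(\cdot)$
- An activation function can be added after the message or the aggregation step to add non-linearity and increase expressive power
- Neighbor aggregation in a GCN is order-invariant: it must not depend on any particular ordering of the neighbors, so aggregating the neighbors' information in any order has to give the same result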
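
The message, aggregation and self-message steps above can be sketched in a few lines of pure Python. The weights, features and the ReLU activation are assumptions chosen for illustration; integer values are used so that every sum is exact whatever the neighbor order.

```python
from itertools import permutations


def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def agg_sum(msgs):
    return [sum(col) for col in zip(*msgs)]


def agg_mean(msgs):
    return [sum(col) / len(msgs) for col in zip(*msgs)]


def agg_max(msgs):
    return [max(col) for col in zip(*msgs)]


def gnn_layer(h_v, h_neighbors, W, B, agg):
    msgs = [matvec(W, h_u) for h_u in h_neighbors]  # message: m_u = W h_u
    m_v = matvec(B, h_v)                            # self message: m_v = B h_v
    combined = agg(msgs) + m_v                      # CONCAT(AGG({m_u}), m_v)
    return [max(0, x) for x in combined]            # ReLU non-linearity


# Hypothetical integer-valued data, so sums are exact in any order.
W = [[1, 0], [1, 1]]
B = [[2, 0], [0, 2]]
h_v = [1, -1]
h_neighbors = [[1, 2], [3, -4], [0, 5]]

for agg in (agg_sum, agg_mean, agg_max):
    # Feed the neighbors in every possible order and collect the distinct outputs.
    outputs = {tuple(gnn_layer(h_v, list(p), W, B, agg))
               for p in permutations(h_neighbors)}
    print(agg.__name__, "order-invariant:", len(outputs) == 1)
```

Swapping the `agg` argument is enough to move between Sum, Mean and Max, which is why the layer is usually described as Message + Aggregation rather than as one fixed formula.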

Classical GCN

Message + Aggregation

$\mathbf{h}_{v}^{(l)}=\sigma\left(\sum_{u \in N(v)} \mathbf{W}^{(l)} \frac{\mathbf{h}_{u}^{(l-1)}}{|N(v)|}\right)$

$\mathbf{h}_{v}^{(l)}=\sigma\left(\operatorname{Sum}\left(\left\{\mathbf{m}_{u}^{(l)}, u \in N(v)\right\}\right)\right)$
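Messages are normalized by the node degree (slightly different from the normalization used in the GCN paper)
- As written, this update sums only over $N(v)$, so the node's own features are not included unless the graph has self-loops. The GCN paper includes them by adding self-loops ($A + I$). DGL's `GraphConv` likewise applies the convolution to the graph it is given, so self-loops have to be added to the graph beforehand (for example with `dgl.add_self_loop`) if the node's own features should be used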

GraphSAGE

Message + Aggregation

$\mathbf{h}_{v}^{(l)}=\sigma\left(\mathbf{W}^{(l)} \cdot \operatorname{CONCAT}\left(\mathbf{h}_{v}^{(l-1)}, \operatorname{AGG}\left(\left\{\mathbf{h}_{u}^{(l-1)}, \forall u \in N(v)\right\}\right)\right)\right)$
Aggregation Strategy
- Mean
  
  $\text { AGG }=\sum_{u \in N(v)} \frac{\mathbf{h}_{u}^{(l-1)}}{|N(v)|}$
- Pool
  
  $\mathrm{AGG}=\operatorname{Mean}\left(\left\{\operatorname{MLP}\left(\mathbf{h}_{u}^{(l-1)}\right), \forall u \in N(v)\right\}\right)$
  The outer function can be $\mathrm{Mean}(\cdot)$ or $\mathrm{Max}(\cdot)$
- LSTM
  
  $\text { AGG }=\operatorname{LSTM}\left(\left[\mathbf{h}_{u}^{(l-1)}, \forall u \in \pi(N(v))\right]\right)$
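  - LSTM aggregation has a problem: neighbor aggregation should be order-invariant, but a sequence model clearly depends on the order of its inputs. The authors therefore apply the LSTM to a random permutation $\pi$ of the neighbors, so that it does not latch onto any particular input order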
L2 Normalization
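- Apply L2 normalization to the node feature vectors at every layer
  
  $\mathbf{h}_{v}^{(l)} \leftarrow \frac{\mathbf{h}_{v}^{(l)}}{\left\|\mathbf{h}_{v}^{(l)}\right\|_{2}}, \quad \forall v \in V$
- Without L2 normalization, the feature vectors have different scales (different L2 norms); after it, every vector has unit L2 norm
- In some cases (NOT ALWAYS), normalization improves performance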
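
A minimal sketch of one GraphSAGE update for a single node, using the Mean aggregator followed by L2 normalization. The feature values, the weight matrix `W` and the use of ReLU for $\sigma$ are assumptions made for this example.

```python
import math


def graphsage_mean_layer(h_v, h_neighbors, W):
    # AGG: element-wise mean of the neighbors' previous-layer embeddings.
    agg = [sum(col) / len(h_neighbors) for col in zip(*h_neighbors)]
    z = h_v + agg  # CONCAT(h_v, AGG(...)) as list concatenation
    out = [max(0.0, sum(w * x for w, x in zip(row, z))) for row in W]  # ReLU(W z)
    norm = math.sqrt(sum(x * x for x in out))
    return [x / norm for x in out] if norm > 0 else out  # L2-normalize


# Hypothetical inputs: a node with two neighbors and 2-dimensional embeddings.
h_v = [1.0, 2.0]
h_neighbors = [[0.0, 2.0], [4.0, 0.0]]
W = [[1.0, 0.0, 1.0, 0.0],   # shape (d_out, 2 * d_in) because of the concatenation
     [0.0, 1.0, 0.0, 1.0]]

h_new = graphsage_mean_layer(h_v, h_neighbors, W)
print(h_new, math.sqrt(sum(x * x for x in h_new)))
```

Because of the concatenation, the weight matrix has twice as many input columns as the embedding dimension. The all-zero case is returned unchanged to avoid dividing by zero.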

GAT

Message + Aggregation

$\mathbf{h}_{v}^{(l)}=\sigma\left(\sum_{u \in N(v)} \alpha_{v u} \mathbf{W}^{(l)} \mathbf{h}_{u}^{(l-1)}\right)$
Attention Mechanism

$e_{v u}=a\left(\mathbf{W}^{(l)} \mathbf{h}_{u}^{(l-1)}, \mathbf{W}^{(l)} \mathbf{h}_{v}^{(l-1)}\right)$

$\alpha_{v u}=\frac{\exp \left(e_{v u}\right)}{\sum_{k \in N(v)} \exp \left(e_{v k}\right)}$
$e_{vu}$ indicates the importance of $u$'s message to node $v$
- Attention can be asymmetric: $e_{vu}$ and $e_{uv}$ may differ, so the attention for a -> b need not equal the attention for b -> a
- The attention function $a$ can take many forms; one example is
  
  $\begin{aligned} &e_{A B}=a\left(\mathbf{W}^{(l)} \mathbf{h}_{A}^{(l-1)}, \mathbf{W}^{(l)} \mathbf{h}_{B}^{(l-1)}\right) \\ &=\text { Linear }\left(\text { Concat }\left(\mathbf{W}^{(l)} \mathbf{h}_{A}^{(l-1)}, \mathbf{W}^{(l)} \mathbf{h}_{B}^{(l-1)}\right)\right) \end{aligned}$
Multi-head Attention
- Stabilize the learning process of attention mechanism
- Create multiple attention scores (each replica with a different set of parameters)
  
  $\begin{aligned} \mathbf{h}_{v}^{(l)}[1] &=\sigma\left(\sum_{u \in N(v)} \alpha_{v u}^{1} \mathbf{W}^{(l)} \mathbf{h}_{u}^{(l-1)}\right) \\ \mathbf{h}_{v}^{(l)}[2] &=\sigma\left(\sum_{u \in N(v)} \alpha_{v u}^{2} \mathbf{W}^{(l)} \mathbf{h}_{u}^{(l-1)}\right) \\ \mathbf{h}_{v}^{(l)}[3] &=\sigma\left(\sum_{u \in N(v)} \alpha_{v u}^{3} \mathbf{W}^{(l)} \mathbf{h}_{u}^{(l-1)}\right) \end{aligned}$
- Outputs are aggregated by concatenation or summation
  
  $\mathbf{h}_{v}^{(l)}=\mathrm{AGG}\left(\mathbf{h}_{v}^{(l)}[1], \mathbf{h}_{v}^{(l)}[2], \mathbf{h}_{v}^{(l)}[3]\right)$
- In multi-head attention, each head can use a different function to compute the attention score between two nodes
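
The attention score, the softmax over neighbors and the multi-head combination can be sketched as follows. This is an illustration under assumptions, not the reference GAT implementation: the attention function is the Linear(Concat(...)) example from above with a hypothetical weight vector `a`, the first argument of the concatenation is the node that receives the message, each head gets its own `(W, a)` pair, and head outputs are concatenated.

```python
import math


def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def attention_score(h_v, h_u, W, a):
    """e_vu = Linear(Concat(W h_v, W h_u)); `a` holds the linear layer's weights."""
    z = matvec(W, h_v) + matvec(W, h_u)  # list concatenation
    return sum(w * x for w, x in zip(a, z))


def attention_weights(h_v, h_neighbors, W, a):
    """alpha_vu: softmax of e_vu over the neighbors u of v."""
    scores = [attention_score(h_v, h_u, W, a) for h_u in h_neighbors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gat_head(h_v, h_neighbors, W, a):
    """One head: ReLU(sum_u alpha_vu * W h_u)."""
    alphas = attention_weights(h_v, h_neighbors, W, a)
    msgs = [matvec(W, h_u) for h_u in h_neighbors]
    agg = [sum(al * m[k] for al, m in zip(alphas, msgs)) for k in range(len(W))]
    return [max(0.0, x) for x in agg]


def multi_head_gat(h_v, h_neighbors, heads):
    """Run every (W, a) head and concatenate the outputs."""
    out = []
    for W, a in heads:
        out.extend(gat_head(h_v, h_neighbors, W, a))
    return out


# Hypothetical data: two neighbors, 2-dimensional features, two heads.
h_v = [1.0, 0.0]
h_neighbors = [[0.0, 1.0], [2.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [(identity, [1.0, 0.0, 0.0, 2.0]),
         (identity, [0.0, 1.0, 1.0, 0.0])]

W, a = heads[0]
alphas = attention_weights(h_v, h_neighbors, W, a)
e_vu = attention_score(h_v, h_neighbors[0], W, a)
e_uv = attention_score(h_neighbors[0], h_v, W, a)  # arguments swapped
h_new = multi_head_gat(h_v, h_neighbors, heads)
print(alphas, e_vu, e_uv, h_new)
```

The two scores `e_vu` and `e_uv` differ because the concatenation order matters, which is the asymmetry noted above. Summing the head outputs instead of concatenating them would keep the output dimension fixed.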

Design Space of Graph Neural Networks

A suggested GNN Layer
Batch Normalization
Dropout
- Dropout is applied to the linear layer (the weights) used in the message transformation
Activation

Stacking Layers of a GNN

The Over-smoothing Problem

The Issue of stacking many GNN layers: GNN suffers from the over-smoothing problem

The over-smoothing problem: all the node embeddings converge to the same (or very similar) values, which makes it hard to tell nodes apart, for example in classification, based on their embeddings
Receptive Field of a GNN: the set of nodes that determine the embedding of a node of interest
- In a K-layer GNN, each node has a receptive field of its K-hop neighborhood
  - In the lecture's example graph, the receptive field of a 3-layer GNN already covers almost all nodes
- Receptive field overlap for two nodes
  - The number of shared neighbors grows quickly as we increase the number of hops (number of GNN layers)
- We know the embedding of a node is determined by its receptive field. If two nodes have highly overlapping receptive fields, their embeddings are highly similar
Stack many GNN layers ➡ nodes will have highly-overlapped receptive fields ➡ node embeddings will be highly similar ➡ suffer from the over-smoothing problem
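
The growth of receptive-field overlap can be checked on a small graph. The sketch below assumes an 8-node ring (a made-up example, not the graph from the lecture) and measures the overlap of two opposite nodes as the share of nodes common to both K-hop neighborhoods (intersection over union).

```python
def k_hop(adj, start, k):
    """Nodes within k hops of `start` (including `start`): its receptive field."""
    seen = {start}
    frontier = {start}
    for _ in range(k):
        frontier = {u for v in frontier for u in adj[v]} - seen
        seen |= frontier
    return seen


# Hypothetical ring graph with 8 nodes: node v is linked to v-1 and v+1.
n = 8
adj = {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}

overlaps = []
for k in range(1, 5):
    field_a = k_hop(adj, 0, k)  # receptive field of node 0 in a k-layer GNN
    field_b = k_hop(adj, 4, k)  # receptive field of the opposite node 4
    overlaps.append(len(field_a & field_b) / len(field_a | field_b))

print(overlaps)
```

Once the overlap reaches 1.0, both nodes aggregate over exactly the same set of nodes, so their embeddings can differ only through how that shared information is weighted along the way.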

Expressive Power for Shallow GNNs

How to make a shallow GNN more expressive

Solution-1: Increase the expressive power within each GNN layer
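- In the layers described so far, the aggregated message is transformed by a single linear layer
- A deeper network can be used for this transformation instead, for example a 3-layer MLP, to increase expressive power
Solution-2: Add layers that do not pass messages
- A GNN architecture does not have to consist only of GNN (message-passing) layers
- We can add MLP layers (applied to each node) before and after GNN layers, as pre-process layers and post-process layers
  - Pre-processing layers: needed when node features have to be encoded first, for example when nodes represent images or text and a DNN must turn them into vectors
  - Post-processing layers: important when reasoning over or transforming the node embeddings is required
- In general, about 3 layers are enough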
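
A sketch of Solution-2 under assumptions: a per-node MLP before and after two mean-aggregation GNN layers on a made-up 4-node path graph, with hypothetical weights. It changes the input of a node that is 3 hops away and checks whether the output moves, which shows that the extra MLP layers add depth without enlarging the receptive field.

```python
def linear(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mlp(weights, x):
    """Per-node MLP: linear layers with ReLU in between (none after the last)."""
    for i, W in enumerate(weights):
        x = linear(W, x)
        if i < len(weights) - 1:
            x = [max(0.0, xi) for xi in x]
    return x


def mean_gnn_layer(h, adj, W):
    """Message-passing layer: ReLU(W * mean of the neighbor embeddings)."""
    out = {}
    for v, nbrs in adj.items():
        mean = [sum(h[u][k] for u in nbrs) / len(nbrs) for k in range(len(h[v]))]
        out[v] = [max(0.0, z) for z in linear(W, mean)]
    return out


def model(x, adj, pre, gnn_weights, post):
    h = {v: mlp(pre, x_v) for v, x_v in x.items()}      # pre-processing, no messages
    for W in gnn_weights:                               # message passing
        h = mean_gnn_layer(h, adj, W)
    return {v: mlp(post, h_v) for v, h_v in h.items()}  # post-processing, no messages


# Hypothetical path graph 0 - 1 - 2 - 3 with one input feature per node.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
pre = [[[1.0], [0.5]]]                                  # one layer, 1 -> 2 dimensions
gnn_weights = [[[1.0, 0.0], [0.0, 1.0]],                # two GNN layers
               [[1.0, 1.0], [0.0, 1.0]]]
post = [[[1.0, -1.0]]]                                  # one layer, 2 -> 1 dimension

x_a = {0: [1.0], 1: [2.0], 2: [3.0], 3: [4.0]}
x_b = dict(x_a)
x_b[3] = [40.0]  # change only node 3, which is 3 hops away from node 0

out_a = model(x_a, adj, pre, gnn_weights, post)
out_b = model(x_b, adj, pre, gnn_weights, post)
print(out_a[0] == out_b[0], out_a[1] == out_b[1])
```

Node 0 is unaffected because node 3 lies outside its 2-hop receptive field, while node 1 (2 hops from node 3) does change.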

Design GNN Layer Connectivity

Paper : Representation Learning on Graphs with Jumping Knowledge Networks

Paper : Residual Networks Behave Like Ensembles of Relatively Shallow Networks

Solution-1.1: Add skip connections in GNNs
- Observation from over-smoothing: Node embeddings in earlier GNN layers can sometimes better differentiate nodes
- Following ResNet, add shortcut connections
- Why do skip connections work?
  - Skip connections effectively create a mixture of models
  - $N$ skip connections give $2^N$ possible forward paths
  - Each path contains at most $N$ modules
  - In effect, we automatically get a mixture of shallow GNNs and deep GNNs
  $\mathbf{h}_{v}^{(l)}=\sigma\left(\sum_{u \in N(v)} \mathbf{W}^{(l)} \frac{\mathbf{h}_{u}^{(l-1)}}{|N(v)|}+\mathbf{h}_{v}^{(l-1)}\right)$
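
The skip-connection update and the path-counting argument can be sketched together. Everything concrete here is an assumption for illustration: the 3-node graph, the features, the square weight matrix `W` (square so that the transformed neighbors and the skipped input have the same shape) and ReLU as $\sigma$.

```python
from itertools import product


def gcn_skip_layer(h, adj, W):
    """h_v' = ReLU(W * mean_{u in N(v)} h_u + h_v), with a square W."""
    out = {}
    for v, nbrs in adj.items():
        mean = [sum(h[u][k] for u in nbrs) / len(nbrs) for k in range(len(h[v]))]
        f = [sum(w * m for w, m in zip(row, mean)) for row in W]  # F(x)
        out[v] = [max(0.0, fx + x) for fx, x in zip(f, h[v])]     # F(x) + x
    return out


# Hypothetical star graph and features.
adj = {0: [1, 2], 1: [0], 2: [0]}
h = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [4.0, 0.0]}
W = [[1.0, 0.0], [0.0, -1.0]]
h_next = gcn_skip_layer(h, adj, W)

# With N skip connections, every module is either applied (1) or bypassed (0).
N = 3
paths = list(product((0, 1), repeat=N))
longest = max(sum(p) for p in paths)  # number of modules on the deepest path
print(h_next, len(paths), longest)
```

Enumerating the apply-or-bypass choices gives one tuple per forward path, so the count is $2^N$ and no path uses more than $N$ modules.
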
Solution-1.2: Other options of skip connections in GNNs
- Directly skip to the last layer: The final layer directly aggregates from all the node embeddings in the previous layers