Graph Augmentation
Raw input graph ≠ computational graph
The computational graph introduced earlier is determined entirely by the relations between nodes, but it does not have to be built that way: the computational graph can also be obtained by other means.
Why not use the raw computational graph?
Because the raw computational graph can have various problems:
- The input graph lacks features
- Graph structure problem
- The graph is too sparse => inefficient message passing
- The graph is too dense => message passing is too costly
- The graph is too large => cannot fit the computational graph into a GPU
How likely is it that the computational graph needs modification?
Very likely: it's unlikely that the input graph happens to be the optimal computation graph for embeddings.
Approaches
Graph feature augmentation
Why do we need feature augmentation?
(1) Input graph does not have node features.
1. Assign constant values to every node
Why do this? What information does it add?
The video says that during aggregation a node can then tell how many nodes it aggregated at each layer, but does that really require a node feature? Isn't it already available from the graph structure?
How is it represented concretely? Just a single number?
2. Assign unique IDs to nodes
The ordering of the nodes is arbitrary.
Cannot generalize to new nodes or new graphs.
Computationally expensive: the node-feature dimension equals the number of nodes, so it becomes very large.
Comparison of the two approaches
Constant features generalize well but learn almost nothing beyond structure; one-hot IDs capture node-specific features but do not work for unseen nodes or large graphs. Neither option is ideal.
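A minimal sketch (not from the lecture, names illustrative) of the two feature options for a featureless graph with `n` nodes, showing the dimension trade-off described above:

```python
def constant_features(n, value=1.0):
    """Every node gets the same 1-dim feature: generalizes to unseen nodes."""
    return [[value] for _ in range(n)]

def one_hot_features(n):
    """Node i gets a unique n-dim ID: node-specific, but the dimension
    grows with n and cannot be extended to unseen nodes."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

X_const = constant_features(4)   # 4 x 1
X_onehot = one_hot_features(4)   # 4 x 4
```

Note that a one-hot feature matrix has n² entries, which is what makes it impractical for large graphs.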
(2) Certain structures are hard to learn by GNN
For example, in some chemical structures the length of the ring a node sits on matters a great deal, but a GNN cannot learn cycle length.
Question: although the computation graphs look identical, the nodes they contain differ; going down from the root, wouldn't the cycle length be revealed the first time a node encounters itself?
Commonly used features
- Node degree
- Clustering coefficient
- PageRank
- Centrality
as well as any feature used in traditional graph machine learning
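A minimal sketch (illustrative, undirected graph stored as an adjacency dict) of computing two of the features above directly from the graph structure:

```python
def degree(adj, v):
    """Number of neighbors of node v."""
    return len(adj[v])

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

# Triangle 0-1-2 plus a pendant node 3 attached to node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(degree(adj, 0))                  # 3
print(clustering_coefficient(adj, 0))  # 1/3: only pair (1,2) of 3 pairs linked
```

These scalar features can be concatenated onto whatever node features already exist.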
Graph structure augmentation
Add virtual nodes / edges
Use when the graph is too sparse.
When the graph is too sparse, message passing is very inefficient; even many GNN layers may fail to aggregate much information (for example, a long chain of nodes with no other connections).
(1) Add virtual edges
Common approach: connect 2-hop neighbors via virtual edges, i.e., add virtual edges between nodes that are two hops apart.
Intuition: instead of using the adjacency matrix $A$ for GNN computation, use $A + A^2$
Advantage: information that previously required two GNN layers to aggregate can now be reached in a single layer, so fewer layers are needed and training is faster.
Drawback: the computation is more expensive, since the graph becomes denser.
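A small numpy sketch of the 2-hop virtual-edge idea on a path graph 0-1-2-3: entries of $A^2$ count 2-hop walks, so thresholding $A + A^2$ adds a direct edge between every pair of nodes within two hops.

```python
import numpy as np

# Path graph 0-1-2-3: nodes 0 and 2 are only reachable in 2 hops.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

A2 = A @ A
A_aug = ((A + A2) > 0).astype(float)  # edge if reachable in 1 or 2 hops
np.fill_diagonal(A_aug, 0)            # drop self-loops introduced by A^2

print(A_aug[0, 2])   # 1.0 -- nodes 0 and 2 are now directly connected
```

The binarization step is one possible choice; one could also keep the raw weights of $A + A^2$.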
(2) Add virtual nodes
课程里只讲到了一种方法:
The virtual node will connect to all the nodes in the graph
This way the distance between any two nodes is at most 2. Greatly improves message passing in sparse graphs; especially useful when some pairs of nodes are very far apart.
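A minimal sketch of adding one virtual node connected to every real node, by extending the adjacency matrix with one extra row and column:

```python
import numpy as np

# A sparse path graph 0-1-2-3; nodes 0 and 3 are 3 hops apart.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]

A_aug = np.zeros((n + 1, n + 1))
A_aug[:n, :n] = A
A_aug[n, :n] = 1.0   # virtual node (index n) connects to all real nodes
A_aug[:n, n] = 1.0

# Now any two real nodes are at most 2 hops apart (via the virtual node).
```

In practice the virtual node would also need its own (e.g., constant) feature vector.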
Node Neighborhood Sampling
Use when the graph is too dense.
For example, a celebrity account in a social network has a huge number of followers; aggregating over every follower would be far too expensive.
Node neighborhood sampling randomly selects a subset of a node's neighbors for message passing.
Effect: trades away some information the dropped neighbors might have contributed in exchange for much less computation; allows for scaling to large graphs (graph scaling has not been covered yet).
To limit the information loss, sample a different set of neighbors at every layer/epoch; this also makes the model more robust. In expectation, we get embeddings similar to the case where all the neighbors are used.
In practice it works great.
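The per-layer sampling described above can be sketched as follows (illustrative names; real systems such as GraphSAGE-style loaders do this in batches):

```python
import random

def sample_neighbors(adj, v, k, rng=random):
    """Cap node v's neighbor list at k, drawing a fresh random sample on
    every call so that different layers/epochs see different neighbors."""
    nbrs = list(adj[v])
    if len(nbrs) <= k:
        return nbrs
    return rng.sample(nbrs, k)

# A "celebrity" node 0 with 5 followers.
adj = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
sampled = sample_neighbors(adj, 0, k=2)
print(len(sampled))   # 2 -- only 2 of node 0's 5 neighbors get aggregated
```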
Sample subgraphs to compute embeddings
This part has not been covered in the course yet.
The graph is too large
Training GNN
Prediction heads
Node-level
We can directly make prediction using node embeddings
Edge-level
Make prediction using pairs of node embeddings
$\hat{y}_{uv}=\mathrm{Head}_{edge}(h_u^{(L)}, h_v^{(L)})$
(1) Concatenation + Linear
$\hat{y}_{uv}=\mathrm{Linear}(\mathrm{Concat}(h_u^{(L)}, h_v^{(L)}))$
(2) Dot product
1-way prediction
$\hat{y}_{uv}=(h_u^{(L)})^T \cdot h_v^{(L)}$
The expression above can only make a 1-way prediction (a single yes/no answer), e.g., whether an edge exists.
k-way prediction
$\hat{y}_{uv}^{(1)}=(h_u^{(L)})^T W^{(1)} h_v^{(L)}$
$\cdots$
$\hat{y}_{uv}^{(k)}=(h_u^{(L)})^T W^{(k)} h_v^{(L)}$
$\hat{y}_{uv}=\mathrm{Concat}(\hat{y}_{uv}^{(1)}, \dots, \hat{y}_{uv}^{(k)})$
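A numpy sketch of the three edge-level heads above (random weights stand in for learned parameters; `d` is the embedding dimension, `k` the number of prediction ways):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
h_u, h_v = rng.normal(size=d), rng.normal(size=d)   # final-layer embeddings

# (1) Concatenation + Linear: a learned map from the concatenated pair
W_lin = rng.normal(size=(1, 2 * d))
y_concat = W_lin @ np.concatenate([h_u, h_v])       # shape (1,)

# (2) Dot product: a single (1-way) score
y_dot = h_u @ h_v

# k-way dot product: one learnable bilinear form W^(i) per output dimension
W = rng.normal(size=(k, d, d))
y_kway = np.array([h_u @ W[i] @ h_v for i in range(k)])
print(y_kway.shape)   # (3,)
```

The k-way variant reduces to the plain dot product when each $W^{(i)}$ is the identity.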
Negative sampling and ranking
Besides the positive examples, training also samples some negative examples; the model is trained so that positives score high and negatives score low, and a ranking metric is used for evaluation.
Questions:
- This was only mentioned at the end of the first video of lecture 10, when link prediction with RGCN was covered, and I have not fully understood it. I don't understand step 4 ("calculate metrics") in the figure; is that part of the ranking evaluation?
- How do we sample negative samples? Do we need to use all possible ones?
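On the second question: a common answer (my sketch, not from the lecture) is not to enumerate all non-edges, but to draw a few random node pairs per positive edge and keep only those that are not real edges:

```python
import random

def sample_negative(edges, num_nodes, rng=random):
    """Rejection-sample one negative example: a uniformly random node
    pair that is not an existing (undirected) edge and not a self-loop."""
    edge_set = set(edges) | {(v, u) for u, v in edges}
    while True:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u != v and (u, v) not in edge_set:
            return (u, v)

edges = [(0, 1), (1, 2), (2, 3)]
neg = sample_negative(edges, num_nodes=5, rng=random.Random(0))
print(neg)
```

Variants corrupt only one endpoint of a positive edge; the rejection loop is the same idea.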
Graph-level
Make prediction using all the node embeddings in our graph
$\hat{y}_{G}=\mathrm{Head}_{graph}(\{ h_v^{(L)}, \forall v \in G \})$
Global pooling
- Global mean pooling
$\hat{y}_{G}=\mathrm{Mean}(\{ h_v^{(L)}, \forall v \in G \})$
Use when the number of nodes in the graph should be ignored.
- Global max pooling
$\hat{y}_{G}=\mathrm{Max}(\{ h_v^{(L)}, \forall v \in G \})$
Question: how is the max over node embeddings defined? Per dimension? (Yes: the max is taken independently in each embedding dimension.)
- Global sum pooling
$\hat{y}_{G}=\mathrm{Sum}(\{ h_v^{(L)}, \forall v \in G \})$
Use when the number of nodes in the graph should be emphasized.
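All three pooling heads operate elementwise over the node axis, which also answers the max-pooling question above; a tiny numpy sketch:

```python
import numpy as np

# 3 node embeddings of dimension 2 (rows = nodes, columns = dimensions).
H = np.array([[1.0, 4.0],
              [3.0, 2.0],
              [2.0, 6.0]])

y_mean = H.mean(axis=0)  # [2., 4.]  -- invariant to node count
y_max  = H.max(axis=0)   # [3., 6.]  -- per-dimension max over nodes
y_sum  = H.sum(axis=0)   # [6., 12.] -- magnitude grows with node count
```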
Global pooling works great for small graphs
Drawback
Global pooling over a (large) graph will lose information
Hierarchical Global Pooling
- Use a community detection or graph partitioning algorithm to split the graph into clusters
- Aggregate within the smallest clusters first, then aggregate the higher-level clusters in turn
Question: how is this done in practice? See the paper Hierarchical Graph Representation Learning with Differentiable Pooling.
A brief sketch of one hierarchical global pooling method
At each level, use two independent GNNs:
- GNN A: Compute node embeddings
- GNN B: Compute the cluster that a node belongs to
For each Pooling layer, use clustering assignments from GNN B to aggregate node embeddings generated by GNN A
Create a single new node for each cluster, maintaining edges between clusters to generate a new pooled network
Jointly train GNN A and GNN B
GNNs A and B at each level can be executed in parallel
This is only the high-level idea; look up the details when it is actually needed.
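The pooling step itself can be sketched in a few lines of numpy, in the spirit of DiffPool (simplified; random values stand in for GNN outputs): GNN A produces node embeddings $Z$, GNN B produces a soft cluster assignment $S$, and each cluster becomes one node of the pooled graph.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 6, 4, 2                     # 6 nodes, dim-4 embeddings, 2 clusters
Z = rng.normal(size=(n, d))           # node embeddings (stand-in for GNN A)
logits = rng.normal(size=(n, c))      # cluster logits (stand-in for GNN B)
S = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row softmax

A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T        # symmetric adjacency, no self-loops

Z_pool = S.T @ Z                      # (c, d): one embedding per cluster
A_pool = S.T @ A @ S                  # (c, c): keeps edges between clusters
print(Z_pool.shape, A_pool.shape)     # (2, 4) (2, 2)
```

In the actual method, $S$ is trained jointly with the embedding GNN, and this step is repeated at every level of the hierarchy.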
Supervised vs Unsupervised (self-supervised)
Supervised
Labels come from external sources
E.g., predict drug likeness of a molecular graph
Supervised labels come from the specific use cases
Advice: Reduce your task to node / edge / graph labels, since they are easy to work with; these three task types already have a large body of prior work to build on.
E.g., we knew some nodes form a cluster. We can treat the cluster that a node belongs to as a node label
Unsupervised (self-supervised)
Signals come from graphs themselves
For example, we can let GNN predict the following:
- Node-level: Node statistics: such as clustering coefficient, PageRank, …
- Edge-level: Link prediction: hide the edge between two nodes, predict if there should be a link
- Graph-level: Graph statistics: for example, predict if two graphs are isomorphic
Both
Use both together
E.g., train a GNN to predict node clustering coefficient
Dataset Split
In image or text datasets the samples are independent, so the dataset can be split arbitrarily.
In graph data, nodes are related to one another, unlike images or text.
Two ways to split
1 Transductive setting
The graph structure is observable in all of the train/dev/test sets.
Only split the (node) labels: the train/dev/test sets all contain the entire set of nodes, but each uses the labels of a different subset of nodes.
Applicable to node / edge level tasks.
2 Inductive setting
Split the graph into multiple independent parts; train/dev/test each use their own part.
- The data may consist of many independent graphs; then simply assign them to the train/dev/test sets.
- The data may be a single graph; then first break it into independent subgraphs, and then form the train/dev/test sets from them.
- It may be split into exactly 3 subgraphs, one for each of train/dev/test.
- Or it may be split into many subgraphs, from which the train/dev/test sets are drawn.
Applicable to node / edge / graph level tasks.
This loses some edges.
Not suitable for small graphs: a small graph has little data to begin with, and removing edges discards even more; training on very little data is likely to overfit.
Question: how exactly do we partition the graph into parts?
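The two settings can be sketched for a node-level task as follows (illustrative names; real partitioning would use a graph partitioning algorithm rather than a random shuffle):

```python
import random

rng = random.Random(0)
nodes = list(range(10))
rng.shuffle(nodes)

# Transductive: one graph everywhere; only the LABELS are split.
train_labels, dev_labels, test_labels = nodes[:6], nodes[6:8], nodes[8:]

# Inductive: the graph itself is partitioned; each split keeps only the
# edges internal to its own node set (edges crossing splits are lost).
def induced_edges(edges, node_set):
    return [(u, v) for u, v in edges if u in node_set and v in node_set]

edges = [(0, 1), (1, 2), (2, 3), (7, 8), (8, 9)]
train_edges = induced_edges(edges, set(train_labels))
```

The dropped cross-split edges are exactly the information loss mentioned above.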
Example: Link Prediction
An unsupervised / self-supervised task:
hide some edges and let the GNN predict whether they exist
Steps
- Split all edges into 2 parts:
  - Message edges: the edges that are not hidden; used as the model's input
  - Supervision edges: the hidden edges to be predicted; used as the model's labels
- Split edges into train / validation / test, choosing one of the two split settings described above:
  - Inductive link prediction split: each subgraph in the train/dev/test sets contains both message edges and supervision edges
  - Transductive link prediction split: the default setting for link prediction
As shown in the figure, the set of edges used grows from training to dev to test, so all edges are split into 4 parts:
(1) Training message edges
(2) Training supervision edges
(3) Validation edges
(4) Test edges
Why do we use growing number of edges?
After training, the supervision edges are known to the GNN, so an ideal model should use them for message passing at validation time. The same applies at test time.
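The four-way transductive split and its cumulative use of edges can be sketched as:

```python
import random

# A small complete graph on 8 nodes: C(8,2) = 28 undirected edges.
rng = random.Random(0)
edges = [(u, v) for u in range(8) for v in range(u + 1, 8)]
rng.shuffle(edges)

train_msg  = edges[:10]    # (1) training message edges
train_sup  = edges[10:16]  # (2) training supervision edges
val_edges  = edges[16:22]  # (3) validation edges
test_edges = edges[22:]    # (4) test edges

# At validation time, pass messages over (1)+(2) and predict (3);
# at test time, pass messages over (1)+(2)+(3) and predict (4).
val_msg  = train_msg + train_sup
test_msg = train_msg + train_sup + val_edges
```

The split sizes here are arbitrary; the point is that the four groups are disjoint while the message-edge set grows at each stage.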