Graph Augmentation
Raw input graph ≠ computational graph
The computational graph introduced earlier is determined entirely by the relations between nodes, but it does not have to be built that way: the computational graph can also be obtained by other means.
Why not use the raw computational graph?
Because the raw computational graph can have various problems:
- The input graph lacks features
- Graph structure problem
- The graph is too sparse => inefficient message passing
- The graph is too dense => message passing is too costly
- The graph is too large => cannot fit the computational graph into a GPU
How likely is it that the computational graph needs modification?
Very likely: it's unlikely that the input graph happens to be the optimal computation graph for embeddings.
Approaches
Graph feature augmentation
Why do we need feature augmentation?
(1) Input graph does not have node features.
1. Assign constant values to every node
Why do this? What information does it add?
The video says that during aggregation a node can then tell how many nodes it aggregated at each layer, but does that really require a node feature? Isn't it already available from the graph structure?
How is it represented concretely? Just a single number?
2. Assign unique IDs to nodes
The ordering of the nodes is arbitrary.
Cannot generalize to new nodes or new graphs.
Computationally expensive: the node-feature dimension equals the number of nodes, so it becomes very large.
Comparison of the two approaches
Constant features generalize well but learn almost nothing beyond structure; one-hot IDs capture node-specific features but do not work for unseen nodes or large graphs. Neither option is ideal.
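A minimal sketch (not from the lecture, names illustrative) of the two feature options for a featureless graph with `n` nodes, showing the dimension trade-off described above:

```python
def constant_features(n, value=1.0):
    """Every node gets the same 1-dim feature: generalizes to unseen nodes."""
    return [[value] for _ in range(n)]

def one_hot_features(n):
    """Node i gets a unique n-dim ID: node-specific, but the dimension
    grows with n and cannot be extended to unseen nodes."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

X_const = constant_features(4)   # 4 x 1
X_onehot = one_hot_features(4)   # 4 x 4
```

Note that a one-hot feature matrix has n² entries, which is what makes it impractical for large graphs.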
(2) Certain structures are hard to learn by GNN
For example, in some chemical structures the length of the ring a node sits on matters a great deal, but a GNN cannot learn cycle length.
Question: although the computation graphs look identical, the nodes they contain differ; going down from the root, wouldn't the cycle length be revealed the first time a node encounters itself?
Commonly used features
- Node degree
- Clustering coefficient
- PageRank
- Centrality
as well as any feature used in traditional graph machine learning
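A minimal sketch (illustrative, undirected graph stored as an adjacency dict) of computing two of the features above directly from the graph structure:

```python
def degree(adj, v):
    """Number of neighbors of node v."""
    return len(adj[v])

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (k * (k - 1))

# Triangle 0-1-2 plus a pendant node 3 attached to node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(degree(adj, 0))                  # 3
print(clustering_coefficient(adj, 0))  # 1/3: only pair (1,2) of 3 pairs linked
```

These scalar features can be concatenated onto whatever node features already exist.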
Graph structure augmentation
Add virtual nodes / edges
Use when the graph is too sparse.
When the graph is too sparse, message passing is very inefficient; even many GNN layers may fail to aggregate much information (for example, a long chain of nodes with no other connections).
(1) Add virtual edges
Common approach: connect 2-hop neighbors via virtual edges, i.e., add virtual edges between nodes that are two hops apart.
Intuition: instead of using the adjacency matrix $A$ for GNN computation, use $A + A^2$
Advantage: information that previously required two GNN layers to aggregate can now be reached in a single layer, so fewer layers are needed and training is faster.
Drawback: the computation is more expensive, since the graph becomes denser.
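A small numpy sketch of the 2-hop virtual-edge idea on a path graph 0-1-2-3: entries of $A^2$ count 2-hop walks, so thresholding $A + A^2$ adds a direct edge between every pair of nodes within two hops.

```python
import numpy as np

# Path graph 0-1-2-3: nodes 0 and 2 are only reachable in 2 hops.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

A2 = A @ A
A_aug = ((A + A2) > 0).astype(float)  # edge if reachable in 1 or 2 hops
np.fill_diagonal(A_aug, 0)            # drop self-loops introduced by A^2

print(A_aug[0, 2])   # 1.0 -- nodes 0 and 2 are now directly connected
```

The binarization step is one possible choice; one could also keep the raw weights of $A + A^2$.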
(2) Add virtual nodes
课程里只讲到了一种方法:
The virtual node will connect to all the nodes in the graph
This way the distance between any two nodes is at most 2. Greatly improves message passing in sparse graphs; especially useful when some pairs of nodes are very far apart.
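A minimal sketch of adding one virtual node connected to every real node, by extending the adjacency matrix with one extra row and column:

```python
import numpy as np

# A sparse path graph 0-1-2-3; nodes 0 and 3 are 3 hops apart.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]

A_aug = np.zeros((n + 1, n + 1))
A_aug[:n, :n] = A
A_aug[n, :n] = 1.0   # virtual node (index n) connects to all real nodes
A_aug[:n, n] = 1.0

# Now any two real nodes are at most 2 hops apart (via the virtual node).
```

In practice the virtual node would also need its own (e.g., constant) feature vector.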
Node Neighborhood Sampling
Use when the graph is too dense.
For example, a celebrity account in a social network has a huge number of followers; aggregating over every follower would be far too expensive.
Node neighborhood sampling randomly selects a subset of a node's neighbors for message passing.
Effect: trades away some information the dropped neighbors might have contributed in exchange for much less computation; allows for scaling to large graphs (graph scaling has not been covered yet).
To limit the information loss, sample a different set of neighbors at every layer/epoch; this also makes the model more robust. In expectation, we get embeddings similar to the case where all the neighbors are used.
In practice it works great.
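The per-layer sampling described above can be sketched as follows (illustrative names; real systems such as GraphSAGE-style loaders do this in batches):

```python
import random

def sample_neighbors(adj, v, k, rng=random):
    """Cap node v's neighbor list at k, drawing a fresh random sample on
    every call so that different layers/epochs see different neighbors."""
    nbrs = list(adj[v])
    if len(nbrs) <= k:
        return nbrs
    return rng.sample(nbrs, k)

# A "celebrity" node 0 with 5 followers.
adj = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
sampled = sample_neighbors(adj, 0, k=2)
print(len(sampled))   # 2 -- only 2 of node 0's 5 neighbors get aggregated
```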
Sample subgraphs to compute embeddings
This part has not been covered in the course yet.
The graph is too large
Training GNN
Prediction heads
Node-level
We can directly make prediction using node embeddings
Edge-level
Make prediction using pairs of node embeddings
$\hat{y}_{uv}=\mathrm{Head}_{edge}(h_u^{(L)}, h_v^{(L)})$
(1) Concatenation + Linear
$\hat{y}_{uv}=\mathrm{Linear}(\mathrm{Concat}(h_u^{(L)}, h_v^{(L)}))$
(2) Dot product
1-way prediction
$\hat{y}_{uv}=(h_u^{(L)})^T \cdot h_v^{(L)}$
The expression above can only make a 1-way prediction (a single yes/no answer), e.g., whether an edge exists.
k-way prediction
$\hat{y}_{uv}^{(1)}=(h_u^{(L)})^T W^{(1)} h_v^{(L)}$
$\cdots$
$\hat{y}_{uv}^{(k)}=(h_u^{(L)})^T W^{(k)} h_v^{(L)}$
$\hat{y}_{uv}=\mathrm{Concat}(\hat{y}_{uv}^{(1)}, \dots, \hat{y}_{uv}^{(k)})$
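A numpy sketch of the three edge-level heads above (random weights stand in for learned parameters; `d` is the embedding dimension, `k` the number of prediction ways):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 4, 3
h_u, h_v = rng.normal(size=d), rng.normal(size=d)   # final-layer embeddings

# (1) Concatenation + Linear: a learned map from the concatenated pair
W_lin = rng.normal(size=(1, 2 * d))
y_concat = W_lin @ np.concatenate([h_u, h_v])       # shape (1,)

# (2) Dot product: a single (1-way) score
y_dot = h_u @ h_v

# k-way dot product: one learnable bilinear form W^(i) per output dimension
W = rng.normal(size=(k, d, d))
y_kway = np.array([h_u @ W[i] @ h_v for i in range(k)])
print(y_kway.shape)   # (3,)
```

The k-way variant reduces to the plain dot product when each $W^{(i)}$ is the identity.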
Negative sampling and ranking
Besides the positive examples, training also samples some negative examples; the model is trained so that positives score high and negatives score low, and a ranking metric is used for evaluation.
Questions:
- This was only mentioned at the end of the first video of lecture 10, when link prediction with RGCN was covered, and I have not fully understood it. I don't understand step 4 ("calculate metrics") in the figure; is that part of the ranking evaluation?
- How do we sample negative samples? Do we need to use all possible ones?
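On the second question: a common answer (my sketch, not from the lecture) is not to enumerate all non-edges, but to draw a few random node pairs per positive edge and keep only those that are not real edges:

```python
import random

def sample_negative(edges, num_nodes, rng=random):
    """Rejection-sample one negative example: a uniformly random node
    pair that is not an existing (undirected) edge and not a self-loop."""
    edge_set = set(edges) | {(v, u) for u, v in edges}
    while True:
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u != v and (u, v) not in edge_set:
            return (u, v)

edges = [(0, 1), (1, 2), (2, 3)]
neg = sample_negative(edges, num_nodes=5, rng=random.Random(0))
print(neg)
```

Variants corrupt only one endpoint of a positive edge; the rejection loop is the same idea.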
Graph-level
Make prediction using all the node embeddings in our graph
$\hat{y}_{G}=\mathrm{Head}_{graph}(\{ h_v^{(L)}, \forall v \in G \})$
Global pooling
- Global mean pooling
$\hat{y}_{G}=\mathrm{Mean}(\{ h_v^{(L)}, \forall v \in G \})$
Use when the number of nodes in the graph should be ignored.
- Global max pooling
$\hat{y}_{G}=\mathrm{Max}(\{ h_v^{(L)}, \forall v \in G \})$
Question: how is the max over node embeddings defined? Per dimension? (Yes: the max is taken independently in each embedding dimension.)
- Global sum pooling
$\hat{y}_{G}=\mathrm{Sum}(\{ h_v^{(L)}, \forall v \in G \})$
Use when the number of nodes in the graph should be emphasized.
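All three pooling heads operate elementwise over the node axis, which also answers the max-pooling question above; a tiny numpy sketch:

```python
import numpy as np

# 3 node embeddings of dimension 2 (rows = nodes, columns = dimensions).
H = np.array([[1.0, 4.0],
              [3.0, 2.0],
              [2.0, 6.0]])

y_mean = H.mean(axis=0)  # [2., 4.]  -- invariant to node count
y_max  = H.max(axis=0)   # [3., 6.]  -- per-dimension max over nodes
y_sum  = H.sum(axis=0)   # [6., 12.] -- magnitude grows with node count
```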
Global pooling works great for small graphs
Drawback
Global pooling over a (large) graph will lose information
Hierarchical Global Pooling
- Use a community detection or graph partitioning algorithm to split the graph into clusters
- Aggregate within the smallest clusters first, then aggregate the higher-level clusters in turn
Question: how is this done in practice? See the paper Hierarchical Graph Representation Learning with Differentiable Pooling.
A brief sketch of one hierarchical global pooling method
At each level, use two independent GNNs:
- GNN A: Compute node embeddings
- GNN B: Compute the cluster that a node belongs to
For each Pooling layer, use clustering assignments from GNN B to aggregate node embeddings generated by GNN A
Create a single new node for each cluster, maintaining edges between clusters to generate a new pooled network
Jointly train GNN A and GNN B
GNNs A and B at each level can be executed in parallel
This is only the high-level idea; look up the details when it is actually needed.
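The pooling step itself can be sketched in a few lines of numpy, in the spirit of DiffPool (simplified; random values stand in for GNN outputs): GNN A produces node embeddings $Z$, GNN B produces a soft cluster assignment $S$, and each cluster becomes one node of the pooled graph.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 6, 4, 2                     # 6 nodes, dim-4 embeddings, 2 clusters
Z = rng.normal(size=(n, d))           # node embeddings (stand-in for GNN A)
logits = rng.normal(size=(n, c))      # cluster logits (stand-in for GNN B)
S = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row softmax

A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T        # symmetric adjacency, no self-loops

Z_pool = S.T @ Z                      # (c, d): one embedding per cluster
A_pool = S.T @ A @ S                  # (c, c): keeps edges between clusters
print(Z_pool.shape, A_pool.shape)     # (2, 4) (2, 2)
```

In the actual method, $S$ is trained jointly with the embedding GNN, and this step is repeated at every level of the hierarchy.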
Supervised vs Unsupervised (self-supervised)
Supervised
Labels come from external sources
E.g., predict drug likeness of a molecular graph
Supervised labels come from the specific use cases
Advice: Reduce your task to node / edge / graph labels, since they are easy to work with; these three task types already have a large body of prior work to build on.
E.g., we knew some nodes form a cluster. We can treat the cluster that a node belongs to as a node label
Unsupervised (self-supervised)
Signals come from graphs themselves
For example, we can let GNN predict the following:
- Node-level: Node statistics: such as clustering coefficient, PageRank, …
- Edge-level: Link prediction: hide the edge between two nodes, predict if there should be a link
- Graph-level: Graph statistics: for example, predict if two graphs are isomorphic
Both
Use both together
E.g., train a GNN to predict node clustering coefficient
Dataset Split
In image or text datasets the samples are independent, so the dataset can be split arbitrarily.
In graph data, nodes are related to one another, unlike images or text.
Two ways to split
1 Transductive setting
The graph structure is observable in all of the train/dev/test sets.
Only split the (node) labels: the train/dev/test sets all contain the entire set of nodes, but each uses the labels of a different subset of nodes.
Applicable to node / edge level tasks.
2 Inductive setting
Split the graph into multiple independent parts; train/dev/test each use their own part.
- The data may consist of many independent graphs; then simply assign them to the train/dev/test sets.
- The data may be a single graph; then first break it into independent subgraphs, and then form the train/dev/test sets from them.
- It may be split into exactly 3 subgraphs, one for each of train/dev/test.
- Or it may be split into many subgraphs, from which the train/dev/test sets are drawn.
Applicable to node / edge / graph level tasks.
This loses some edges.
Not suitable for small graphs: a small graph has little data to begin with, and removing edges discards even more; training on very little data is likely to overfit.
Question: how exactly do we partition the graph into parts?
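The two settings can be sketched for a node-level task as follows (illustrative names; real partitioning would use a graph partitioning algorithm rather than a random shuffle):

```python
import random

rng = random.Random(0)
nodes = list(range(10))
rng.shuffle(nodes)

# Transductive: one graph everywhere; only the LABELS are split.
train_labels, dev_labels, test_labels = nodes[:6], nodes[6:8], nodes[8:]

# Inductive: the graph itself is partitioned; each split keeps only the
# edges internal to its own node set (edges crossing splits are lost).
def induced_edges(edges, node_set):
    return [(u, v) for u, v in edges if u in node_set and v in node_set]

edges = [(0, 1), (1, 2), (2, 3), (7, 8), (8, 9)]
train_edges = induced_edges(edges, set(train_labels))
```

The dropped cross-split edges are exactly the information loss mentioned above.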
Example: Link Prediction
An unsupervised / self-supervised task:
hide some edges and let the GNN predict whether they exist
Steps
- Split all edges into 2 parts:
  - Message edges: the edges that are not hidden; used as the model's input
  - Supervision edges: the hidden edges to be predicted; used as the model's labels
- Split edges into train / validation / test, choosing one of the two split settings described above:
  - Inductive link prediction split: each subgraph in the train/dev/test sets contains both message edges and supervision edges
  - Transductive link prediction split: the default setting for link prediction
As shown in the figure, the set of edges used grows from training to dev to test, so all edges are split into 4 parts:
(1) Training message edges
(2) Training supervision edges
(3) Validation edges
(4) Test edges
Why do we use growing number of edges?
After training, the supervision edges are known to the GNN, so an ideal model should use them for message passing at validation time. The same applies at test time.
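The four-way transductive split and its cumulative use of edges can be sketched as:

```python
import random

# A small complete graph on 8 nodes: C(8,2) = 28 undirected edges.
rng = random.Random(0)
edges = [(u, v) for u in range(8) for v in range(u + 1, 8)]
rng.shuffle(edges)

train_msg  = edges[:10]    # (1) training message edges
train_sup  = edges[10:16]  # (2) training supervision edges
val_edges  = edges[16:22]  # (3) validation edges
test_edges = edges[22:]    # (4) test edges

# At validation time, pass messages over (1)+(2) and predict (3);
# at test time, pass messages over (1)+(2)+(3) and predict (4).
val_msg  = train_msg + train_sup
test_msg = train_msg + train_sup + val_edges
```

The split sizes here are arbitrary; the point is that the four groups are disjoint while the message-edge set grows at each stage.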