GraphSage: An Introduction to the Algorithm and a Brief Walkthrough of the Source Code
Preface
I have recently been working on graph-related projects again. I did this kind of work two years ago, and somehow I have come full circle back to where I started ~🤣🤣🤣 While that work moves forward steadily, I also plan to review the fundamental algorithms. There are simply too many papers; after not reading for a while, I can no longer keep up 🤣🤣🤣
A quick aside: some of my earlier blog posts analyzed code in far too much detail. When I look back at them myself, I skip straight past the trivial parts. This tells me that those posts recorded too much redundant content, which not only wasted time while writing but also made later reference harder. Given that, future code analyses will try to cover only the core parts of the source, plus whatever extra bits I find interesting.
A Quick Plug
You can search for "珍妮的算法之路" or "world4458" on WeChat to follow my official account. You can also check out my Zhihu column PoorMemory-机器学习, where future articles will be posted as well.
Article Information
- Paper title: Inductive Representation Learning on Large Graphs
- Paper link: https://arxiv.org/pdf/1706.02216.pdf
- Code: https://github.com/williamleif/GraphSAGE
- Published in: NIPS 2017
- Authors: William L. Hamilton, Rex Ying, Jure Leskovec
- Affiliation: Stanford
Note: the article 全面理解 PinSage covers the PinSage algorithm in detail. GraphSage is the theoretical foundation of PinSage, while PinSage adds a lot of engineering practice on top of it, so the two are worth reading together.
Core Idea
GraphSage (Graph SAmple and aggreGatE) is an inductive learning algorithm: it learns aggregation functions that generate a target node's own embedding by aggregating the feature information of the node's neighbors. As the name suggests, the core steps of the algorithm are neighbor sampling and feature aggregation. GraphSage fits the usual machine learning setting and generalizes to unseen nodes. This distinguishes it from transductive learning algorithms (such as GCN and DeepWalk, which learn node embeddings on a fixed graph structure): when new nodes are added to the graph, a transductive model has to be retrained.
Core Idea Explained
Before introducing the algorithm itself, let us briefly compare inductive learning with transductive learning. For a detailed introduction, I recommend the article Inductive vs. Transductive Learning. Quoting from it:
Inductive learning is the same as what we commonly know as traditional supervised learning. We build and train a machine learning model based on a labelled training dataset we already have. Then we use this trained model to predict the labels of a testing dataset which we have never encountered before.
In contrast to inductive learning, transductive learning techniques have observed all the data beforehand, both the training and testing datasets. We learn from the already observed training dataset and then predict the labels of the testing dataset. Even though we do not know the labels of the testing datasets, we can make use of the patterns and additional information present in this data during the learning process.
The main difference is that during transductive learning, you have already encountered both the training and testing datasets when training the model. However, inductive learning encounters only the training data when training the model and applies the learned model on a dataset which it has never seen before.
Transduction does not build a predictive model. If a new data point is added to the testing dataset, then we will have to re-run the algorithm from the beginning, train the model and then use it to predict the labels. On the other hand, inductive learning builds a predictive model. When you encounter new data points, there is no need to re-run the algorithm from the beginning.
(In Chinese, Inductive Learning is commonly translated as 归纳式学习 and Transductive Learning as 直推式学习. Honestly, these two translations have always confused me and I could never remember them, whereas the English explanations above are easy to remember 🤣🤣🤣)
GraphSage is an inductive learning algorithm: it learns aggregation functions that generate a target node's own embedding by aggregating the feature information of its neighbors. Its main steps are written right into its name: Sample and Aggregate. The Sample stage obtains multi-hop neighbors through random sampling; the Aggregate stage aggregates the neighbors' features to generate the target node's own embedding. Taking 2-hop aggregation as an example, the algorithm first aggregates the features of the 2-hop neighbors to generate the embeddings of the 1-hop neighbors, and then aggregates the 1-hop neighbors' embeddings to generate the embedding of the node itself. Since the 1-hop embeddings already contain information from the 2-hop neighbors, the target node thereby also receives the feature information of its 2-hop neighborhood. The figure in the paper illustrates this process nicely:
Once the target node's embedding has been generated, it can be fed to downstream machine learning systems for prediction tasks such as node classification.
GraphSage's forward propagation algorithm is shown in the figure below:
The first for loop iterates over the layers (depth), and the second for loop iterates over all nodes in the graph. For each node $v$, the algorithm samples its neighbors to obtain $\mathcal{N}(v)$, aggregates the neighbors' embeddings with $\text{AGGREGATE}_k(\cdot)$ to get $\mathbf{h}_{\mathcal{N}(v)}^{k}$, concatenates this with the node's current embedding $\mathbf{h}_{v}^{k-1}$, and applies a non-linear transformation, assigning the result to $\mathbf{h}_{v}^{k}$; this completes one update of the target node $v$. When the outer for loop ($k = 1 \ldots K$) finishes, node $v$ has aggregated information from its $K$-hop neighborhood.
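The two update steps inside the inner loop (Algorithm 1 in the paper) are:

$$
\mathbf{h}_{\mathcal{N}(v)}^{k} \leftarrow \text{AGGREGATE}_k\big(\{\mathbf{h}_{u}^{k-1}, \forall u \in \mathcal{N}(v)\}\big), \qquad
\mathbf{h}_{v}^{k} \leftarrow \sigma\big(\mathbf{W}^{k} \cdot \text{CONCAT}(\mathbf{h}_{v}^{k-1}, \mathbf{h}_{\mathcal{N}(v)}^{k})\big),
$$

followed by the normalization $\mathbf{h}_{v}^{k} \leftarrow \mathbf{h}_{v}^{k} / \lVert\mathbf{h}_{v}^{k}\rVert_2$ at the end of each outer iteration.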
In the actual implementation, training is done in minibatches, which is described in Appendix A of the paper and will also come up in the source code analysis below.
Source Code Analysis
The code analyzed here is the official TensorFlow implementation at https://github.com/williamleif/GraphSAGE.
The core of GraphSage lies in Sample and Aggregate. Since models are usually trained in minibatches, Appendix A of the paper also gives a minibatch version of the pseudocode, shown below:
Lines 1 ~ 7 of the pseudocode perform neighbor sampling, and lines 8 ~ 15 perform neighbor aggregation.
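The sampling half can be summarized by the following recursion (this is just a restatement of the pseudocode: starting from the minibatch $\mathcal{B}$, each step adds the sampled neighbors of the current node set):

$$
\mathcal{B}^{K} \leftarrow \mathcal{B}, \qquad
\mathcal{B}^{k-1} \leftarrow \mathcal{B}^{k} \cup \bigcup_{u \in \mathcal{B}^{k}} \mathcal{N}_{k}(u), \quad k = K, \ldots, 1,
$$

so that $\mathcal{B}^{0}$ contains every node whose input features are needed; the aggregation half then computes $\mathbf{h}_{u}^{k}$ only for the nodes $u \in \mathcal{B}^{k}$, for $k = 1, \ldots, K$.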
In the GraphSage code, both the neighbor sampling and the aggregation logic live in https://github.com/williamleif/GraphSAGE/blob/master/graphsage/models.py. Before walking through it, one potentially confusing point needs to be explained. The author configures the number of nodes sampled at each layer of the graph as follows:
flags.DEFINE_integer('samples_1', 25, 'number of samples in layer 1')
flags.DEFINE_integer('samples_2', 10, 'number of samples in layer 2')
What these flags actually mean is shown in the figure below:
Note that in the figure, the number of neighbors sampled at layer 1 is 10, while at layer 2 it is 25, which is exactly the opposite of the definitions in the code. The author explains this point in Appendix A, right below the pseudocode. Also note the first line of the pseudocode above, $\mathcal{B}^{K} \leftarrow \mathcal{B}$: the target nodes are assigned to $\mathcal{B}^{K}$ at the very beginning, sampling traverses the layers in the order $k = K, \ldots, 1$ (line 2 of the pseudocode), whereas aggregation traverses them in the order $k = 1, \ldots, K$ (line 9 of the pseudocode).
If you do not keep this in mind while reading the source code, it is easy to get confused. For convenience, I will use concrete numbers such as 10 and 25 below when explaining the code, which makes it easy to tell which layer of the graph we are currently in.
Neighbor Sampling
The neighbor sampling code of GraphSage is defined as follows:
def sample(self, inputs, layer_infos, batch_size=None):
    """ Sample neighbors to be the supportive fields for multi-layer convolutions.
    Args:
        inputs: batch inputs
        batch_size: the number of inputs (different for batch inputs and negative samples).
    """
    if batch_size is None:
        batch_size = self.batch_size
    samples = [inputs]
    # size of convolution support at each layer per node
    support_size = 1
    support_sizes = [support_size]
    for k in range(len(layer_infos)):
        t = len(layer_infos) - k - 1
        support_size *= layer_infos[t].num_samples
        sampler = layer_infos[t].neigh_sampler
        node = sampler((samples[k], layer_infos[t].num_samples))
        samples.append(tf.reshape(node, [support_size * batch_size,]))
        support_sizes.append(support_size)
    return samples, support_sizes
When reading this, there is no need to dwell on the implementation details (such as the relationship between $k$ and $t$); once the principle is clear, it is easy to write yourself. The function takes:
- inputs: a Tensor of shape [B,] holding the IDs of the target nodes;
- layer_infos: if the depth of the sampled graph is $K$ (the target nodes plus their multi-hop neighbors), layer_infos has length $K - 1$ and stores the information for each layer of the graph, such as the number of neighbors to sample num_samples and the sampling method neigh_sampler.

Since sampling starts from the target nodes, after it finishes:
- samples holds 3 Tensors with sizes [Tensor(B*1,), Tensor(B*10,), Tensor(B*250,)], the node ids at each layer of the graph;
- support_sizes is [1, 10, 250], the number of neighbors each target node has at each layer of the graph.
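To make these shapes concrete, here is a tiny self-contained mock of the same sampling logic (an illustrative NumPy sketch, not code from the repository; the padded adjacency matrix and the shared shuffled columns mimic the uniform sampler described in the next subsection):

import numpy as np

# Illustrative NumPy mock of the sampling pipeline (not code from the repo):
# every node has a padded adjacency list of length max_degree, and each hop
# uniformly samples a fixed number of neighbors for every node of the previous
# layer, so the flattened id arrays grow as B -> B*10 -> B*250.
rng = np.random.default_rng(0)
num_nodes, max_degree = 1000, 128
adj = rng.integers(0, num_nodes, size=(num_nodes, max_degree))   # toy adjacency lists

def uniform_sample(ids, num_samples):
    neigh = adj[ids]                                   # [len(ids), max_degree]
    cols = rng.permutation(max_degree)[:num_samples]   # pick num_samples random columns
    return neigh[:, cols].reshape(-1)                  # flatten, like tf.reshape above

batch = rng.integers(0, num_nodes, size=512)           # B = 512 target node ids
samples, support_sizes = [batch], [1]
for num_samples in (10, 25):                           # reversed order: samples_2, then samples_1
    samples.append(uniform_sample(samples[-1], num_samples))
    support_sizes.append(support_sizes[-1] * num_samples)

print([s.shape for s in samples], support_sizes)       # [(512,), (5120,), (128000,)] [1, 10, 250]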
Sampling Method
The sampling method sampler is defined in https://github.com/williamleif/GraphSAGE/blob/master/graphsage/neigh_samplers.py, with the following code:
class UniformNeighborSampler(Layer):
    """
    Uniformly samples neighbors.
    Assumes that adj lists are padded with random re-sampling
    """
    def __init__(self, adj_info, **kwargs):
        super(UniformNeighborSampler, self).__init__(**kwargs)
        ## Suppose the graph has N nodes and its maximum out-degree is max_degree;
        ## then adj_info has shape [N + 1, max_degree]. It is the adjacency matrix,
        ## recording the node_id of every node's neighbors.
        self.adj_info = adj_info

    def _call(self, inputs):
        ## ids has shape [B,]
        ids, num_samples = inputs
        ## Look up the neighbors of ids; adj_lists has shape [B, max_degree]
        adj_lists = tf.nn.embedding_lookup(self.adj_info, ids)
        ## Sample the neighbors. Since tf.random_shuffle shuffles along axis=0,
        ## adj_lists is first transposed, shuffled, and then transposed back.
        ## Finally tf.slice keeps num_samples neighbors.
        adj_lists = tf.transpose(tf.random_shuffle(tf.transpose(adj_lists)))
        adj_lists = tf.slice(adj_lists, [0, 0], [-1, num_samples])
        return adj_lists
It is called with inputs, which contains the target node ids and the number of neighbors to sample num_samples, and it returns a Tensor of shape [B, num_samples]. Also note that building the adjacency matrix involves both sampling with replacement and sampling without replacement; see the construct_adj function in the author's source code (https://github.com/williamleif/GraphSAGE/blob/master/graphsage/minibatch.py#L76) for details, which are not covered here.
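The gist of that padding logic can be sketched as follows (a simplified NumPy sketch of the idea only, not the repository's construct_adj, which contains additional bookkeeping):

import numpy as np

# Simplified sketch of how a padded adjacency matrix can be built:
# nodes with more than max_degree neighbors are down-sampled without replacement,
# nodes with fewer are padded by re-sampling with replacement.
def build_padded_adj(neighbors_per_node, num_nodes, max_degree):
    # Row num_nodes acts as a dummy "no neighbor" entry, hence the N + 1 rows.
    adj = num_nodes * np.ones((num_nodes + 1, max_degree), dtype=np.int64)
    for node, neighbors in neighbors_per_node.items():
        if len(neighbors) == 0:
            continue
        replace = len(neighbors) < max_degree          # pad short lists by re-sampling
        adj[node, :] = np.random.choice(neighbors, max_degree, replace=replace)
    return adj

adj = build_padded_adj({0: [1, 2], 1: [0], 2: [0, 1]}, num_nodes=3, max_degree=4)
print(adj.shape)   # (4, 4)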
Neighbor Aggregation
The neighbor aggregation code is also located in https://github.com/williamleif/GraphSAGE/blob/master/graphsage/models.py and is defined as follows (only the core code is kept):
def aggregate(self, samples, input_features, dims, num_samples, support_sizes, batch_size=None,
              aggregators=None, name=None, concat=False, model_size="small"):
    ## hidden is [Tensor(B*1, E), Tensor(B*10, E), Tensor(B*250, E)]
    hidden = [tf.nn.embedding_lookup(input_features, node_samples) for node_samples in samples]
    for layer in range(len(num_samples)):
        ## ...... (the per-layer aggregator is created or fetched here)
        # hidden representation at current layer for all support nodes that are various hops away
        next_hidden = []
        # as layer increases, the number of support nodes needed decreases
        for hop in range(len(num_samples) - layer):
            dim_mult = 2 if concat and (layer != 0) else 1
            neigh_dims = [batch_size * support_sizes[hop],
                          num_samples[len(num_samples) - hop - 1],
                          dim_mult * dims[layer]]
            h = aggregator((hidden[hop],
                            tf.reshape(hidden[hop + 1], neigh_dims)))
            next_hidden.append(h)
        hidden = next_hidden
    return hidden[0], aggregators
As mentioned earlier, the author samples the layers in the order $k = K, \ldots, 1$ (line 2 of the pseudocode) but aggregates them in the order $k = 1, \ldots, K$, which makes this code a little convoluted. It is much easier to follow together with the illustration below.
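For the 2-layer case discussed above (B target nodes, 10 and 25 neighbors sampled per hop, input embedding size E), the double loop works through the following shapes; this is my own trace of the code, written as comments, with E1 and E2 denoting the per-layer output sizes and ignoring the doubling caused by concat=True:

# hidden before any layer:  [ (B, E), (B*10, E), (B*250, E) ]
#
# layer 0:
#   hop 0: aggregator( self = hidden[0]: (B, E),     neigh = hidden[1] reshaped to (B, 10, E)    ) -> (B, E1)
#   hop 1: aggregator( self = hidden[1]: (B*10, E),  neigh = hidden[2] reshaped to (B*10, 25, E) ) -> (B*10, E1)
#   hidden becomes [ (B, E1), (B*10, E1) ]
#
# layer 1:
#   hop 0: aggregator( self = hidden[0]: (B, E1),    neigh = hidden[1] reshaped to (B, 10, E1)   ) -> (B, E2)
#   hidden becomes [ (B, E2) ]  -> hidden[0] is returned as the final embeddings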
Aggregators
The aggregators are mainly defined in https://github.com/williamleif/GraphSAGE/blob/master/graphsage/aggregators.py
MeanAggregator
It is defined as follows:
class MeanAggregator(Layer):
    """
    Aggregates via mean followed by matmul and non-linearity.
    """
    def __init__(self, input_dim, output_dim, neigh_input_dim=None,
                 dropout=0., bias=False, act=tf.nn.relu,
                 name=None, concat=False, **kwargs):
        super(MeanAggregator, self).__init__(**kwargs)
        ## ......

    def _call(self, inputs):
        ## self_vecs: [B, E]
        ## neigh_vecs: [B, H, E], where H is the number of neighbors
        self_vecs, neigh_vecs = inputs
        neigh_vecs = tf.nn.dropout(neigh_vecs, 1-self.dropout)
        self_vecs = tf.nn.dropout(self_vecs, 1-self.dropout)
        neigh_means = tf.reduce_mean(neigh_vecs, axis=1)
        # [nodes] x [out_dim]
        from_neighs = tf.matmul(neigh_means, self.vars['neigh_weights'])
        from_self = tf.matmul(self_vecs, self.vars["self_weights"])
        if not self.concat:
            output = tf.add_n([from_self, from_neighs])
        else:
            output = tf.concat([from_self, from_neighs], axis=1)
        # bias
        if self.bias:
            output += self.vars['bias']
        return self.act(output)
The neighbor embeddings neigh_vecs are mean-pooled and linearly transformed, and the result is then added to, or concatenated with, the (also linearly transformed) embedding of the target node itself.
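In formula form, this aggregator computes roughly the following (with $\mathbf{W}_{self}$ and $\mathbf{W}_{neigh}$ corresponding to self.vars["self_weights"] and self.vars['neigh_weights'], and $\sigma$ the activation act):

$$
\mathbf{h}_v^{out} = \sigma\Big(\mathbf{W}_{self}\,\mathbf{h}_{v} + \mathbf{W}_{neigh}\cdot \mathrm{mean}\big(\{\mathbf{h}_{u} : u \in \mathcal{N}(v)\}\big)\Big),
$$

where $\mathbf{h}_v$ and $\mathbf{h}_u$ are the input embeddings; when concat=True the two terms are concatenated instead of summed.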
GCNAggregator
class GCNAggregator(Layer):
    """
    Aggregates via mean followed by matmul and non-linearity.
    Same matmul parameters are used self vector and neighbor vectors.
    """
    def __init__(self, input_dim, output_dim, neigh_input_dim=None,
                 dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
        super(GCNAggregator, self).__init__(**kwargs)
        ## ......

    def _call(self, inputs):
        ## self_vecs: [B, E]
        ## neigh_vecs: [B, H, E], where H is the number of neighbors
        self_vecs, neigh_vecs = inputs
        neigh_vecs = tf.nn.dropout(neigh_vecs, 1-self.dropout)
        self_vecs = tf.nn.dropout(self_vecs, 1-self.dropout)
        means = tf.reduce_mean(tf.concat([neigh_vecs,
            tf.expand_dims(self_vecs, axis=1)], axis=1), axis=1)
        # [nodes] x [out_dim]
        output = tf.matmul(means, self.vars['weights'])
        # bias
        if self.bias:
            output += self.vars['bias']
        return self.act(output)
It first uses tf.expand_dims(self_vecs, axis=1) to expand the target node embeddings to shape [B, 1, E], concatenates them with neigh_vecs, and then takes the mean over the whole set.
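This corresponds to the GCN-style aggregator, where a single weight matrix self.vars['weights'] is shared between the node itself and its neighbors:

$$
\mathbf{h}_v^{out} = \sigma\Big(\mathbf{W}\cdot \mathrm{mean}\big(\{\mathbf{h}_{v}\} \cup \{\mathbf{h}_{u} : u \in \mathcal{N}(v)\}\big)\Big)
$$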
MaxPoolingAggregator
class MaxPoolingAggregator(Layer):
    """ Aggregates via max-pooling over MLP functions.
    """
    def __init__(self, input_dim, output_dim, model_size="small", neigh_input_dim=None,
                 dropout=0., bias=False, act=tf.nn.relu, name=None, concat=False, **kwargs):
        super(MaxPoolingAggregator, self).__init__(**kwargs)
        ## ............

    def _call(self, inputs):
        self_vecs, neigh_vecs = inputs
        neigh_h = neigh_vecs
        dims = tf.shape(neigh_h)
        batch_size = dims[0]
        num_neighbors = dims[1]
        # [nodes * sampled neighbors] x [hidden_dim]
        h_reshaped = tf.reshape(neigh_h, (batch_size * num_neighbors, self.neigh_input_dim))
        for l in self.mlp_layers:
            h_reshaped = l(h_reshaped)
        neigh_h = tf.reshape(h_reshaped, (batch_size, num_neighbors, self.hidden_dim))
        neigh_h = tf.reduce_max(neigh_h, axis=1)
        from_neighs = tf.matmul(neigh_h, self.vars['neigh_weights'])
        from_self = tf.matmul(self_vecs, self.vars["self_weights"])
        if not self.concat:
            output = tf.add_n([from_self, from_neighs])
        else:
            output = tf.concat([from_self, from_neighs], axis=1)
        # bias
        if self.bias:
            output += self.vars['bias']
        return self.act(output)
After passing each neighbor embedding through the MLP layers, the code aggregates the neighbor embeddings with neigh_h = tf.reduce_max(neigh_h, axis=1).
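This is the pooling aggregator from the paper: each neighbor embedding is fed through a fully-connected layer before an element-wise max is taken,

$$
\text{AGGREGATE}^{pool}_k = \max\big(\{\sigma(\mathbf{W}_{pool}\,\mathbf{h}_{u_i}^{k} + \mathbf{b}),\ \forall u_i \in \mathcal{N}(v)\}\big),
$$

where in the code the MLP layers self.mlp_layers play the role of $\sigma(\mathbf{W}_{pool}\cdot{}+\mathbf{b})$, and the pooled result is further multiplied by neigh_weights before being combined with the self vector.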
The remaining aggregators, TwoMaxLayerPoolingAggregator, MeanPoolingAggregator, and SeqAggregator (an LSTM aggregator), are not analyzed here; I will look at them later if the need arises.
Summary
Happy National Day~