CS224W Notes 12. Frequent Subgraph Mining with GNNs

oldmao_2000

Category: CS224W (complete). Tags: Motif, subgraph, GNN, graph isomorphism

Article link: https://blog.csdn.net/oldmao_2001/article/details/120567527

Contents

Subgraph and Motifs
Neural Subgraph Matching / Representations
Mining / Finding Frequent Motifs / Subgraphs

CS224W: Machine Learning with Graphs
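This lecture has three parts:
The concept and definition of subgraphs, and measures of their significance.
Subgraph representations: how to embed subgraphs with a GNN.
Mining frequent subgraphs.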

Subgraph and Motifs

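First we need a clear notion of a subgraph. Subgraphs are like LEGO building blocks: the pieces should be neither too small nor too large, and the question is how to identify such building blocks in a graph.

Definition of Subgraphs

Given a graph $G = (V, E)$, a subgraph can be defined in two ways: node-induced or edge-induced.
Which one to use depends on the application, for example:
Chemistry: node-induced (functional groups)
Knowledge graphs: often edge-induced (the focus is on edges representing logical relations)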

Node-induced subgraph

Take subset of the nodes and all edges induced by the nodes:
$G' = (V', E')$ is a node-induced subgraph iff
$V' \subseteq V$
$E' = \{(u, v) \in E \mid u, v \in V'\}$
$G'$ is the subgraph of $G$ induced by $V'$
This is also called simply an "induced subgraph": the subgraph is fully determined by its node set.

Edge-induced subgraph

Take subset of the edges and all corresponding nodes
$G' = (V', E')$ is an edge-induced subgraph iff
$E' \subseteq E$
$V' = \{v \in V \mid (v, u) \in E' \text{ for some } u\}$
Also called a "non-induced subgraph" or just "subgraph"; this definition is used less often.
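The two definitions can be sketched in a few lines of Python. This is only an illustration: the function names and the toy graph (a triangle a-b-c with a pendant node d) are made up for this note, and edges are stored as plain tuples.

```python
def node_induced_subgraph(edges, nodes):
    """Return (V', E'): E' keeps every edge whose two endpoints are both in V'."""
    v_sub = set(nodes)
    e_sub = {(u, v) for (u, v) in edges if u in v_sub and v in v_sub}
    return v_sub, e_sub


def edge_induced_subgraph(edges, chosen_edges):
    """Return (V', E'): V' is every endpoint of a chosen edge."""
    e_sub = set(chosen_edges) & set(edges)
    v_sub = {x for edge in e_sub for x in edge}
    return v_sub, e_sub


# Toy graph (hypothetical): triangle a-b-c plus a pendant node d attached to c.
EDGES = {("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")}

triangle = node_induced_subgraph(EDGES, {"a", "b", "c"})
path = edge_induced_subgraph(EDGES, {("a", "b"), ("b", "c")})
```

Both calls end up with the node set {a, b, c}, but the node-induced version must include the edge a-c, while the edge-induced version keeps only the two chosen edges.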

contained subgraph

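Both definitions above take the subgraph from within the original graph. What if the nodes or edges come from a different graph?
[Figure]
In that case we say one graph is contained in the other. The key to deciding containment is deciding whether two graphs are isomorphic.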

Graph isomorphism

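Because nodes have no canonical order, checking isomorphism is hard: graph isomorphism is not known to be NP-hard, and no polynomial-time algorithm for it is known either. The definition is as follows.
Two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic if there exists a bijection $f: V_1 \to V_2$ such that
$(u,v)\in E_1 \text{ iff }(f(u),f(v))\in E_2$
$f$ is then called the isomorphism.
[Figure]
Subgraph isomorphism is equivalent to the containment relation above, and it is an NP-hard problem.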

$G_2$ is subgraph-isomorphic to $G_1$ if some subgraph of $G_2$ is isomorphic to $G_1$ .
[Figure]
Note that the mapping $f$ is not unique: the node order can be permuted without affecting the result.
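To make the definition concrete, here is a brute-force isomorphism check that simply tries every bijection. It is a sketch for tiny directed graphs only (the running time is factorial in the number of nodes), and `find_isomorphism` is a name invented for this note.

```python
from itertools import permutations


def find_isomorphism(nodes1, edges1, nodes2, edges2):
    """Return one bijection f (as a dict) with (u,v) in E1 iff (f(u),f(v)) in E2, else None."""
    nodes1, nodes2 = list(nodes1), list(nodes2)
    e1, e2 = set(edges1), set(edges2)
    if len(nodes1) != len(nodes2) or len(e1) != len(e2):
        return None
    for perm in permutations(nodes2):
        f = dict(zip(nodes1, perm))
        # f is a bijection, so equal image sets means the "iff" condition holds.
        if {(f[u], f[v]) for (u, v) in e1} == e2:
            return f
    return None


# A directed path 1 -> 2 -> 3 matches x -> y -> z but not the out-star x -> y, x -> z.
path_map = find_isomorphism([1, 2, 3], [(1, 2), (2, 3)],
                            ["x", "y", "z"], [("x", "y"), ("y", "z")])
star_map = find_isomorphism([1, 2, 3], [(1, 2), (2, 3)],
                            ["x", "y", "z"], [("x", "y"), ("x", "z")])
```

For graphs with symmetries several bijections are valid, and the function returns whichever one it finds first.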

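Next, examples of all non-isomorphic connected subgraphs of a given size. Directed, 3 nodes (13 graphs):
[Figure]
Undirected, 4 nodes: there are 6 such graphs.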

Determining Motif Significance

Building on this, a graph can be viewed as composed of many such small subgraphs; the small subgraphs that make up the structure of a graph are its motifs.

Network motifs

Definition: recurring, significant patterns of interconnections.
A real-life example (image from Baidu):
[Figure]
This gives the three defining properties of a motif:
Pattern: Small (node-induced) subgraph
Recurring: Found many times, i.e., with high frequency
Significant: More frequent than expected, i.e., than in randomly generated graphs

An example of matching a motif ("induced" means node-induced):
[Figure]

[Figure]

Why Motifs Matter

Help us understand how graphs work.
Help us make predictions based on presence or lack of presence in a graph dataset. Graphs from a given domain tend to contain characteristic motifs; social networks, for example, are rich in triangles.
Examples:
Feed-forward loops: found in networks of neurons, where they neutralize "biological noise"
[Figure]
Parallel loops: found in food webs

Single-input modules: found in gene control networks

subgraph frequency

$G_Q$ is the small query graph, $v$ is an anchor node in $G_Q$, and $G_T$ is the target graph.
Graph-level Subgraph Frequency Definition:
The frequency of $G_Q$ in $G_T$ is the number of unique subsets of nodes $V_T$ of $G_T$ for which the subgraph of $G_T$ induced by the nodes $V_T$ is isomorphic to $G_Q$.
Example:
[Figure]

Node-level Subgraph Frequency Definition:
The number of nodes $u$ in $G_T$ for which some subgraph of $G_T$ is isomorphic to $G_Q$ and the isomorphism maps $u$ to $v$.
We call $(G_Q, v)$ a node-anchored subgraph. This definition is more robust to outliers.
Example:
[Figure]

Here the center node is counted only once.
If the dataset contains multiple graphs, they can be treated as disconnected components of one large graph $G_T$.
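The difference between the two frequency definitions shows up clearly on a small star graph. The brute-force sketch below is for tiny undirected graphs only; the function names are invented for this note, the graph-level count uses node-induced subgraphs as in the definition, and the node-level count assumes that "some subgraph" means every query edge must map onto a target edge (a non-induced match).

```python
from itertools import combinations, permutations


def _undirected(edges):
    return {frozenset(e) for e in edges}


def graph_level_frequency(q_nodes, q_edges, t_nodes, t_edges):
    """Count node subsets of the target whose induced subgraph is isomorphic to the query."""
    q_nodes, qe, te = list(q_nodes), _undirected(q_edges), _undirected(t_edges)
    count = 0
    for subset in combinations(list(t_nodes), len(q_nodes)):
        chosen = set(subset)
        induced = {e for e in te if e <= chosen}
        for perm in permutations(subset):
            f = dict(zip(q_nodes, perm))
            if {frozenset(f[x] for x in e) for e in qe} == induced:
                count += 1
                break
    return count


def node_level_frequency(q_nodes, q_edges, anchor, t_nodes, t_edges):
    """Count target nodes u onto which the anchor can be mapped by some match."""
    others = [x for x in q_nodes if x != anchor]
    qe, te, t_nodes = _undirected(q_edges), _undirected(t_edges), list(t_nodes)
    count = 0
    for u in t_nodes:
        rest = [x for x in t_nodes if x != u]
        for perm in permutations(rest, len(others)):
            f = dict(zip(others, perm))
            f[anchor] = u
            if all(frozenset(f[x] for x in e) in te for e in qe):
                count += 1
                break
    return count


# Target: a star with center "c" and 4 leaves. Query: a center "v" with 2 leaves.
T_NODES = ["c", "l1", "l2", "l3", "l4"]
T_EDGES = [("c", leaf) for leaf in T_NODES[1:]]
Q_NODES, Q_EDGES = ["v", "a", "b"], [("v", "a"), ("v", "b")]

graph_freq = graph_level_frequency(Q_NODES, Q_EDGES, T_NODES, T_EDGES)
node_freq = node_level_frequency(Q_NODES, Q_EDGES, "v", T_NODES, T_EDGES)
```

The graph-level count grows with the number of leaf pairs (6 here), while the node-level count anchored at the center is 1, since only the star's center can play the role of the anchor.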

Motif significance

The key idea: subgraphs that occur in a real-world network much more often than in random graphs are likely to have functional significance.
[Figure]

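Random graph generation 1: ER

An Erdős–Rényi (ER) random graph is written $G_{n,p}$: an undirected graph on $n$ nodes in which each edge $(u, v)$ appears independently with probability $p$. For example:
[Figure]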
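A minimal sketch of sampling from $G_{n,p}$ with the standard library. The function name and the `seed` parameter are choices made for this note, not part of the lecture.

```python
import random
from itertools import combinations


def erdos_renyi(n, p, seed=None):
    """Sample G(n, p): keep each of the n*(n-1)/2 possible edges with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


sample = erdos_renyi(n=6, p=0.5, seed=0)
```

The same $n$ and $p$ give a different graph on every draw; only the expected number of edges, $p \cdot n(n-1)/2$, is fixed.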

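Random graph generation 2: Configuration model

The lecture's flow here feels a bit choppy, with many new terms introduced at once.
The goal of the configuration model is to generate a random graph with a given degree sequence $k_1,k_2,\cdots,k_N$. It serves as a null model for a network: we compare the real graph $G^{real}$ with a random graph $G^{rand}$ that has the same degree sequence.

Null model: a random graph that has the same number of nodes, number of edges, and degree distribution as the real graph.
Below is a configuration model example. The nodes have degrees 3, 4, 2, and 1, drawn as "spokes" in the figure:
[Figure]

Then the spokes are paired up at random:
[Figure]
This yields the generated graph. Duplicate edges are ignored (above, A-B is paired twice but the result contains only one A-B edge), and self-loops are ignored as well.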
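The spoke-pairing procedure can be sketched as follows. The node names A to D with degrees 3, 4, 2, 1 follow the example above; the function name and the choice to raise an error on an odd degree sum are assumptions of this sketch.

```python
import random


def configuration_model(degrees, seed=None):
    """degrees: dict node -> degree. Returns a set of undirected edges (frozensets)."""
    if sum(degrees.values()) % 2:
        raise ValueError("the degree sum must be even to pair all spokes")
    rng = random.Random(seed)
    # One "spoke" (half-edge) per unit of degree.
    spokes = [node for node, k in degrees.items() for _ in range(k)]
    rng.shuffle(spokes)
    edges = set()
    for a, b in zip(spokes[::2], spokes[1::2]):
        if a != b:                        # ignore self-loops
            edges.add(frozenset((a, b)))  # the set silently ignores duplicate edges
    return edges


DEGREES = {"A": 3, "B": 4, "C": 2, "D": 1}
random_graph = configuration_model(DEGREES, seed=0)
```

Because duplicates and self-loops are dropped, a node's realized degree can end up slightly below its prescribed degree; it can never exceed it.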

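Random graph generation 3: Switching

Switching is another way to generate a random graph.
1. Start from a given graph $G$
[Figure]

2. Repeat the following step $Q\cdot |E|$ times:
2.1 Select two edges at random, e.g. A-B and C-D
[Figure]
Exchange the endpoints of the two edges to get A-D and C-B, but only if this creates no duplicate edges or self-loops.

This randomly rewires the edges of the graph while preserving every node's degree.
$Q$ is chosen large enough for the process to converge, typically $Q = 100$.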
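A sketch of the switching procedure, assuming the input is a simple undirected graph with at least two edges. The function name and the small example graph are invented for this note.

```python
import random
from collections import Counter


def switch_edges(edges, q=100, seed=None):
    """Rewire a simple undirected graph with Q*|E| switching attempts, preserving degrees."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    for _ in range(q * len(edge_list)):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if a == d or c == b:
            continue                      # the swap would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present or new1 == new2:
            continue                      # the swap would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


ORIGINAL = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("A", "C")]
rewired = switch_edges(ORIGINAL, q=100, seed=0)

degree_before = Counter(x for e in ORIGINAL for x in e)
degree_after = Counter(x for e in rewired for x in e)
```

Each accepted swap removes one edge at each of the four endpoints and adds one back, which is why `degree_before` and `degree_after` agree.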

Motif significance 计算步骤

Motifs are over-represented in a real graph compared with random graphs, so the idea is to compare motif counts in the real graph against counts in random graphs:
1: Count motifs in the given graph $G^{real}$.
2: Generate random graphs with similar statistics (e.g. number of nodes, edges, degree sequence), and count motifs in the random graphs.
3: Use statistical measures (Z-score) to evaluate how significant each motif is.

Z-score for statistical significance

The Z-score of motif $i$:
$Z_i=\cfrac{N_i^{real}-\bar N_i^{rand}}{\mathrm{std}(N_i^{rand})}$
where $N_i^{real}$ is the count of motif $i$ in $G^{real}$ and $\bar N_i^{rand}$ is its average count over the random graphs.
The network significance profile (SP) is then:
$SP_i=\cfrac{Z_i}{\sqrt{\sum_jZ^2_j}}$
$SP$ is a vector of normalized Z-scores
The dimension depends on number of motifs considered
$SP$ emphasizes relative significance of subgraphs:
§ Important for comparison of networks of different sizes
§ Generally, larger graphs display higher Z-scores
SP normalizes the Z-scores across all motif types, while each individual Z-score measures one motif type:
Negative values indicate under-representation
Positive values indicate over-representation
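The two formulas translate directly into code. In this sketch the motif counts are made-up numbers, the function names are invented, and the population standard deviation (`statistics.pstdev`) is an assumption, since the notes only write std; a motif whose random counts never vary would cause a division by zero.

```python
import math
from statistics import mean, pstdev


def z_scores(real_counts, random_counts):
    """real_counts[i]: count of motif i in the real graph.
    random_counts[i]: counts of motif i across the random graphs."""
    return [(n_real - mean(samples)) / pstdev(samples)
            for n_real, samples in zip(real_counts, random_counts)]


def significance_profile(z):
    """Normalize the Z-score vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in z))
    return [x / norm for x in z]


# Hypothetical counts for two motifs, each measured on two random graphs.
z = z_scores(real_counts=[12, 2], random_counts=[[4, 8], [4, 8]])
sp = significance_profile(z)
```

Here the first motif is over-represented (positive Z-score) and the second is under-represented (negative Z-score), and `sp` has unit length regardless of how large the raw counts are.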

example of significance profile

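Motifs behave differently in different domains, and networks from the same domain have similar SPs.
[Figure]
The horizontal axis lists the 13 motifs; the vertical axis shows the SP value of each motif in four broad classes of networks. A larger SP means the motif is more over-represented.
A clear example is the third class, which includes social networks: the triangle (motif 13) is the most over-represented, because friends of friends usually know each other. Motif 6, by contrast, is under-represented.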

Motif extensions and variations

Extensions:
§ Directed and undirected
§ Colored and uncolored: colors correspond to different node types, as in the figure:
[Figure]
§ Temporal and static motifs

Variations:
§ Different frequency concepts
§ Different significance metrics
§ Under-Representation (anti-motifs)
§ Different null models
[Figure]

Neural Subgraph Matching / Representations

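The subgraph matching task
The target graph can be disconnected; the query graph must be connected.
We turn the task into a binary classification problem:
Return True if query is isomorphic to a subgraph of the target graph, else return False.
[Figure]
Note: this yes/no decision is easier than finding the node correspondences shown as dashed lines in the figure. Finding the correspondences is set aside for now and covered later.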

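Overall pipeline

1. Decompose the input graph into subgraphs (neighborhoods)
2. Compute an embedding for each decomposed subgraph
3. Compare the query embedding with the embeddings of the decomposed subgraphs to make a prediction
[Figure]
The decomposition step is explained next.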

Neural architecture for subgraphs

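Before comparing subgraphs, we need a few definitions commonly used in the neural architecture:
1. Node-anchored definitions: the comparison is made relative to an anchor node
[Figure]
Node-anchored neighborhoods: from an anchor we can obtain the information in its k-hop neighborhood

Both of these match how a GNN works:

We can use a GNN to compute the embeddings of the two anchors and use the embeddings to decide whether the anchors' neighborhoods are isomorphic. The same approach also gives the correspondence (mapping) between the two anchors.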

Decomposing $G_T$

In practice this means obtaining the neighborhood of every node in the graph.
For each node in $G_T$:
§ Obtain a k-hop neighborhood around the anchor
§ Can be performed using breadth-first search (BFS)
§ The depth $k$ is a hyper-parameter (e.g. 3)
Note: Larger depth results in more expensive model
The same procedure is applied to the query graph $G_Q$.
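The k-hop neighborhood extraction is a depth-limited BFS. A minimal sketch, assuming the graph is given as an adjacency dict (`adj`, a name chosen for this note):

```python
from collections import deque


def k_hop_neighborhood(adj, anchor, k):
    """Return the set of nodes within k hops of the anchor (adj: node -> neighbors)."""
    dist = {anchor: 0}
    queue = deque([anchor])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue                    # do not expand beyond depth k
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)


# Path graph 0 - 1 - 2 - 3 - 4
PATH = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
hood = k_hop_neighborhood(PATH, anchor=0, k=2)
```

Running this once per node of $G_T$ yields the set of anchored neighborhoods that are embedded in the next step.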

Order embedding space

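Map a graph $A$ to a point $Z_A$ in a high-dimensional (e.g. 64-dimensional) embedding space, assuming all coordinates are non-negative. This space can capture the partial ordering (transitivity) of graphs. For example:
We write ${\color{Yellow} \blacksquare }\preceq {\color{Green} \blacksquare}$ when the embedding of ${\color{Yellow} \blacksquare}$ is less than or equal to that of ${\color{Green} \blacksquare}$ in every dimension (illustrated in two dimensions in the figure):
[Figure]
An embedding that lies to the lower left of the green square, like the yellow square, represents a subgraph of the green square's graph.

By transitivity:
$\text{if }{\color{Yellow} \blacksquare }\preceq {\color{Green} \blacksquare},{\color{Green} \blacksquare }\preceq {\color{Red} \blacksquare}\text{ then }{\color{Yellow} \blacksquare }\preceq {\color{Red} \blacksquare}$
This space is used to decide subgraph relations:
[Figure]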

Properties of the order embedding space

The order embedding space represents the subgraph isomorphism relation well, and mirrors three of its properties:
1. Transitivity: if $G_1$ is a subgraph of $G_2$ and $G_2$ is a subgraph of $G_3$, then $G_1$ is a subgraph of $G_3$.
$\text{if }{\color{Yellow} \blacksquare }\preceq {\color{Green} \blacksquare},{\color{Green} \blacksquare }\preceq {\color{Red} \blacksquare}\text{ then }{\color{Yellow} \blacksquare }\preceq {\color{Red} \blacksquare}$
2. Anti-symmetry: if $G_1$ is a subgraph of $G_2$ and $G_2$ is a subgraph of $G_1$, then $G_1$ and $G_2$ are isomorphic.
$\text{if }{\color{Yellow} \blacksquare }\preceq {\color{Green} \blacksquare}\text{ and }{\color{Green} \blacksquare }\preceq {\color{Yellow} \blacksquare}\text{ then }{\color{Yellow} \blacksquare }= {\color{Green} \blacksquare}$
3. Closure under intersection: the trivial graph with a single node is a subgraph of every graph. Correspondingly, because all coordinates are non-negative, the origin in the lower-left corner lies below every embedding. There is also a corollary.
Corollary:
$\text{if }{\color{Yellow} \blacksquare }\preceq {\color{Green} \blacksquare},{\color{Green} \blacksquare }\preceq {\color{Red} \blacksquare}\text{ then }{\color{Yellow} \blacksquare }\text{ has a valid embedding}$
The corollary is illustrated in the rightmost panel of the figure.
[Figure]

Order Constraint

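How do we make the graph embeddings learned by the GNN satisfy the order constraint, so that subgraph relations are reflected in the space? We design a loss function for it.
[Figure]

The loss is intuitive. If $G_q$ is a subgraph of $G_t$, then every dimension $i$ must satisfy $z_q[i]\leq z_t[i]$ (the slide has a subscript typo; it should be $t$).
We train with a max-margin loss. The amount by which a pair violates the constraint is measured by:
$E(G_q,G_t)=\sum_{i=1}^D(\max(0,z_q[i]-z_t[i]))^2$
If $z_q[i]\leq z_t[i]$, then $\max(0,z_q[i]-z_t[i])=0$, so $E(G_q,G_t)=0$ exactly when the constraint holds in every dimension.
[Figure]
With this measure, the goal is to learn graph embeddings ($z_q,z_t$) that minimize the loss.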
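The violation measure $E$ is a one-liner. The sketch below uses plain Python lists as stand-ins for the GNN embeddings; the function name and the example vectors are made up.

```python
def order_violation(z_q, z_t):
    """E(G_q, G_t): sum over dimensions of max(0, z_q[i] - z_t[i]) squared."""
    return sum(max(0.0, q - t) ** 2 for q, t in zip(z_q, z_t))


inside = order_violation([1.0, 2.0], [2.0, 3.0])   # z_q <= z_t in every dimension
outside = order_violation([3.0, 1.0], [1.0, 2.0])  # dimension 0 violates by 2
```

Only the dimensions where the query exceeds the target contribute, so `inside` is zero and `outside` is the squared size of the single violation.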

Training neural subgraph matching

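Since this is a max-margin loss, training needs a margin, denoted $\alpha$. First build a dataset of training pairs $(G_q,G_t)$, containing both pairs where $G_q$ is a subgraph of $G_t$ and pairs where it is not.
Positive examples: minimize $E(G_q,G_t)$
Negative examples: minimize $\max\left(0,\alpha-E(G_q,G_t)\right)$
The margin keeps the model from pushing negative pairs ever further apart. Negative pairs are not in a subgraph relation, but once their violation exceeds $\alpha$ they contribute no loss, so training can focus on the pairs that are still misclassified.
Put simply, for a negative example a violation of at least $\alpha$ is enough; pushing it further brings no benefit.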
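Putting the positive and negative terms together gives the training objective. This is a sketch on fixed toy vectors; in real training the embeddings would come from the GNN and the loss would be minimized by gradient descent. The function names, the `(z_q, z_t, is_subgraph)` triple format, and the default `alpha=1.0` are assumptions of this note.

```python
def order_violation(z_q, z_t):
    """E(G_q, G_t) = sum_i max(0, z_q[i] - z_t[i])^2"""
    return sum(max(0.0, q - t) ** 2 for q, t in zip(z_q, z_t))


def max_margin_loss(pairs, alpha=1.0):
    """pairs: iterable of (z_q, z_t, is_subgraph) triples."""
    total = 0.0
    for z_q, z_t, is_subgraph in pairs:
        e = order_violation(z_q, z_t)
        # Positives are pulled toward E = 0; negatives only need E >= alpha.
        total += e if is_subgraph else max(0.0, alpha - e)
    return total


good_positive = ([1.0, 1.0], [2.0, 2.0], True)    # E = 0, no loss
far_negative = ([3.0, 1.0], [1.0, 2.0], False)    # E = 4 >= alpha, no loss
near_negative = ([1.0, 1.0], [2.0, 2.0], False)   # E = 0, loss = alpha

loss = max_margin_loss([good_positive, far_negative, near_negative], alpha=1.0)
```

Only the negative pair that still looks like a subgraph (`near_negative`) contributes to the loss, which is the behavior the margin is meant to produce.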

Dataset Construction

Building the dataset: generate $G_Q$ and $G_T$ from a large graph $G$.

1. Get $G_T$ by choosing a random anchor $v$ and taking all nodes in $G$ within distance $K$ from $v$ to be in $G_T$
2. Positive examples: use BFS sampling to get $G_Q$. Sample an induced subgraph of $G_T$:
2.1 Initialize $S=\{v\},V=\varnothing$
2.2 Let $N(S)$ be all neighbors of nodes in $S$. At every step, sample 10% of the nodes in $N(S)\setminus V$ and place them in $S$. Place the remaining nodes of $N(S)$ in $V$.
2.3 After $K$ steps, take the subgraph of $G$ induced by $S$ anchored at $v$
3. Negative examples ($G_Q$ not a subgraph of $G_T$): "corrupt" $G_Q$ by adding/removing nodes/edges so it is no longer a subgraph.
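The BFS sampling for positive examples (steps 2.1 to 2.3) might look like the sketch below. Several details are assumptions made for this note rather than part of the lecture: the sampled fraction is rounded up so that at least one node is taken per step, nodes already in $S$ are excluded from the frontier, and node labels are assumed to be sortable.

```python
import math
import random


def sample_query(adj, anchor, steps, frac=0.1, seed=None):
    """Grow S from the anchor for `steps` rounds; return S and its induced edges."""
    rng = random.Random(seed)
    s, v = {anchor}, set()
    for _ in range(steps):
        frontier = {w for u in s for w in adj[u]} - s - v
        if not frontier:
            break
        take = max(1, math.ceil(frac * len(frontier)))   # assumption: at least one node
        chosen = set(rng.sample(sorted(frontier), take))
        s |= chosen
        v |= frontier - chosen          # rejected nodes are never reconsidered
    edges = {(a, b) for a in s for b in adj[a] if b in s and a < b}
    return s, edges


# Toy target graph: a cycle on 6 nodes.
CYCLE = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
q_nodes, q_edges = sample_query(CYCLE, anchor=0, steps=3, seed=0)
```

Every sampled node is adjacent to a node already in $S$, so the resulting query is connected and contains the anchor, as the matching task requires.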

[Figure]

Training Detail

How many training examples to sample?
§ At every iteration, we sample new training pairs
§ Benefit: Every iteration, the model sees different subgraph examples
§ Improves performance and avoids overfitting – since there are exponential number of possible subgraphs to sample from
How deep is the BFS sampling?
§ A hyper-parameter that trades off runtime and performance
§ Usually use 3-5, depending on size of the dataset

Test Detail

Given: query graph $G_q$ anchored at node $q$, target graph $G_t$ anchored at node $t$

Goal: output whether the query is a node anchored subgraph of the target

Procedure:
If $E(G_q,G_t)<\epsilon$ , predict “True”; else “False”
𝜖 is a hyper-parameter
To check if $G_Q$ is isomorphic to a subgraph of $G_T$ , repeat this procedure for all $𝑞 ∈ 𝐺_Q, 𝑡 ∈𝐺_T$ . Here $𝐺_q$ is the neighborhood around node $𝑞 ∈ 𝐺_Q$ .

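Summary

Embedding graphs in an order embedding space gives a learned, approximate way to tackle the NP-hard subgraph isomorphism problem: the subgraph test reduces to a coordinate-wise comparison of embeddings.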

Mining / Finding Frequent Motifs / Subgraphs

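There are two challenges:
1. Enumerating all motifs (connected subgraphs) of size k
[Figure]
2. Counting how many times each motif occurs in the graph.

The number of candidate patterns explodes combinatorially, and counting each one requires subgraph isomorphism, which is NP-hard: one hard problem nested inside another.
Solution: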

Representation Learning

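Problem 1 is handled by organizing the search space: instead of enumerating all k-node subgraphs directly, start from a small subgraph and add one node at a time until it has k nodes (details below).
Problem 2 is handled by the GNN-based subgraph matching classifier from the previous part.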

Setup

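Given a target graph $G_T$, a motif size $k$, and the desired number of results $r$:
Among all motifs of size $k$, identify the $r$ motifs that occur most frequently in $G_T$.
Frequency here uses the node-level definition introduced earlier:

[Figure]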

SPMiner: overview

SPMiner: a neural model to identify frequent motifs
在这里插入图片描述
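The input is $G_T$.
It is decomposed into anchored neighborhoods.
A GNN embeds the neighborhoods into the order embedding space.
Finally, a motif is grown step by step, searching the order embedding space for the motif with the highest frequency.
The two middle steps were covered in the previous part,
so we focus on the idea behind the last step.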

Motif Frequency Estimation

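Randomly sample a set of subgraphs $G_{N_i}$ from $G_T$, represented as node-anchored neighborhoods.
Then estimate the frequency of $G_Q$ by counting the number of $G_{N_i}$ that satisfy:
$Z_Q\preceq Z_{N_i}$
This estimate relies on the ordering (transitivity) property of the order embedding space described earlier:
$\text{if }{\color{Yellow} \blacksquare }\preceq {\color{Green} \blacksquare},{\color{Green} \blacksquare }\preceq {\color{Red} \blacksquare}\text{ then }{\color{Yellow} \blacksquare }\preceq {\color{Red} \blacksquare}$
[Figure]
The advantage of this approach is speed: the embeddings $Z_{N_i}$ can be precomputed, and what remains is only coordinate-wise comparisons.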
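The frequency estimate is just a count of precomputed neighborhood embeddings that dominate the query embedding. In this sketch the vectors are made-up stand-ins for GNN outputs, and the tolerance `eps` is a hypothetical parameter added to allow for small numerical slack.

```python
def estimate_frequency(z_q, neighborhood_embs, eps=0.0):
    """Count neighborhoods whose embedding is >= z_q in every dimension."""
    return sum(
        all(q <= n + eps for q, n in zip(z_q, z_n))
        for z_n in neighborhood_embs
    )


# Toy 2-D embeddings of three sampled neighborhoods.
NEIGHBORHOODS = [[2.0, 2.0], [0.5, 3.0], [1.0, 1.0]]
freq = estimate_frequency([1.0, 1.0], NEIGHBORHOODS)
```

The second neighborhood fails because its first coordinate is smaller than the query's, so two of the three neighborhoods are counted as containing the query.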

SPMiner search procedure

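The goal of the steps below is to maximize the number of points in the red shaded region, i.e., the neighborhood embeddings that lie to the upper right of the motif's embedding.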

Initial step

Start by randomly picking a starting node 𝑢 in the target graph. Set $S=\{u\}$
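As noted earlier under closure under intersection, the single-node graph is a subgraph of every graph. So at the start, the embeddings of all neighborhoods sampled from the target graph $G_T$ lie to the upper right of the embedding of node $u$.
[Figure]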

Iteratively

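Grow a motif by iteratively choosing a neighbor of a node in $S$, and adding that node to $S$. We want to grow motifs to find larger motifs!
A motif with only two nodes is not meaningful, so the discussion starts from three nodes.
The node added to the motif is chosen from the blue nodes in the figure, which avoids enumerating all candidate motifs.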
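The growth loop can be sketched as a greedy search. This is an assumption-heavy illustration: the greedy choice, the function name, and the `score` callback are invented for this note. In SPMiner the score would be the estimated frequency of the candidate motif in the order embedding space; here a placeholder score (the number of edges inside the node set) stands in for it so the example runs without a trained GNN.

```python
def grow_motif(adj, start, k, score):
    """Greedily add the neighboring node that maximizes score(S) until |S| = k."""
    s = {start}
    while len(s) < k:
        candidates = {w for u in s for w in adj[u]} - s
        if not candidates:
            break                        # nothing left to add in this component
        best = max(sorted(candidates), key=lambda w: score(s | {w}))
        s.add(best)
    return s


# Toy graph: triangle 0-1-2 with a tail node 3 attached to node 2.
GRAPH = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}


def internal_edges(nodes):
    """Placeholder score (NOT the SPMiner objective): edges inside the node set."""
    return sum(1 for a in nodes for b in GRAPH[a] if b in nodes and a < b)


motif = grow_motif(GRAPH, start=0, k=3, score=internal_edges)
```

Because candidates are always neighbors of nodes already in $S$, every intermediate motif stays connected, and only a handful of candidates is scored per step instead of all k-node subgraphs.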

[Figure]