Graph Machine Learning Fundamentals: CS224W (12-motifs)

CS224W: Machine Learning with Graphs

Stanford / Winter 2021

12-motifs

Subgraphs and Motifs


Definition: Subgraph


  • Node-induced subgraph

    • Take a subset of the nodes and all edges induced by those nodes (i.e., every edge of the original graph whose endpoints both lie in the chosen node set)

    • $G' = (V', E')$ is a node-induced subgraph iff

      • $V' \subseteq V$

      • $E' = \{(u, v) \in E \mid u, v \in V'\}$

      • $G'$ is the subgraph of $G$ induced by $V'$

  • Edge-induced subgraph

    • Take a subset of the edges and all nodes incident to those edges (a concrete sketch of both definitions follows this list)

      • $G' = (V', E')$ is an edge-induced subgraph iff

        • $E' \subseteq E$

        • $V' = \{v \in V \mid (v, u) \in E' \text{ for some } u\}$
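
To make the two definitions concrete, here is a minimal sketch using networkx (the toy graph and the chosen node/edge subsets are arbitrary):

```python
import networkx as nx

# A toy graph: a 4-cycle with one chord.
G = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)])

# Node-induced subgraph: keep V' and every edge of G with both endpoints in V'.
node_induced = G.subgraph([1, 2, 3])
print(sorted(node_induced.edges()))  # [(1, 2), (1, 3), (2, 3)]

# Edge-induced subgraph: keep E' and every node incident to some edge in E'.
edge_induced = G.edge_subgraph([(1, 2), (3, 4)])
print(sorted(edge_induced.nodes()))  # [1, 2, 3, 4]
```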

  • The best definition depends on the domain

    • Chemistry: node-induced (functional groups)

    • Knowledge graphs: Often edge-induced (focus is on edges representing logical relations)

  • Graph Isomorphism: Check whether two graphs are identical

    • $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic if there exists a bijection $f: V_1 \rightarrow V_2$ such that $(u, v) \in E_1$ iff $(f(u), f(v)) \in E_2$ (two graphs are isomorphic if there is a bijection that maps the nodes of one graph onto the nodes of the other while preserving all edges)


  • In both definitions above, the subgraph's nodes and edges come from the original graph $G$. What if they come from a completely different graph?

    • $G_2$ is subgraph-isomorphic to $G_1$ if some subgraph of $G_2$ is isomorphic to $G_1$

      • We can also simply say that $G_1$ is a subgraph of $G_2$

      • We can use either the node-induced or edge-induced definition of subgraph

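networkx can check both full graph isomorphism and (node-induced) subgraph isomorphism; a small sketch with arbitrary example graphs:

```python
import networkx as nx
from networkx.algorithms import isomorphism

# Two graphs with different node labels but identical structure (a 4-cycle).
G1 = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 1)])
G2 = nx.Graph([("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")])
print(nx.is_isomorphic(G1, G2))  # True

# Subgraph isomorphism (node-induced): is a 2-edge path contained in the cycle?
pattern = nx.path_graph(3)  # path on 3 nodes
gm = isomorphism.GraphMatcher(G1, pattern)
print(gm.subgraph_is_isomorphic())  # True
```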

  • Case Example of Subgraphs


Network Motifs


  • Network Motifs: recurring, significant patterns of interconnections

    • Pattern: Small (node-induced) subgraph

    • Recurring: Found many times, i.e., with high frequency (How to define frequency?)

    • Significant: More frequent than expected, i.e., in randomly generated graphs (How to define random graphs?)

Subgraph Frequency


  • Graph-level Subgraph Frequency Definition

    Let $G_Q$ be a small graph and $G_T$ be a target graph dataset

    • Frequency of $G_Q$ in $G_T$: the number of unique subsets of nodes $V_T$ of $G_T$ for which the subgraph of $G_T$ induced by the nodes $V_T$ is isomorphic to $G_Q$

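A brute-force sketch of this graph-level definition (it enumerates all node subsets of size $|G_Q|$, so it is only usable for tiny patterns; the $K_4$/triangle example is illustrative):

```python
import itertools
import networkx as nx

def graph_level_frequency(G_T: nx.Graph, G_Q: nx.Graph) -> int:
    """Count node subsets of G_T whose induced subgraph is isomorphic to G_Q."""
    k = G_Q.number_of_nodes()
    count = 0
    for nodes in itertools.combinations(G_T.nodes(), k):
        if nx.is_isomorphic(G_T.subgraph(nodes), G_Q):
            count += 1
    return count

# A triangle appears 4 times in the complete graph K4 (one per node triple).
print(graph_level_frequency(nx.complete_graph(4), nx.complete_graph(3)))  # 4
```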

  • Node-level Subgraph Frequency Definition

    Let $G_Q$ be a small graph, $v$ be a node in $G_Q$ (the "anchor"), and $G_T$ be a target graph dataset

    • Frequency of $(G_Q, v)$ in $G_T$: the number of nodes $u$ in $G_T$ for which some subgraph of $G_T$ is isomorphic to $G_Q$ and the isomorphism maps $u$ to $v$ (each node of $G_T$ is counted at most once as an anchor)

    • Let $(G_Q, v)$ be called a node-anchored subgraph

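A brute-force sketch of the node-anchored definition using networkx's GraphMatcher (exponential-time, for tiny patterns only; the star/path example is illustrative):

```python
import networkx as nx
from networkx.algorithms import isomorphism

def node_level_frequency(G_T: nx.Graph, G_Q: nx.Graph, anchor) -> int:
    """Count distinct nodes u of G_T that can play the role of `anchor`
    in some node-induced copy of G_Q inside G_T."""
    anchored = set()
    gm = isomorphism.GraphMatcher(G_T, G_Q)
    for mapping in gm.subgraph_isomorphisms_iter():
        # mapping: nodes of G_T (one induced copy) -> nodes of G_Q
        for u, q in mapping.items():
            if q == anchor:
                anchored.add(u)
    return len(anchored)

# Star graph: only the center can be the middle node of a 3-node path.
G_T = nx.star_graph(3)   # center 0, leaves 1, 2, 3
G_Q = nx.path_graph(3)   # nodes 0-1-2; anchor the middle node, 1
print(node_level_frequency(G_T, G_Q, anchor=1))  # 1
```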

Motif Significance


  • To define significance we need a null model (i.e., a point of comparison)

  • Key Idea: Subgraphs that occur in a real network much more often than in a random network have functional significance (if a subgraph appears more frequently in the real network than it does, on average, in randomly generated graphs with similar statistics, the subgraph is significant)

  • Erdős–Rényi (ER) random graphs

    • $G_{n,p}$: undirected graph on $n$ nodes where each edge $(u, v)$ appears i.i.d. with probability $p$


  • Configuration Model

    • Generate a random graph with a given degree sequence $k_1, k_2, \ldots, k_N$ (the degree sequence of the real network is preserved)

    • Useful as a “null” model of networks

      Null Model: Each $G^{rand}$ has the same #(nodes), #(edges), and degree distribution as $G^{real}$

      • We can compare the real network $G^{real}$ with a random $G^{rand}$ that has the same degree sequence as $G^{real}$

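Both null models are available in networkx; a minimal sketch (graph sizes and the use of the karate club graph as $G^{real}$ are illustrative):

```python
import networkx as nx

# Erdős–Rényi: n nodes, each possible edge present independently with prob p.
G_er = nx.gnp_random_graph(n=100, p=0.05, seed=0)

# Configuration model: random graph with a prescribed degree sequence,
# here the degree sequence of a small real network.
G_real = nx.karate_club_graph()
deg_seq = [d for _, d in G_real.degree()]
G_rand = nx.configuration_model(deg_seq, seed=0)

# The configuration model can create self-loops and parallel edges;
# a common cleanup is to collapse to a simple graph.
G_rand = nx.Graph(G_rand)
G_rand.remove_edges_from(nx.selfloop_edges(G_rand))
```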

  • Intuition: Motifs are overrepresented in a network when compared to random graphs

    • Step 1: Count motifs in the given graph ($G^{real}$)

    • Step 2: Generate random graphs with similar statistics (e.g., number of nodes, edges, degree sequence), and count motifs in the random graphs

    • Step 3: Use statistical measures to evaluate how significant each motif is

      • Use the Z-score

  • Z-score for Statistical Significance

    • $Z_i$ captures the statistical significance of motif $i$:

      $Z_i = \left(N_i^{\text{real}} - \bar{N}_i^{\text{rand}}\right) / \operatorname{std}\left(N_i^{\text{rand}}\right)$

      where $N_i^{\text{real}}$ is the count of motif $i$ in the original graph $G^{real}$, $\bar{N}_i^{\text{rand}}$ is the average count of motif $i$ over the random graphs, and $\operatorname{std}(N_i^{\text{rand}})$ is the standard deviation of that count over the random graphs

    • Negative values indicate under-representation

    • Positive values indicate over-representation

  • Network significance profile (SP)

    $SP_i = Z_i / \sqrt{\sum_{j} Z_j^2}$

    • SP is a vector of normalized Z-scores

    • The dimension depends on the number of motifs considered

    • SP emphasizes the relative significance of subgraphs
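
A small sketch of computing Z-scores and the significance profile from motif counts, assuming counts have already been obtained for the real graph and for a set of random graphs (the numbers below are made up):

```python
import numpy as np

def z_scores(counts_real, counts_rand):
    """counts_real: (num_motifs,) motif counts in G_real.
    counts_rand: (num_random_graphs, num_motifs) motif counts per random graph."""
    counts_real = np.asarray(counts_real, dtype=float)
    counts_rand = np.asarray(counts_rand, dtype=float)
    return (counts_real - counts_rand.mean(axis=0)) / counts_rand.std(axis=0)

def significance_profile(z):
    """SP: the Z-score vector normalized to unit length."""
    z = np.asarray(z, dtype=float)
    return z / np.sqrt((z ** 2).sum())

# 3 motif types, 4 random graphs (illustrative counts only).
Z = z_scores([120, 15, 40],
             [[80, 20, 41], [90, 18, 39], [85, 22, 40], [88, 19, 42]])
print(significance_profile(Z))
```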

  • Example: networks from the same domain tend to have similar significance profiles

Neural Subgraph Matching


  • Task: given a query graph and a target graph, decide whether a subgraph of the target graph is isomorphic to the query graph

    • Because exact subgraph matching is NP-hard, we use a neural network to perform (approximate) subgraph matching

  • Neural Architecture for Subgraphs

    • We use node-anchored neighborhoods: a GNN computes representations of $u$ and $v$ (the anchors of the query and the target, respectively), and we predict whether the neighborhood structure of node $u$ is isomorphic to the neighborhood structure of node $v$

    • Why Anchor?

      • We not only predict whether a mapping exists, but also identify the corresponding nodes ($u$ and $v$)


  • Decomposing $G_T$ into Neighborhoods

    • For each node in $G_T$:

      • Obtain a k-hop neighborhood around the anchor

      • Can be performed using breadth-first search (BFS)

      • The depth $k$ is a hyper-parameter (e.g., 3)

        • Larger depth results in a more expensive model

    • The same procedure is applied to $G_Q$ to obtain its neighborhoods

    • We embed the neighborhoods using a GNN

      • By computing the embeddings for the anchor nodes in their respective neighborhoods (this corresponds exactly to the message-passing computation of a GNN; a decomposition sketch follows)
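
A minimal sketch of the neighborhood decomposition step, assuming networkx's ego_graph as the k-hop BFS (the example graph is arbitrary):

```python
import networkx as nx

def decompose(G: nx.Graph, k: int = 3) -> dict:
    """Extract the k-hop node-anchored neighborhood around every node via BFS."""
    return {v: nx.ego_graph(G, v, radius=k) for v in G.nodes()}

# Example: 2-hop neighborhoods of every node in a small benchmark graph.
neighborhoods = decompose(nx.karate_club_graph(), k=2)
print(len(neighborhoods[0]))  # number of nodes in node 0's 2-hop neighborhood
```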

Order Embedding Space

  • Capture partial ordering

    A graph whose embedding lies to the lower-left is a subgraph of the graph whose embedding lies to the upper-right


  • Why Order Embedding Space?

    The subgraph isomorphism relationship can be nicely encoded in order embedding space (each of the following properties maps exactly onto a property of the order embedding space)

    • Transitivity: if $G_1$ is a subgraph of $G_2$ and $G_2$ is a subgraph of $G_3$, then $G_1$ is a subgraph of $G_3$

    • Anti-symmetry: if $G_1$ is a subgraph of $G_2$ and $G_2$ is a subgraph of $G_1$, then $G_1$ and $G_2$ are isomorphic

    • Closure under intersection: the trivial graph with one node is a subgraph of every graph


  • Loss Function: Order Constraint

    • max-margin loss, based on the order-violation penalty

      $E(G_q, G_t) = \sum_{i=1}^{D}\left(\max\left(0, z_q[i] - z_t[i]\right)\right)^2$

      $E(G_q, G_t) = 0$ exactly when $z_q[i] \leq z_t[i]$ for all $i$, i.e., when the order constraint indicates that $G_q$ is a subgraph of $G_t$


  • Training Neural Subgraph Matching

    • To learn such embeddings, construct training examples ( G q , G t ) (G_q, G_t) (Gq,Gt) where half the time, G q G_q Gq is a subgraph of G t G_t Gt, and the other half, it is not

    • Train on these examples by minimizing the following max-margin loss

      • For positive examples: Minimize $E(G_q, G_t)$ when $G_q$ is a subgraph of $G_t$

      • For negative examples: Minimize $\max(0, \alpha - E(G_q, G_t))$

        • Max-margin loss prevents the model from learning the degenerate strategy of moving embeddings further and further apart forever
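
A minimal PyTorch sketch of the order-violation penalty $E$ and the max-margin training loss described above (tensor shapes, the toy batch, and the margin value $\alpha$ are illustrative):

```python
import torch

def order_violation(z_q: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
    """E(G_q, G_t): squared penalty on coordinates where z_q exceeds z_t."""
    return torch.clamp(z_q - z_t, min=0).pow(2).sum(dim=-1)

def max_margin_loss(z_q, z_t, is_subgraph, alpha: float = 1.0):
    """is_subgraph: 1.0 for positive pairs (G_q subgraph of G_t), 0.0 for negatives."""
    e = order_violation(z_q, z_t)
    loss = is_subgraph * e + (1.0 - is_subgraph) * torch.clamp(alpha - e, min=0)
    return loss.mean()

# Toy batch: the first pair satisfies the order constraint, the second violates it.
z_q = torch.tensor([[0.1, 0.2], [0.9, 0.8]])
z_t = torch.tensor([[0.5, 0.6], [0.2, 0.1]])
labels = torch.tensor([1.0, 0.0])
print(max_margin_loss(z_q, z_t, labels))
```
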
  • Training Example Construction


  • Training Details


  • Subgraph Predictions on New Graphs


Mining Frequent Motifs


  • Finding Frequent Subgraphs: for a given size $k$, find the subgraphs that occur most frequently in the target graph (node-anchored subgraph matching is used for counting). This requires:

    • Enumerating all size-k connected subgraphs

    • Counting #(occurrences of each subgraph type)

  • Why is it Hard?

    • Even enumerating all possible size-$k$ subgraph patterns requires substantial computation

    • Deciding whether one graph is a subgraph of another is NP-hard, so counting occurrences is also very expensive

SPMiner


  • Overview: a neural model to identify frequent motifs


  • Key Idea: after order embedding, the frequency of a size-$k$ subgraph anchored at a given node can be computed conveniently in the target graph: simply count how many neighborhood embeddings lie in the region to the upper-right of the subgraph's embedding in the embedding space (see the sketch below)

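Given neighborhood embeddings, the frequency estimate reduces to a coordinate-wise dominance count; a minimal sketch (the 2-D embeddings below are made up):

```python
import torch

def frequency_estimate(z_query: torch.Tensor, z_neighborhoods: torch.Tensor) -> int:
    """Count neighborhoods whose embedding dominates the query embedding in
    every coordinate, i.e., lies to its 'upper-right' in order embedding space."""
    return (z_neighborhoods >= z_query).all(dim=-1).sum().item()

z_q = torch.tensor([0.2, 0.3])
z_n = torch.tensor([[0.5, 0.4], [0.1, 0.9], [0.3, 0.3], [0.6, 0.8]])
print(frequency_estimate(z_q, z_n))  # 3: all but [0.1, 0.9] dominate z_q
```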

  • SPMiner Search Procedure

    SPMiner uses a search strategy that starts from a size-1 subgraph and grows it until the required size $k$ is reached, aiming to arrive at a size-$k$ subgraph of maximal frequency

    • Initial step: pick a node $u$ of the target graph at random and set $S = \{u\}$


    • Iteratively: at each step, pick a neighbor of some node in $S$ and add it to $S$, gradually growing the motif


      • At each iteration, we want the growing motif to remain as frequent as possible

      • Total violation of a subgraph $G$: the number of neighborhoods that do not contain $G$

        • That is, the number of neighborhoods $G_{N_i}$ that do not satisfy $Z_Q \leqslant Z_{N_i}$

        • Minimizing total violation = maximizing frequency

      • Based on this definition, we can use a greedy heuristic: at each iteration, add the neighbor node that minimizes the total violation (a sketch of this loop follows below)

    • Termination: stop upon reaching the specified motif size $k$; the number of neighborhood embeddings lying to the upper-right of the motif's embedding is then the estimated frequency of the size-$k$ motif

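A sketch of the greedy growth loop under stated assumptions: `embed(subgraph, anchor)` is a hypothetical function returning the order embedding of a node-anchored subgraph, and `z_neigh` holds precomputed embeddings of all decomposed neighborhoods of the target graph:

```python
import networkx as nx
import torch

def total_violation(z_Q: torch.Tensor, z_neigh: torch.Tensor) -> int:
    """Number of neighborhood embeddings that do NOT dominate z_Q."""
    return (~(z_neigh >= z_Q).all(dim=-1)).sum().item()

def grow_motif(G_T: nx.Graph, u, k: int, embed, z_neigh: torch.Tensor) -> set:
    """SPMiner-style greedy growth: start from S = {u}; at each step add the
    neighbor whose inclusion keeps the total violation smallest."""
    S = {u}
    while len(S) < k:
        candidates = {v for s in S for v in G_T.neighbors(s)} - S
        best = min(candidates,
                   key=lambda v: total_violation(embed(G_T.subgraph(S | {v}), u),
                                                 z_neigh))
        S.add(best)
    return S
```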
