网络中,高阶链接模式是控制和调节复杂系统的基本结构,大部分高阶结构是指一个小的子图,这种小的子图是复杂系统的建筑块。例如,正反馈回路是调控网络的关键要素,三元组是社交网络的关键,双向开三角结构是大脑hub节点的关键,开三角结构是航空网络的关键模式。这里介绍高阶结构,并提出一种聚类框架。
给定一个网络模块
M
M
M,寻找一种聚类
S
S
S以满足两种目标。首先,节点应参与尽量多的模块
M
M
M,其次集合
S
S
S应避免破坏模块
M
M
M。即,给定模块
M
M
M,高阶聚类方法的目标是最小
(1)
ϕ
M
(
S
)
=
c
u
t
M
(
S
,
S
‾
)
/
m
i
n
[
v
o
l
M
(
S
)
,
v
o
l
M
(
S
‾
)
]
\phi_M(S)=cut_M(S,\overline{S})/min[vol_M(S),vol_M(\overline{S})]\tag{1}
ϕM(S)=cutM(S,S)/min[volM(S),volM(S)](1)
c
u
t
M
(
S
,
S
‾
)
cut_M(S,\overline{S})
cutM(S,S)是被分割开来的
M
M
M的个数,
v
o
l
M
(
S
)
vol_M(S)
volM(S)是
S
S
S中
M
M
M的节点数。算法借鉴谱聚类的方法,算法流程是
1.给定一个网络和一种模块
M
M
M,算出矩阵
W
M
W_M
WM,其中元素
(
i
,
j
)
(i,j)
(i,j)表示节点
i
,
j
i,j
i,j共现于
M
M
M中的次数。
2.计算由
W
M
W_M
WM产生的拉普拉斯谱排序
σ
\sigma
σ。
3.找到
σ
\sigma
σ排序的前
r
r
r个组成的集合作为
S
r
=
{
σ
1
,
…
,
σ
r
}
S_r=\{\sigma_1,\dots,\sigma_r\}
Sr={σ1,…,σr},依据是
S
:
=
a
r
g
m
i
n
r
ϕ
M
(
S
r
)
S:=argmin_r\phi_M(S_r)
S:=argminrϕM(Sr)。
我们的方法统一了模块分析和网络聚类这两个基础方向,揭示了组织结构和模块。之前的分析没有给出最差情况下的聚类保证,以及随网络规模增加而导致的复杂度,补充材料里的理论结果说明了超图聚类方法对于特殊的有向图是更加通用的。
考虑无向图
G
=
(
V
,
E
)
G=(V,E)
G=(V,E),
∣
V
∣
=
n
|V|=n
∣V∣=n,进一步假设没有独立点,
W
W
W记为图的权重,对角矩阵
D
D
D记为
D
i
i
=
∑
j
=
1
n
W
i
j
D_{ii}=\sum_{j=1}^nW_{ij}
Dii=∑j=1nWij,拉普拉斯矩阵为
L
=
D
−
W
L=D-W
L=D−W。对于集合
S
S
S,则
ϕ
(
G
)
(
S
)
\phi^{(G)}(S)
ϕ(G)(S)为
(3)
ϕ
(
G
)
(
S
)
=
c
u
t
(
G
)
(
S
,
S
‾
)
/
m
i
n
(
v
o
l
(
G
)
(
S
)
,
v
o
l
(
G
)
(
S
‾
)
)
\phi^{(G)}(S)=cut^{(G)}(S,\overline{S})/min(vol^{(G)}(S),vol^{(G)}(\overline{S}))\tag{3}
ϕ(G)(S)=cut(G)(S,S)/min(vol(G)(S),vol(G)(S))(3)
(4)
c
u
t
(
G
)
(
S
,
S
‾
)
=
∑
i
∈
S
,
j
∈
S
‾
W
i
j
cut^{(G)}(S,\overline{S})=\sum_{i\in S,j\in\overline{S}}W_{ij}\tag{4}
cut(G)(S,S)=i∈S,j∈S∑Wij(4)
(5)
v
o
l
(
G
)
(
S
)
=
∑
i
∈
S
D
i
i
vol^{(G)}(S)=\sum_{i\in S}D_{ii}\tag{5}
vol(G)(S)=i∈S∑Dii(5)
(6)
c
u
t
(
G
)
(
S
,
S
‾
)
=
w
e
i
g
h
t
e
d
s
u
m
o
f
e
d
g
e
s
t
h
a
t
a
r
e
c
u
t
cut^{(G)}(S,\overline{S})=weighted\ sum\ of\ edges\ that\ are\ cut\tag{6}
cut(G)(S,S)=weighted sum of edges that are cut(6)
(7)
v
o
l
(
G
)
(
S
)
=
w
e
i
g
h
t
e
d
n
u
m
b
e
r
o
f
e
d
g
e
e
n
d
p
o
i
n
t
s
i
n
S
vol^{(G)}(S)=weighted\ number\ of\ edge\ end\ points\ in\ S\tag{7}
vol(G)(S)=weighted number of edge end points in S(7)
x
x
x表示集合
S
S
S的向量,
x
i
=
1
x_i=1
xi=1表示节点
i
i
i在
S
S
S中,
x
i
=
0
x_i=0
xi=0表示节点
i
i
i在
S
‾
\overline{S}
S中,若边
(
i
,
j
)
(i,j)
(i,j)被分开,则
(
x
i
−
x
j
)
2
=
1
(x_i-x_j)^2=1
(xi−xj)2=1,反之
(
x
i
−
x
j
)
2
=
0
(x_i-x_j)^2=0
(xi−xj)2=0,于是
(9)
x
T
L
x
=
c
u
t
(
G
)
(
S
,
S
‾
)
=
∑
(
i
,
j
)
∈
E
w
i
j
(
x
i
−
x
j
)
2
x^TLx=cut^{(G)}(S,\overline{S})\tag{9}=\sum_{(i,j)\in E}w_{ij}(x_i-x_j)^2
xTLx=cut(G)(S,S)=(i,j)∈E∑wij(xi−xj)2(9)
在这里定义
(
B
,
A
)
(B,\mathcal{A})
(B,A),其中
B
B
B是一个
k
×
k
k\times k
k×k的二值矩阵,
A
\mathcal{A}
A是一个节点集合,
B
B
B编码连边模式,
A
\mathcal{A}
A表示模块的子集,在很多时候
A
表
示
节
点
的
整
个
集
合
\mathcal{A}表示节点的整个集合
A表示节点的整个集合。
s
e
t
(
⋅
)
set(\cdot)
set(⋅)将有序元组变成无序元组,
s
e
t
(
(
v
1
,
v
2
,
…
,
v
k
)
)
=
{
v
1
,
v
2
,
…
,
v
k
}
set((v_1,v_2,\dots,v_k))=\{v_1,v_2,\dots,v_k\}
set((v1,v2,…,vk))={v1,v2,…,vk},
(10)
M
(
B
,
A
)
=
{
(
s
e
t
(
v
)
,
s
e
t
(
χ
A
(
v
)
)
)
∣
v
∈
V
k
,
v
1
,
…
,
v
k
}
M(B,\mathcal{A})=\{(set(v),set(\chi_\mathcal{A}(v)))|v\in V^k,v_1,\dots,v_k\}\tag{10}
M(B,A)={(set(v),set(χA(v)))∣v∈Vk,v1,…,vk}(10)
若
χ
A
(
v
)
=
v
\chi_\mathcal{A}(v)=v
χA(v)=v,则为simple motifs,否则为anchored motifs。
公式(10)可以重新写为
(17)
c
u
t
M
(
G
)
(
S
,
S
‾
)
=
∑
(
v
,
χ
A
(
v
)
)
∈
M
1
(
∃
i
,
j
∈
χ
A
(
v
)
∣
i
∈
S
,
j
∈
S
‾
)
cut_M^{(G)}(S,\overline{S})=\sum_{(v,\chi_A(v))\in M}1(\exists i,j\in\chi_\mathcal{A}(v)|i \in S,j\in\overline{S})\tag{17}
cutM(G)(S,S)=(v,χA(v))∈M∑1(∃i,j∈χA(v)∣i∈S,j∈S)(17)
(18)
v
o
l
M
(
G
)
(
S
)
=
∑
(
v
,
χ
A
)
∈
M
∑
i
∈
χ
A
(
v
)
1
(
i
∈
S
)
vol_M^{(G)}(S)=\sum_{(v,\chi_\mathcal{A})\in M}\sum_{i\in\chi_\mathcal{A}(v)}1(i\in S)\tag{18}
volM(G)(S)=(v,χA)∈M∑i∈χA(v)∑1(i∈S)(18)
(20)
(
W
M
)
i
j
=
∑
(
v
,
χ
A
)
∈
M
1
(
{
i
,
j
}
)
⊂
χ
A
(
v
)
)
(W_M)_{ij}=\sum_{(v,\chi_\mathcal{A})\in M}1(\{i,j\})\subset\chi_\mathcal{A}(v))\tag{20}
(WM)ij=(v,χA)∈M∑1({i,j})⊂χA(v))(20)
(
D
M
)
i
i
=
∑
j
=
1
n
(
W
M
)
i
j
(D_M)_{ii}=\sum_{j=1}^n(W_M)_{ij}
(DM)ii=∑j=1n(WM)ij,
L
M
=
D
M
−
W
M
L_M=D_M-W_M
LM=DM−WM,最后
L
M
=
D
M
−
1
/
2
L
M
D
M
−
1
/
2
=
I
−
D
M
−
1
/
2
W
M
D
M
−
1
/
2
\mathcal{L}_M=D_M^{-1/2}L_MD_M^{-1/2}=I-D_M^{-1/2}W_MD_M^{-1/2}
LM=DM−1/2LMDM−1/2=I−DM−1/2WMDM−1/2,使用特征向量
L
M
\mathcal{L}_M
LM进行分类。
引理1.令
G
=
(
V
,
E
)
G=(V,E)
G=(V,E)为一个无权重有向图,
G
M
G_M
GM是基于模块的有权重图,
∣
A
∣
≥
2
|\mathcal{A}|\geq 2
∣A∣≥2则对于任意
S
⊂
V
S\subset V
S⊂V,有
v
o
l
M
(
G
)
(
S
)
=
1
∣
A
∣
−
1
v
o
l
(
G
M
)
(
S
)
vol_M^{(G)}(S)=\frac{1}{|\mathcal{A}|-1}vol^{(G_M)}(S)
volM(G)(S)=∣A∣−11vol(GM)(S)
引理2.令
x
i
,
x
j
,
x
k
∈
{
−
1
,
1
}
x_i,x_j,x_k\in\{-1,1\}
xi,xj,xk∈{−1,1},则
4
⋅
1
(
x
i
,
x
j
x
k
n
o
t
a
l
l
t
h
e
s
a
m
e
)
=
x
i
2
+
x
j
2
+
x
k
2
−
x
i
x
j
−
x
j
x
k
−
x
k
x
i
4\cdot1(x_i,x_jx_k\ not\ all\ the\ same)=x_i^2+x_j^2+x_k^2-x_ix_j-x_jx_k-x_kx_i
4⋅1(xi,xjxk not all the same)=xi2+xj2+xk2−xixj−xjxk−xkxi
引理3.令
z
∈
{
0
,
1
}
n
z\in\{0,1\}^n
z∈{0,1}n,若
z
i
=
1
z_i=1
zi=1则
x
i
=
1
x_i=1
xi=1,若
z
i
=
1
z_i=1
zi=1则
x
i
=
−
1
x_i=-1
xi=−1,则对于
L
=
D
−
W
L=D-W
L=D−W,有
4
z
T
L
z
=
x
T
L
x
4z^TLz=x^TLx
4zTLz=xTLx。
引理4.令
G
=
(
V
,
E
)
G=(V,E)
G=(V,E)有向无权重图,
G
M
G_M
GM是基于
∣
A
∣
=
3
|\mathcal{A}|=3
∣A∣=3的模块的有权重图,对于
S
⊂
V
S\subset V
S⊂V,有
c
u
t
M
(
G
)
(
S
,
S
‾
)
=
1
2
c
u
t
(
G
M
)
(
S
,
S
‾
)
cut^{(G)}_M(S,\overline{S})=\frac{1}{2}cut^{(G_M)}(S,\overline{S})
cutM(G)(S,S)=21cut(GM)(S,S)
定理5.令
G
=
(
V
,
E
)
G=(V,E)
G=(V,E)有向无权重图,
W
M
W_M
WM有权重邻接矩阵,且模块
∣
A
∣
=
3
|\mathcal{A}|=3
∣A∣=3,则对于
S
⊂
V
S\subset V
S⊂V,有
ϕ
M
(
G
)
(
S
)
=
ϕ
(
G
M
)
(
S
)
\phi^{(G)}_M(S)=\phi^{(G_M)}(S)
ϕM(G)(S)=ϕ(GM)(S)
定理6.假设使用算法1找到较好的集合
S
S
S,令
ϕ
∗
=
m
i
n
S
′
ϕ
M
(
G
)
(
S
′
)
\phi_*=min_{S'}\phi^{(G)}_M(S')
ϕ∗=minS′ϕM(G)(S′)为最优集合,则
1.
ϕ
M
(
G
)
≤
4
ϕ
∗
a
n
d
\phi^{(G)}_M\leq 4\sqrt{\phi^*}\ and
ϕM(G)≤4ϕ∗ and
2.
ϕ
∗
≥
λ
2
/
2
\phi^*\geq\lambda_2/2
ϕ∗≥λ2/2
引理7.令
x
i
,
x
j
x
k
,
x
l
∈
{
−
1
,
1
}
x_i,x_jx_k,x_l\in\{-1,1\}
xi,xjxk,xl∈{−1,1},则有
(21)
8
⋅
1
(
x
i
,
x
j
,
x
k
,
x
l
n
o
t
a
l
l
t
h
e
s
a
m
e
)
=
(
7
−
x
i
x
j
−
x
i
x
k
−
x
i
x
l
−
x
j
x
k
−
x
j
x
l
−
x
k
x
l
−
x
i
x
j
x
k
x
l
)
8\cdot1(x_i,x_j,x_k,x_l\ not\ all\ the\ same)=(7-x_ix_j-x_ix_k-x_ix_l-x_jx_k-x_jx_l-x_kx_l-x_ix_jx_kx_l)\tag{21}
8⋅1(xi,xj,xk,xl not all the same)=(7−xixj−xixk−xixl−xjxk−xjxl−xkxl−xixjxkxl)(21)
引理8.令
G
=
(
V
,
E
)
G=(V,E)
G=(V,E)有向无权重图,
G
M
G_M
GM基于
∣
A
∣
=
4
|\mathcal{A}|=4
∣A∣=4模块有权重图,则对于
S
⊂
V
S\subset V
S⊂V,有
c
u
t
M
(
G
)
(
S
,
S
‾
)
=
1
3
c
u
t
(
G
M
)
(
S
,
S
‾
)
−
∑
(
v
,
{
i
,
j
,
k
,
l
}
∈
M
)
1
3
⋅
1
(
e
x
a
c
t
l
y
t
w
o
o
f
i
,
j
,
k
,
l
i
n
S
)
cut^{(G)}_M(S,\overline{S})=\frac{1}{3}cut^{(G_M)}(S,\overline{S})-\sum_{(v,\{i,j,k,l\}\in M)}\frac{1}{3}\cdot1(exactly\ two\ of\ i,j,k,l\ in\ S)
cutM(G)(S,S)=31cut(GM)(S,S)−(v,{i,j,k,l}∈M)∑31⋅1(exactly two of i,j,k,l in S)
定理9.令
G
=
(
V
,
E
)
G=(V,E)
G=(V,E)有向无权重图,
W
M
W_M
WM基于
∣
A
∣
=
4
|\mathcal{A}|=4
∣A∣=4模块邻接矩阵,则对于
S
⊂
V
S\subset V
S⊂V,有
ϕ
M
(
G
)
(
S
)
=
ϕ
(
G
M
)
(
S
)
−
∑
(
v
,
{
i
,
j
,
k
,
l
}
)
∈
M
1
(
e
x
a
c
t
l
y
t
w
o
o
f
i
,
j
,
k
,
l
i
n
S
)
v
o
l
(
G
M
)
(
S
)
\phi^{(G)}_M(S)=\phi^{(G_M)}(S)-\frac{\sum_{(v,\{i,j,k,l\})\in M}1(exactly\ two\ of\ i,j,k,l\ in\ S)}{vol^{(G_M)}(S)}
ϕM(G)(S)=ϕ(GM)(S)−vol(GM)(S)∑(v,{i,j,k,l})∈M1(exactly two of i,j,k,l in S)
Matlab代码
% function [S,Sbar,conductances]=MotifSpectralPartitionM6(A)
% Spectral partitioning for motif M_6
B = spones(A & A');%bidirectional links
U = A - B ; %unidirectional links
% Form motif adjacency matrix for motif M_6.
% For different motifs , replace this line with another matrix for mulation.
W = (B * U') .* U' + (U * B) .* U + (U' * U) .* B;
% Compute eigen vector of motif normalized Laplacian
Dsqrt = full(sum(W,2));
Dsqrt(Dsqrt ~= 0) = 1 ./ sqrt(Dsqrt(Dsqrt ~= 0));
[I , J , V] = find(W);
Ln = sparse(I , J , -V .* (Dsqrt(I) .* Dsqrt(J)) , size(A, 1) , size(A, 2));
[Z , lambdas] = eigs(Ln , 2 , 'sa');
% Matlab's eigs is sometimes out of order
[~ , eig_order] = sort(diag(lambdas));
% y = Dsqrt .* Z(: , eig_order(end));
y = Z(: , eig_order(end));
% Linear time sweep procedure
[~ , order] = sort(y);
C = W(order , order);
C_sums = full(sum(C , 2));
volumes = cumsum(C_sums);
volumes_other = full(sum(sum(W))) * ones(length(order) , 1) - volumes;
conductances = cumsum (C_sums - 2 * sum(tril(C) , 2)) ./ min (volumes , volumes_other);
[~ , split] = min(conductances);
S = order(1 : split);
Sbar = order((split + 1) : end);
% function [S,Sbar,conductances]=MotifSpectralPartitionM4(A)
% Spectral partitioning for motif M_4
W=(A*A).*A;
% Compute eigen vector of motif normalized Laplacian
Dsqrt = full(sum(W,2));
Dsqrt(Dsqrt ~= 0) = 1 ./ sqrt(Dsqrt(Dsqrt ~= 0));
[I , J , V] = find(W);
Ln = sparse(I , J , -V .* (Dsqrt(I) .* Dsqrt(J)) , size(A, 1) , size(A, 2));
[Z , lambdas] = eigs(Ln , 2 , 'sa');
% Matlab's eigs is sometimes out of order
[~ , eig_order] = sort(diag(lambdas));
y = Dsqrt .* Z(: , eig_order(end));
% Linear time sweep procedure
[~ , order] = sort(y);
C = W(order , order);
C_sums = full(sum(C , 2));
volumes = cumsum(C_sums);
volumes_other = full(sum(sum(W))) * ones(length(order) , 1) - volumes;
conductances = cumsum (C_sums - 2 * sum(tril(C) , 2)) ./ min (volumes , volumes_other);
[~ , split] = min(conductances);
S = order(1 : split);
Sbar = order((split + 1) : end);