Transfer Learning on Graphs (Part 2)

晒太阳的喵喵


Article link: https://blog.csdn.net/weixin_40239306/article/details/108745268


Domain Adaptive Classification on Heterogeneous Information Networks (MuSDAC notes)
Network Together: Node Classification via Cross-Network Deep Network Embedding (CDNE notes)
Adversarial Deep Network Embedding for Cross-network node classification (ACDNE notes)

Domain Adaptive Classification on Heterogeneous Information Networks (MuSDAC notes)

Definitions

Heterogeneous Information Network (HIN):
$\mathcal{G}=(\mathcal{V}, \varepsilon)$
where $\mathcal{V}=\cup_{i=1}^{n} \mathcal{V}_i$ contains $n$ node types and $\varepsilon=\cup_{i=1}^{m} \varepsilon_i$ contains $m$ edge types.
A meta-path $\Phi_i$ in the network is a path of the form
$\mathcal{V}_{i_1} \xrightarrow{\varepsilon_{i_1}} \mathcal{V}_{i_2} \xrightarrow{\varepsilon_{i_2}} \dots \xrightarrow{\varepsilon_{i_l}} \mathcal{V}_{i_{(l+1)}}$
which defines a composite relation between the two node types $\mathcal{V}_{i_1}$ and $\mathcal{V}_{i_{(l+1)}}$.

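Multi-channel network:
Let $\mathcal{V}_{i_1}$ be the node type that is to be classified. A meta-path set $\Phi= \{ \Phi_1,\dots,\Phi_N \}$ decomposes the HIN into a multi-channel network, where each channel is a homogeneous network made of the $\mathcal{V}_{i_1}$ nodes connected through one particular meta-path. The resulting network is defined as
$G=\{ (\mathcal{V}_l, A_l) \vert l=1,\dots, N \}$
where $A_l$ is the meta-path adjacency matrix of channel $l$: each entry counts the meta-path instances that connect the corresponding pair of $\mathcal{V}_1$ nodes in $\mathcal{V}_l$.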
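As a small illustration of a meta-path adjacency matrix: for a symmetric two-hop meta-path such as paper → author → paper, the number of meta-path instances between two papers can be counted by multiplying the paper-author incidence matrix by its transpose. The node types, the toy incidence matrix, and the helper names below are assumptions made for this sketch and are not taken from the paper.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a dense matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


# Hypothetical incidence matrix: rows are 3 papers, columns are 2 authors.
paper_author = [
    [1, 0],  # paper 0 is written by author 0
    [1, 1],  # paper 1 is written by authors 0 and 1
    [0, 1],  # paper 2 is written by author 1
]

# meta_path_adj[i][j] = number of paper-author-paper paths from paper i to paper j.
meta_path_adj = matmul(paper_author, transpose(paper_author))
```

With this convention the diagonal entry of a paper equals its number of authors, so an implementation may choose to zero the diagonal before using the matrix as a channel adjacency.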

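Basic idea of MuSDAC

Problem addressed: domain adaptation (DA) on heterogeneous networks.
Given two HINs, $(\mathcal{G}_S, \mathcal{X}_S)$ and $(\mathcal{G}_T, \mathcal{X}_T)$, where $\mathcal{X}$ is the feature matrix of the $\mathcal{V}_1$ nodes and $\mathcal{G}_S$ and $\mathcal{G}_T$ share the same node and edge types, the goal of transferable classification on HINs is to predict the labels of $\mathcal{V}_{T,1}$ using the structural information of both networks and the labels of $\mathcal{V}_{S,1}$.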

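Challenges of domain adaptation on HINs:

An HIN carries multiple semantics, each of which needs domain alignment; aligning all of them within a single embedding space is very hard.
Domain similarity and distinguishability must be balanced: domain-invariant features tend to be homogeneous and uninformative for classification, whereas features that are indicative for classification usually vary across domains.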

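Solution: Multi-space Domain Adaptive Classification (MuSDAC) is proposed for the DA problem on HINs:

Multi-channel shared-weight GCNs project the nodes of the HIN into multiple spaces, which are aligned pairwise, so that the semantics of each space are preserved separately.
A two-level selection strategy aggregates the embedding spaces effectively to ensure both domain similarity and distinguishability.
The first level uses a heuristic combination sampling algorithm to select distinguishable channel combinations efficiently, which reduces the need to search over all space combinations.
The second level uses a moving-averaged weighted voting scheme that fuses the channels selected at the first level with weights, so as to minimize the transfer and classification losses.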

MuSDAC framework

The MuSDAC framework given in the paper is as follows.
[Figure: MuSDAC framework]

Multi-channel shared-weight GCN. Using the meta-path set $\Phi$, $\mathcal{G}_S$ and $\mathcal{G}_T$ are decomposed into multi-channel networks and fed into $N$ independent GCNs, which produce the original channels. The result is the set of original channel embeddings $C=\{ C_l \vert l=1,\dots, N \}$:
$C_l= \hat{A}_l \sigma(\hat{A}_l \mathcal{X} W_l^{(0)} ) W_l^{(1)}$
where $\hat{A}_l=\tilde{D}_l^{-\frac{1}{2}} \tilde{A}_l \tilde{D}_l^{-\frac{1}{2}}$. Within channel $l$, a shared parameter set $\{ W_{l}^{(0)}, W_{l}^{(1)}\}$ is used.
Multi-space alignment.

A heuristic combination sampling algorithm (Algorithm 1) generates index combinations with distinguishable features, $Z=\{ Z_j \vert j=1,\dots, M \}, Z_j \subseteq \{1,\dots,N\}, M=O(N)$.
The features to combine are picked according to each index combination: $C_{Z_j}=\{ C_l \vert l \in Z_j \}$.
A one-dimensional convolution combines $C_{Z_j}$ into the aggregated channel embedding set $\mathcal{M}=\{ \mathcal{M}_{Z_j} \vert j=1,\dots, M \}$, where each $\mathcal{M}_{Z_j}$ is a single embedding matrix.
In the $j$-th aggregated channel $Z_j$, $\mathcal{M}_{Z_j,S}$ and $\mathcal{M}_{Z_j,T}$ denote the embeddings of the source and target instances and are fed to the classifier:
$\hat{y}_j=\mathrm{softmax}(\mathcal{M}_{Z_j} W_j^C)$
where $W_{j}^{C}$ are the classifier parameters of the $j$-th channel.
The final prediction is a weighted vote over all classifier outputs:
$\hat{y}=\sum_{j} \theta_{j} \hat{y}_j$
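The per-channel propagation rule $C_l= \hat{A}_l \sigma(\hat{A}_l \mathcal{X} W_l^{(0)} ) W_l^{(1)}$ above can be sketched in plain Python. This is only a minimal sketch under stated assumptions: $\tilde{A}$ is taken to be $A + I$ (the usual GCN convention, which the notes do not spell out), $\sigma$ is taken to be ReLU, and the function names and toy values are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def normalize_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} (assumes A~ = A + I, non-negative A)."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]  # always >= 1 because of the self-loops
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def relu(m):
    """Element-wise ReLU, used here as the assumed activation sigma."""
    return [[max(0.0, v) for v in row] for row in m]


def gcn_channel(adj, x, w0, w1):
    """Two-layer GCN output for one channel: A_hat * relu(A_hat * X * W0) * W1."""
    a_hat = normalize_adjacency(adj)
    hidden = relu(matmul(matmul(a_hat, x), w0))
    return matmul(matmul(a_hat, hidden), w1)


# Toy channel: two connected nodes, one input feature, scalar weights.
embedding = gcn_channel(adj=[[0, 1], [1, 0]], x=[[1.0], [3.0]],
                        w0=[[1.0]], w1=[[2.0]])
```

Calling `gcn_channel` with the same `w0` and `w1` on the source and on the target adjacency of a channel is one way to read the weight sharing described in the text.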

Overall loss function.
According to DAC theory, reducing the classification error on $\mathcal{M}_{Z_j,S}$ together with the distance between $\mathcal{M}_{Z_j, S}$ and $\mathcal{M}_{Z_j, T}$ reduces the prediction error on the target labels in the $j$-th channel. The loss of the $j$-th channel $Z_j$ is therefore
$L_{Z_j, D}= CE( \hat{y}_{j,S} , y_S)$
$L_{Z_j, T}= MMD(\mathcal{M}_{Z_j, S}, \mathcal{M}_{Z_j, T})$
$L_{Z_j}= L_{Z_j, D} + \gamma L_{Z_j, T}$
The overall loss is the weighted sum of the DAC losses of the aggregated channels:
$L=\sum_{j} \theta_j L_{Z_j}$

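Model details

Selecting the index combinations: the heuristic combination sampling algorithm.
In the first iteration, among the combinations of two channels, the $N - w$ combinations with the smallest loss are kept as the basis for the next iteration.
In the second iteration, among the combinations of three channels, the $N - w$ combinations with the smallest loss are kept as the basis for the next iteration. To reduce the search space, every selected three-channel combination must contain a two-channel combination selected in the first step.
$\cdots$
Selecting $\theta$: the moving-average strategy.
(1) Compute $\tilde{\theta}$ from the loss values:
$\beta_j=- \eta L_{Z_j}$
$\tilde{\theta}_j = \frac{\exp \beta_j}{\sum_{i} \exp \beta_{i} }$
(2) Update $\theta$:
$\theta = \alpha \theta + (1-\alpha) \tilde{\theta} , 0< \alpha <1$
The initial value is $\theta_j=1/M$.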
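The moving-average update of $\theta$ and the weighted vote it feeds can be sketched as follows. The loss values, $\eta$, and $\alpha$ in the example are made up for illustration, and the function names are hypothetical; subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math


def softmax_weights(losses, eta):
    """theta_tilde_j = exp(-eta * L_j) / sum_i exp(-eta * L_i)."""
    betas = [-eta * loss for loss in losses]
    shift = max(betas)  # subtract the max for numerical stability
    exps = [math.exp(b - shift) for b in betas]
    total = sum(exps)
    return [e / total for e in exps]


def update_theta(theta, losses, eta, alpha):
    """theta <- alpha * theta + (1 - alpha) * theta_tilde, with 0 < alpha < 1."""
    theta_tilde = softmax_weights(losses, eta)
    return [alpha * t + (1 - alpha) * tt for t, tt in zip(theta, theta_tilde)]


def weighted_vote(theta, channel_probs):
    """y_hat = sum_j theta_j * y_hat_j for one node.

    channel_probs[j] is the class-probability list predicted by channel j.
    """
    n_classes = len(channel_probs[0])
    return [sum(t * probs[c] for t, probs in zip(theta, channel_probs))
            for c in range(n_classes)]


# Hypothetical values: M = 3 aggregated channels, uniform initial weights 1/M.
theta = [1 / 3] * 3
theta = update_theta(theta, losses=[0.2, 1.0, 2.5], eta=1.0, alpha=0.9)
```

Because both the old $\theta$ and $\tilde{\theta}$ sum to one, the updated weights still sum to one, and channels with a smaller loss receive a larger weight.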

Network Together: Node Classification via Cross-Network Deep Network Embedding (CDNE notes)

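Basic idea of CDNE

Network structure is used to capture the proximity between nodes: strongly connected nodes are mapped to more similar latent vector representations. SAE_s and SAE_t reconstruct the structural proximity matrices of the source network and the target network, respectively.
Node attributes and labels are used to capture the proximity between nodes of different networks: nodes with the same label get aligned latent vector representations.
SAE_s is trained on the source network and maps nodes of the same class to nearby positions, which helps the subsequent label classification task.
Once SAE_s converges or reaches the maximum number of iterations, it is frozen. The latent representations of the source network learned by SAE_s then serve as part of the input for training SAE_t. The goal of SAE_t is to learn network-invariant node representations, so that nodes sharing a label with source-network nodes obtain similar latent representations.
Finally, CDNE yields label-discriminative and network-invariant node representations.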

Notation

[Figure: notation table]
$\mathcal{G}^{s}=(V^s,E^s,Y^s)$ is the fully labeled source network: $V^s$ is the set of labeled nodes, $E^s$ is the set of edges, and $Y^s \in R^{n^s \times C}$ is the label matrix. $Y^s_{ic}=1$ means that node $v_i^s$ is associated with label $c$, and $Y^s_{ic}=0$ means that it is not; a node may have multiple labels.
$\mathcal{G}^{t}=(V^t,E^t,Y^t)$ is the insufficiently labeled target network, with $V^t=\{ V_{L}^{t}, V_{U}^{t} \}$, where $V_{L}^{t}$ is a small set of labeled nodes and $V_{U}^{t}$ is a large set of unlabeled nodes.

CDNE framework

[Figure: CDNE framework]

SAE_s on the source network.
Preserving source-network structural proximities.
Given the PPMI matrix $X^s \in R^{n^s \times n^s}$ of the source network (a matrix built from the network topology; see Part B) as input, an $L$-layer SAE_s is defined as:
$H^{s(l)} =f (H^{s(l-1)}(W_1^{s(l)}) ^{T} +B_1^{s(l)} ), l=1,\dots, L$
$\hat{H}^{s(l-1)} =f (\hat{H}^{s(l)}(W_2^{s(l)}) ^{T} +B_2^{s(l)} ), l=1,\dots, L$
These are the encoding and decoding steps of SAE_s, respectively. $H^{s(0)}=X^s$, $H^{s(l)} \in R^{n^s \times d(l)}$ is the matrix representation learned at layer $l$, and $H^{s(l)}_i\in R^{1\times d(l)}$, the $i$-th row of $H^{s(l)}$, is the representation of node $v_i^s$. The paper uses the sigmoid activation $f(x)=1/(1+e^{-x})$.
To cope with network sparsity, a penalty matrix $P^{s(l)}$ is introduced into the reconstruction error:
$\mathcal{R}^{s(l)}=\frac{1}{2n^s} \Vert P^{s(l)} \odot ( \hat{H}^{s(l-1)} - H^{s(l-1)} ) \Vert_F^2$
where $P_{ij}^{s(l)}=\beta >1$ if $H_{ij}^{s(l-1)}>0$, and $P_{ij}^{s(l)}=1$ if $H_{ij}^{s(l-1)}=0$.
A pairwise constraint makes strongly connected nodes have more similar latent vector representations:
$C^{s(l)}=\frac{1}{2n^s} \sum_{i=1}^{n^s} \sum_{j=1}^{n^s} X_{ij}^{s} \Vert H_{i}^{s(l)}- H_{j}^{s(l)} \Vert ^2.$
Label-discriminative representation.
The matrix $O^{s} \in R^{n^s \times n^s}$ encodes whether two nodes share labels: $O_{ij}^{s}=-1$ if $v_i^{s}$ and $v_{j}^{s}$ have no label in common; otherwise $O_{ij}^{s} \geq 1$ is the number of labels that $v_i^s$ and $v_j^s$ share. The label-discriminative pairwise constraint makes nodes with common labels have similar vector representations:
$L^{s(l)}=\frac{1}{2n^s} \sum_{i=1}^{n^s} \sum_{j=1}^{n^s} O_{ij}^{s} \Vert H_{i}^{s(l)}- H_{j}^{s(l)} \Vert ^2.$
$L_2$-norm regularization prevents overfitting:
$\Omega ^{s(l)}=(\Vert W_1^{s(l)} \Vert_F+\Vert W_2^{s(l)} \Vert_F)/2.$
Overall loss function:
$J^{s(l)}=\mathcal{R}^{s(l)} + \alpha^{s(l)} C^{s(l)} + \varphi^{s(l)} L^{s(l)} +\lambda^{s(l)} \Omega^{s(l)}$
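The three data-dependent terms of the SAE_s loss above (penalized reconstruction error, structural pairwise constraint, and label-discriminative pairwise constraint) can be sketched with plain lists. This is an illustrative sketch, not the authors' implementation; the function names and the toy inputs are hypothetical. The same `pairwise_constraint` function serves both $C^{s(l)}$ (weights $X^s$) and $L^{s(l)}$ (weights $O^s$).

```python
def penalized_reconstruction(h, h_hat, beta):
    """R = 1/(2n) * ||P ⊙ (H_hat - H)||_F^2, with P_ij = beta if H_ij > 0 else 1."""
    n = len(h)
    total = 0.0
    for row, row_hat in zip(h, h_hat):
        for v, v_hat in zip(row, row_hat):
            penalty = beta if v > 0 else 1.0
            total += (penalty * (v_hat - v)) ** 2
    return total / (2 * n)


def pairwise_constraint(weights, emb):
    """1/(2n) * sum_ij weights[i][j] * ||emb_i - emb_j||^2."""
    n = len(emb)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(emb[i], emb[j]))
            total += weights[i][j] * sq_dist
    return total / (2 * n)


def label_matrix(y):
    """O_ij = number of labels shared by nodes i and j, or -1 if they share none."""
    n = len(y)
    o = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(a * b for a, b in zip(y[i], y[j]))
            o[i][j] = shared if shared > 0 else -1
    return o


# Hypothetical binary label matrix for 3 nodes and 2 labels.
o_s = label_matrix([[1, 0], [1, 1], [0, 1]])
```

Note that the negative entries of $O^s$ reward pushing apart nodes without a common label, so $L^{s(l)}$ can be negative.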
SAE_t on the target network.
Preserving target-network structural proximities.
As in SAE_s, an encoder-decoder takes the PPMI matrix as input and produces the matrix representation of each layer, with an analogous reconstruction error and pairwise constraint so that strongly connected nodes have more similar latent vectors.
Network-invariant representations.
In domain adaptation, MMD is a widely used non-parametric measure of the distance between the distributions of two domains.
First, the empirical marginal MMD between the source and target networks is defined as:
$\mathcal{M}_{M}^{t(l)} =\frac{1}{2} \Vert \frac{1}{n^s} 1^s H^{s(l)} - \frac{1}{n^t} 1^t H^{t(l)} \Vert^2$
where $1^s \in R^{1 \times n^s}$ and $1^t \in R^{1 \times n^t}$ are all-ones row vectors. Minimizing this term matches the marginal distribution of the node representations learned from the target network to that of the source network.
Next, the class-conditional MMD between the source and target networks is defined as:
$\mathcal{M}_{C}^{t(l)} = \sum_{c=1}^{C} \frac{1}{2} \left\Vert \frac{\sum_{i=1}^{n^t} \hat{Y}_{ic}^{t} H_{i}^{t(l)}}{\sum_{i=1}^{n^t} \hat{Y}_{ic}^{t}} - \frac{\sum_{j=1}^{n^s} Y_{jc}^{s} H_{j}^{s(l)}}{\sum_{j=1}^{n^s} Y_{jc}^{s}} \right\Vert^2$
Minimizing this term pulls the latent representations of each class in the source and target networks close together.
Here $\hat{Y}^{t}$ is the predicted label matrix of the target network:
$\hat{Y}_{ic}^{t}=\begin{cases} Y_{ic}^{t} \in \{0,1\} & v_i^t \in V_{L}^{t}\\ \text{predicted probability that } v_i^t \text{ is labeled with category } c & v_i^t \in V_{U}^{t}\end{cases}$
Likewise, $L_2$-norm regularization is added to prevent overfitting, and the overall loss function of layer $l$ of SAE_t is:
$J^{t(l)}=\mathcal{R}^{t(l)} + \alpha^{t(l)} C^{t(l)} + \mu \mathcal{M}_{M}^{t(l)} + \gamma^{t(l)} \mathcal{M}_{C}^{t(l)}+\lambda^{t(l)} \Omega^{t(l)}$
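The marginal and class-conditional MMD terms above both compare mean embeddings, unweighted in the first case and weighted by (predicted) class membership in the second. A minimal sketch follows; the function names are hypothetical, and skipping a class that has no mass in one of the networks is an assumption made here to avoid a division by zero, not something stated in the notes.

```python
def mean_embedding(emb, weights):
    """Weighted mean of the rows of emb; weights[i] is the weight of row i."""
    total_w = sum(weights)
    dim = len(emb[0])
    return [sum(w * row[k] for w, row in zip(weights, emb)) / total_w
            for k in range(dim)]


def half_sq_dist(u, v):
    """0.5 * ||u - v||^2 for two vectors of equal length."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(u, v))


def marginal_mmd(h_s, h_t):
    """Half squared distance between the mean source and mean target embedding."""
    return half_sq_dist(mean_embedding(h_s, [1.0] * len(h_s)),
                        mean_embedding(h_t, [1.0] * len(h_t)))


def conditional_mmd(h_s, y_s, h_t, y_t_hat):
    """Sum over classes of the half squared distance between class-wise means.

    y_s holds binary source labels; y_t_hat holds observed labels or predicted
    probabilities for the target nodes.
    """
    total = 0.0
    for c in range(len(y_s[0])):
        w_s = [row[c] for row in y_s]
        w_t = [row[c] for row in y_t_hat]
        if sum(w_s) == 0 or sum(w_t) == 0:
            continue  # assumption: skip classes with no mass in one network
        total += half_sq_dist(mean_embedding(h_t, w_t), mean_embedding(h_s, w_s))
    return total


# Toy one-dimensional embeddings for two source and two target nodes.
m_marginal = marginal_mmd([[0.0], [2.0]], [[1.0], [3.0]])
m_conditional = conditional_mmd([[0.0], [2.0]], [[1, 0], [0, 1]],
                                [[1.0], [3.0]], [[1.0, 0.0], [0.0, 1.0]])
```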
[Figure: the final optimization algorithm]

Experiments

Dataset 1. Blog1 and Blog2 are two disjoint subnetworks extracted from the BlogCatalog data set [https://github.com/xhuang31/LANE]. A node is a blogger and an edge is a friendship between two bloggers. The attributes of a node are keywords extracted from the blogger's self-description, and its label is the blogger's interest group.
Dataset 2. Citationv1, DBLPv7, and ACMv9 are three citation networks extracted from the ArnetMiner data set [https://www.aminer.cn/citation].
In the experiments, CDNE outperforms the compared methods and is also more robust to noise.

Adversarial Deep Network Embedding for Cross-network node classification (ACDNE notes)

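Basic idea

Two feature extractors learn node representations from each node's own attributes and from its neighbors' attributes, respectively, so that both attributed affinity and topological proximity are preserved.
A node classifier makes the node representations label-discriminative for the downstream task.
An adversarial domain adaptation technique makes the node representations network-invariant.

Problem with prior work: although attributed network embedding algorithms can capture the proximities between nodes in different networks, none of them accounts for the domain discrepancy between networks.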

Notation

[Figure: notation table]

Framework

The framework given in the paper is shown in the figure.
[Figure: ACDNE framework]

Deep network embedding
Consists of two feature extractors, one concatenation layer, and one pairwise constraint.
- FE1.
Input: each node's own attributes.
Hidden layers: $l_f$ hidden layers,
$h_{f_1}^{(k)} (x_i) = ReLU ( h_{f_1}^{(k-1)}(x_i) W_{f_1}^{(k)} +b_{f_1}^{(k)} ), 1\leq k \leq l_f$
where $h_{f_1}^{(0)} (x_i) =x_i \in R^{1 \times w}$ is the attribute vector of $v_i$,
$x_{ik}$ is the $k$-th attribute value of $v_i$,
$h_{f_1}^{(k)}(x_i) \in R^{1 \times f(k)}$ is the representation of $v_i$ learned by the $k$-th hidden layer of FE1,
and $W_{f_1}^{(k)}$ and $b_{f_1}^{(k)}$ are the trainable weight and bias parameters.

- FE2.
Input: the neighbors' attributes.
Hidden layers: $l_f$ hidden layers,
$h_{f_2}^{(k)} (n_i) = ReLU ( h_{f_2}^{(k-1)}(n_i) W_{f_2}^{(k)} +b_{f_2}^{(k)} ), 1\leq k \leq l_f$
where $h_{f_2}^{(0)} (n_i) =n_i \in R^{1 \times w}$ is the neighbor attribute vector of $v_i$, obtained by aggregating the attributes of its neighbors:
$n_{ik}= \sum_{j=1,j\neq i}^{n} \frac{a_{ij}}{\sum_{g=1, g\neq i}^{n} a_{ig}} x_{jk}$
Here $a_{ij}$ is the topological proximity between $v_i$ and $v_j$, measured with the PPMI metric over $K$ steps. A larger $a_{ij}$ means that $v_i$ and $v_j$ are closer; $a_{ij}=0$ means that $v_j$ is not a neighbor of $v_i$ within $K$ steps.
- Concatenation layer.
$e_i=ReLU([h_{f_1}^{(l_f)}(x_i), h_{f_2}^{(l_f)}(n_i)] W_c + b_c)$
$e_i \in R^{1 \times d}$ is the final node representation learned by ACDNE.
- Pairwise constraint.
$L_p=\frac{1}{n^s} \sum_{v_i,v_j \in V^s} a_{ij} \Vert e_i-e_j \Vert^2 + \frac{1}{n^t} \sum_{v_i,v_j \in V^t} a_{ij} \Vert e_i-e_j \Vert^2$
This term preserves the topological proximity between nodes: minimizing $L_p$ gives more strongly connected nodes more similar representations.
Trainable parameters: $\theta_{e}=\{ \{ W_{f_1}^{(k)}, b_{f_1}^{(k)}, W_{f_2}^{(k)}, b_{f_2}^{(k)}\}_{k=1}^{l_f}, W_c, b_c \}$
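The neighbor attribute vector $n_i$ that FE2 takes as input is a proximity-weighted average of the other nodes' attribute vectors. A minimal sketch is given below; the function name and the toy proximity matrix are hypothetical, and giving a node with no neighbors an all-zero vector is an assumption made here to avoid a division by zero.

```python
def aggregate_neighbor_attributes(proximity, attributes):
    """n_i = sum_{j != i} (a_ij / sum_{g != i} a_ig) * x_j for every node i.

    proximity[i][j] plays the role of a_ij and attributes[j] the role of x_j.
    """
    n = len(attributes)
    width = len(attributes[0])
    aggregated = []
    for i in range(n):
        norm = sum(proximity[i][g] for g in range(n) if g != i)
        row = [0.0] * width
        if norm > 0:  # assumption: isolated nodes keep an all-zero vector
            for j in range(n):
                if j != i:
                    weight = proximity[i][j] / norm
                    for k in range(width):
                        row[k] += weight * attributes[j][k]
        aggregated.append(row)
    return aggregated


# Toy example: node 0 is three times closer to node 2 than to node 1.
neighbor_attrs = aggregate_neighbor_attributes(
    proximity=[[0, 1, 3], [1, 0, 0], [3, 0, 0]],
    attributes=[[1.0, 0.0], [0.0, 4.0], [4.0, 0.0]],
)
```

The diagonal of the proximity matrix is ignored, which matches the $j \neq i$ and $g \neq i$ conditions in the formula.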

Node classifier
A node classifier is added on top of the deep network embedding module:
$\hat{y}_i = \phi(e_i W_y + b_y)$
For multi-class problems $\phi(\cdot)$ is the softmax function; for multi-label problems it is the sigmoid function.
Trainable parameters: $\theta_y=\{ W_y, b_y \}$.
Domain discriminator
Basic idea: adversarial domain adaptation makes the learned node representations network-invariant.
Architecture: the node representation is fed into the domain discriminator, which predicts which network the node comes from:
$h_d^{(k)}(e_i)=ReLU(h_d^{(k-1)} (e_i) W_d^{(k)} +b_d^{(k)} ), 1\leq k \leq l_d$
$\hat{d}_i=softmax(h_{d}^{(l_d)} (e_i) W_d^{(l_d+1)} + b_d^{(l_d+1)} )$
where $h_d^{(0)}(e_i)=e_i$ and $h_d^{(k)}(e_i) \in R^{1 \times d(k)}$.
Trainable parameters: $\theta_d=\{ W_d^{(k)}, b_d^{(k)} \}_{k=1}^{l_d+1}$
Loss function:
$L_d= -\frac{1}{n^s+n^t} \sum_{v_i \in \{ V^s \cup V^t \} } \left[ (1-d_i) \log (1-\hat{d}_i) + d_i \log (\hat{d}_i) \right]$
where $d_i$ is the ground-truth domain label of $v_i$,
$d_i=\begin{cases} 1 & v_i \in V^t\\ 0 & v_i \in V^s \end{cases}$
and $\hat{d}_i$ is the predicted probability that $v_i$ comes from the target network.
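The discriminator loss $L_d$ above is a binary cross-entropy averaged over the nodes of both networks. A minimal sketch, with a hypothetical function name and a small clamping constant `eps` added here as an assumption to avoid taking the logarithm of zero:

```python
import math


def domain_loss(domain_labels, target_probs, eps=1e-12):
    """Mean binary cross-entropy between domain labels and predicted probabilities.

    domain_labels[i] is 1 for a target-network node and 0 for a source-network
    node; target_probs[i] is the predicted probability of coming from the target.
    """
    total = 0.0
    for d, p in zip(domain_labels, target_probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithms finite
        total += (1 - d) * math.log(1 - p) + d * math.log(p)
    return -total / len(domain_labels)


# A discriminator that cannot tell the networks apart predicts 0.5 everywhere.
confused = domain_loss([0, 1], [0.5, 0.5])
confident = domain_loss([0, 1], [0.1, 0.9])
```

A network-invariant embedding drives the discriminator toward the `confused` case, where the loss equals $\log 2$.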
Joint training
$\min_{\theta_e,\theta_y} \{ L_y + p L_p +\lambda \max_{\theta_d} \{ -L_d \} \}$
Here $L_y$ denotes the node classification loss. To update the embedding module and the domain discriminator simultaneously, a Gradient Reversal Layer (GRL) is inserted between them. During backpropagation the GRL reverses the sign of the partial derivative of $L_d$ with respect to $\theta_e$ and multiplies it by $\lambda$.
ACDNE can be optimized with SGD, where $\mu$ is the learning rate:
$\theta_e = \theta_e- \mu (\frac{\partial L_y}{\partial \theta_e}+p \frac{\partial L_p}{\partial \theta_e}- \lambda \frac{\partial L_d}{\partial \theta_e})$
$\theta_y= \theta_y- \mu \frac{\partial L_y}{\partial \theta_y}$
$\theta_d = \theta_d- \mu \frac{\partial L_d}{\partial \theta_d}$
The training procedure is given in Algorithm 1 of the paper.
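The effect of the gradient reversal on the SGD updates can be seen in a deliberately tiny toy model. Everything here is an assumption made for illustration: the embedding module is a single scalar $\theta_e$ with $e = \theta_e x$, the discriminator is a single scalar $\theta_d$ with $\hat{d} = \mathrm{sigmoid}(\theta_d e)$, and the $L_y$ and $L_p$ terms are left out so that only the adversarial part of the update remains.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def discriminator_loss(theta_e, theta_d, x, d):
    """Binary cross-entropy of the toy discriminator for one node."""
    p = sigmoid(theta_d * theta_e * x)
    return -((1 - d) * math.log(1 - p) + d * math.log(p))


def adversarial_step(theta_e, theta_d, x, d, mu, lam):
    """One SGD step on L_d only (L_y and L_p are omitted in this toy)."""
    e = theta_e * x
    p = sigmoid(theta_d * e)
    grad_theta_d = (p - d) * e            # dL_d / d theta_d
    grad_theta_e = (p - d) * theta_d * x  # dL_d / d theta_e
    # GRL: identity in the forward pass; on the way back the gradient is
    # multiplied by -lam, so the embedding parameter ascends L_d.
    new_theta_e = theta_e - mu * (-lam * grad_theta_e)
    # The discriminator itself performs ordinary gradient descent on L_d.
    new_theta_d = theta_d - mu * grad_theta_d
    return new_theta_e, new_theta_d


# One target-network node (d = 1) with hypothetical starting values.
new_e, new_d = adversarial_step(theta_e=1.0, theta_d=1.0, x=1.0, d=1,
                                mu=0.1, lam=1.0)
```

After the step, changing only $\theta_e$ raises the discriminator loss while changing only $\theta_d$ lowers it, which is the min-max behavior of the joint objective.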

Experiments

Datasets:

Two disjoint social networks, Blog1 and Blog2.
Node: a person.
Edge: friendship.
Attribute: keywords of the self-description.
Label: the group the person joined.
Citation networks.
