Deepshare (深度之眼) Paper Reading Notes, GNN.04: metapath2vec

oldmao_2000

Article link: https://blog.csdn.net/oldmao_2001/article/details/108314624

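Preface

This course is from Deepshare (深度之眼); some screenshots come from the course videos.

Paper title: metapath2vec: Scalable Representation Learning for Heterogeneous Networks
Authors: Yuxiao Dong et al.
Affiliation: Microsoft Research
Venue and year: KDD 2017 (KDD tends to favor experiment-heavy work)
Formula input reference: online LaTeX formula editor
Subsampling: http://d0evi1.com/word2vec-subsampling/

Unlike the previous three papers, this one studies heterogeneous graphs. Homogeneous-graph algorithms ignore node types, which leads to two shortcomings:
1. They are biased toward node types that occur frequently;
2. They are biased toward nodes with concentrated connections (that is, high-degree nodes).
As a result, the remaining nodes are under-trained, much like the problems caused by class imbalance in ordinary datasets.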

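Paper structure

Abstract: points out that representation learning algorithms designed for conventional graphs do not transfer well to heterogeneous graphs with multiple node and edge types, and introduces the proposed algorithm metapath2vec (the "++" variant is the enhanced version).
Introduction: earlier algorithms such as DeepWalk, LINE, and node2vec mostly target homogeneous networks; the section motivates the wide applicability of heterogeneous graphs with multiple node and edge types, along with their difficulties and challenges.
Related Work: network embedding algorithms based on adjacency-matrix factorization are computationally expensive (unless dimensionality is reduced) and perform poorly; the section also compares homogeneous-graph algorithms such as DeepWalk, LINE, and node2vec.
Problem Definition: defines network representation learning on heterogeneous graphs and the notion of a meta-path (this is the commonly accepted definition for heterogeneous graphs).
metapath2vec: the model part of the paper, covering the skip-gram algorithm on heterogeneous graphs and the meta-path-based random walk.
metapath2vec++: the negative sampling algorithm on heterogeneous graphs and the complete framework of the heterogeneous graph learning algorithm.
Dataset and Baselines: uses the AMiner and DBIS datasets, with DeepWalk, LINE, PTE, adjacency-matrix factorization, and others as baselines.
Effectiveness: experiments on model effectiveness: node classification, node clustering, node similarity, visualization, and a discussion of parameters.
Conclusion: summarizes the proposed neural network framework for heterogeneous graphs.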

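Background knowledge

NMI (Normalized Mutual Information): an evaluation metric commonly used in clustering to measure how close two clusterings are; the larger the value, the better. The figure below gives a concrete example.
[Figure: three clusters containing x, o, and ◇ points]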
Purity as an external evaluation criterion for cluster quality. Majority class and number of members of the majority class for the three clusters are:
x, 5(cluster 1);
o, 4(cluster 2);
◇, 3(cluster 3).
Purity is (1/17)×(5+4+3)= 0.71.

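Suppose the ground truth is the crosses, circles, and diamonds in the figure, encoded as 1, 2, and 3. The ground-truth labels of the points in the three clusters above are:
B=[121111 122223 11333]
Suppose our predicted result is:
A=[111111 222222 33333]
Question: how different is the prediction from the ground truth? If the two are nearly identical, NMI should be close to 1; if the prediction is very poor, NMI should approach 0.

[Baidu Baike] Mutual information is a useful measure in information theory. It can be seen as the amount of information one random variable contains about another, or as the reduction in uncertainty of one random variable given knowledge of another.

Computing NMI first requires the MI (mutual information) value:
$I(X;Y)=\sum_{y\in Y}\sum_{x\in X}p(x,y)\text{log}\left(\cfrac{p(x,y)}{p(x)p(y)}\right)$
$p (x, y)$ is the joint probability of $x$ and $y$. For example, $x = 1, y = 1$ occurs 5 times across A and B, and the total number of samples is 17:
$B=[\underline12\underline{1111}\quad 122223\quad 11333]\\ A=[\underline1 1\underline{1111}\quad 222222\quad 33333]$
Therefore:
$p(1,1)=5/17,p(1,2)=1/17,p(1,3)=0;\\ p(2,1)=1/17,p(2,2)=4/17,p(2,3)=1/17;\\ p(3,1)=2/17,p(3,2)=0,p(3,3)=3/17;$
The denominator terms $p (x)$ and $p (y)$ are the marginal probabilities of $x$ and $y$. Since $x$ and $y$ come from the distributions of A and B respectively, $p (x)$ and $p (y)$ may differ even when $x = y$.
For $p (x)$ :
$p (1) = 6 / 17, p (2) = 6 / 17, p (3) = 5 / 17$
For $p (y)$ :
$p (1) = 8 / 17, p (2) = 5 / 17, p (3) = 4 / 17$
MI is then normalized:
$NMI(X,Y)=\cfrac{2I(X;Y)}{H(X)+H(Y)}\tag1$
In this formula $H (X), H (Y)$ are the entropies of $X, Y$ :
$H(X)=-\sum_{i=1}^{|X|}P(i)\text{log}(P(i))\\ H(Y)=-\sum_{j=1}^{|Y|}P'(j)\text{log}(P'(j))$
For the example above, the entropies are computed as follows:
$H(X)=-\left[P(1)\text{log}_2(P(1))+P(2)\text{log}_2(P(2))+P(3)\text{log}_2(P(3))\right]\\ H(Y)=-\left[P'(1)\text{log}_2(P'(1))+P'(2)\text{log}_2(P'(2))+P'(3)\text{log}_2(P'(3))\right]$
Substituting these results into Equation 1 gives the NMI value.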
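
To make the arithmetic concrete, here is a minimal sketch that computes purity and NMI for the label vectors A and B from the example. The function names (`entropy`, `mutual_information`, `nmi`, `purity`) are illustrative choices, not from the course material, and base-2 logarithms are assumed throughout.

```python
import math
from collections import Counter

# Predicted clusters (A) and ground-truth classes (B) from the example.
A = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
B = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]


def entropy(labels):
    """Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def mutual_information(xs, ys):
    """I(X;Y) in bits, estimated from paired labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Pairs that never co-occur contribute 0, so only observed pairs are summed.
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def nmi(xs, ys):
    """Equation 1: 2 * I(X;Y) / (H(X) + H(Y))."""
    return 2 * mutual_information(xs, ys) / (entropy(xs) + entropy(ys))


def purity(clusters, classes):
    """Fraction of points that belong to the majority class of their cluster."""
    majority = Counter()
    for (k, _), cnt in Counter(zip(clusters, classes)).items():
        majority[k] = max(majority[k], cnt)
    return sum(majority.values()) / len(clusters)


print("purity:", round(purity(A, B), 2))
print("NMI:", round(nmi(A, B), 2))
```

Because the same logarithm base is used in the numerator and the denominator, the NMI value does not depend on the base chosen.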

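Research background

Network representation learning: the transition from traditional feature engineering to deep-learning-based algorithms
[Figure]
node attribute inference: predicting node attributes
community detection: clustering nodes into communities
similarity search: finding similar nodes
link prediction: predicting whether two nodes are related
social recommendation: recommendation in social networks
This paper follows the random walk (red box) + skip-gram (blue box) framework.

The paper studies heterogeneous graphs (a pioneering deep-learning-style work in this area). Unlike knowledge graphs, which have very many node and edge types, the heterogeneous graphs studied here have only a handful of node types; the figure below, for example, has time, location, and name.
[Figure]
In the figures below, the left one has three node types: author, paper, and venue. The right one has two node types.

In general, graphs come in the following kinds (a homogeneous graph is the simple case):

Graph type	Node types	Edge types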
HOMOGENEOUS	1	1
BIPARTITE	2	1
K-PARTITE	k	k-1
SIGNED	1	2
LABELED	k	l
STAR	k	k-1

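Object of study

An academic network: there are four node types (note that the first letter of each type is used in meta-paths), and several edge types. For example, org-author edges are employment relations, the orange dashed lines between authors are coauthor relations, the red arrows between papers are (directed) citation relations, paper-venue edges mean "published at", and author-paper edges mean "wrote".
On the right are the meta-paths, which are defined by hand and encode the information we care about.
[Figure]

Note the semantics of a meta-path. For example, APA means two authors coauthored a paper.

A supplementary screenshot from a Baidu AI Studio course:
[Figure]
Meta-paths are defined to carry meaning: the first, for example, represents two different authors publishing the same paper; the second represents different authors whose different papers appear at the same venue.
Choosing meta-paths requires some prior knowledge.
Meta-paths are usually symmetric: as long as the first and last node types match, the walk can continue. A walk ends under either of two conditions: no next node of the required type can be found, or the maximum walk length is reached.
[Figure]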

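Model framework

For a homogeneous graph, nodes are represented directly with a single k-dimensional vector.
[Figure]
For a heterogeneous graph, the paper splits this vector by node type and represents each type separately (the $k_V,k_A,k_O,k_P$ in the figure below).

Below is the algorithm description of metapath2vec++:
The input is $G = (V, E, T)$ , a heterogeneous graph, where $T$ denotes the node and edge types.
$\mathcal{P}$ is the user-defined meta-path scheme (such as APA in the example above).
The algorithm consists of three procedures: initialization, MetaPathRandomWalk, and HeterogeneousSkipGram.

MP is the sequence generated for each node.
[Figure]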

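Research significance

Extends the random walk + skip-gram framework to heterogeneous graphs, addressing how to define a node's context across multiple node types so as to produce a good training corpus
The random walk on heterogeneous graphs captures semantic and structural relations between nodes of different types
An early work on heterogeneous graph learning that broadened graph representation learning to more kinds of networks
Cited 500+ times
A representative early work on heterogeneous network learning and a classic baseline on heterogeneous graphs
Broadened the kinds of graphs studied with deep learning, bringing deep learning models into a wider range of graph learning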

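Research results

Multi-label Classification (node classification)
The percentage on the horizontal axis is the fraction of data used for training.
Venue node classification results:
[Figure]
Author node classification results:

Hyperparameter sensitivity experiments
w is the number of walks per node (600 works best)
l is the walk length (60 works best)
Together they mean that w sequences of length l are generated.
Embedding dimension d (384 works best)

[Figure]
Comparison of NMI scores:

Visualization results:
The figure below shows venue nodes after t-SNE dimensionality reduction; nodes of the same type share a color.

Below, PCA is used for dimensionality reduction to compare several baselines:

Venue node similarity ranking, with more related venues higher:
The figure below shows how performance improves as the available resources (threads) vary; the speedup is good, essentially linear.
[Figure]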

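Skimming the paper

Core of the abstract

1. Emphasizes that previous models target homogeneous networks and cannot adequately express the diversity of nodes and edges
2. Designs a random walk algorithm based on meta-paths on heterogeneous graphs
3. Learns heterogeneous graph representations with the skip-gram framework and negative sampling
4. Validates the model on the AMiner and DBIS datasets through node classification, clustering, and similarity tasks

Paper section headings

Introduction
Problem Definition
The metapath2vec framework
3.1 Homogeneous Network Embedding
3.2 Heterogeneous Network Embedding: metapath2vec
3.3 metapath2vec++
Experiments
4.1 Experimental Setup
4.2 Multi-Class Classification
4.3 Node Clustering
4.4 Case Study: Similarity Search
4.5 Case Study: Visualization
4.6 Scalability
Related Work
Conclusion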

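Close reading of the paper

metapath2vec in detail

[Figure]
Challenges:
1. multiple types of nodes and links: there are different kinds of nodes and edges
2. difficult to directly apply homogeneous network embedding methods: representing the graph as a homogeneous network loses part of the information
3. node-neighborhood concept
4. structures and semantics

Problem definition

Definition 2.1. A Heterogeneous Network is defined as a graph $G = (V, E, T)$ in which each node $v$ and each link $e$ are associated with their mapping functions $ϕ(v) : V → T_V$ and $φ(e) : E → T_E$ , respectively. $T_V$ and $T_E$ denote the sets of object and relation types, where $|T_V| + |T_E| > 2$ .
Compared with a homogeneous graph, the definition of a heterogeneous graph has an extra $T$. In the definition this amounts to two functions that map nodes and edges to their type sets. For example: $ϕ(\text{Peking University}) = Org$ , $ϕ(\text{Zhang San}) = Author$
Because the graph is heterogeneous, the constraint requires the total number of node and edge types to be greater than 2.
The definition of heterogeneous graph representation is the same as that of homogeneous graph representation.
Problem 1. Heterogeneous Network Representation Learning: Given a heterogeneous network $G$ , the task is to learn the $d$ -dimensional latent representations $X \in \mathbb{R}^{|V| \times d}, d \ll |V|$ that are able to capture the structural and semantic relations among them.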

Detail 1: Heterogeneous skip-gram

First came the application of word2vec to homogeneous graphs: DeepWalk and node2vec
$\text{arg}\underset{\theta}{\text{max}}\prod_{v\in V}\prod_{c\in N(v)}p(c|v;\theta)$
Goal: find parameters $\theta$ that maximize the probability of the context (that is, the neighbors) $c$ around $v$
The paper extends this to heterogeneous networks, where nodes come in several types, which gives the heterogeneous skip-gram model:
$\text{arg}\underset{\theta}{\text{max}}\prod_{v\in V}\prod_{t\in T_V}\prod_{c_t\in N_t(v)}p(c_t|v;\theta)$
The only change is the extra product in the middle, $\prod_{t\in T_V}$ , where $T_V$ is the set of node types. For example: the neighborhood of one author node $a_4$ can be structurally close to other authors (e.g., $a_2$ , $a_3$ & $a_5$ ), venues (e.g., ACL & KDD), organizations (CMU & MIT), as well as papers (e.g., $p_2$ & $p_3$ ). (Including MIT here is not quite rigorous in the original.)
[Figure]

The usual trick for a product is to take the log and turn it into a sum:
$\text{arg}\underset{\theta}{\text{max}}\sum_{v\in V}\sum_{t\in T_V}\sum_{c_t\in N_t(v)}\text{log}p(c_t|v;\theta)$
The probability is computed with a softmax:
$p(c_t|v;\theta)=\cfrac{e^{X_{c_t}\cdot X_v}}{\sum_{u\in V}e^{X_{u}\cdot X_v}}$
Note that the formula above does not distinguish node $u$ by type; metapath2vec++ addresses this (covered in Detail 3).
The sum in the denominator runs over all nodes, which is too expensive. Following the skip-gram optimization, the paper introduces negative sampling (see the LINE paper for the derivation):
$\text{log}\sigma(X_{c_t}\cdot X_v)+\sum_{m=1}^ME_{u^m\sim P(u)}[\text{log}\sigma(-X_{u^m}\cdot X_v)],\text{ where }\sigma(x)=\cfrac{1}{1+e^{-x}}$
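
As an illustration of the negative-sampling objective, the sketch below evaluates it for one (center, context) pair with plain Python lists. The embeddings are random placeholders, the expectation is replaced by a sum over M already-sampled negatives, and the helper names are assumptions made for this example only.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def neg_sampling_objective(x_v, x_c, negatives):
    """log sigma(X_c . X_v) + sum over sampled negatives of log sigma(-X_u . X_v)."""
    positive = math.log(sigmoid(dot(x_c, x_v)))
    negative = sum(math.log(sigmoid(-dot(x_u, x_v))) for x_u in negatives)
    return positive + negative


def random_vector(rng, dim):
    return [rng.uniform(-0.5, 0.5) for _ in range(dim)]


rng = random.Random(0)
dim, M = 8, 5  # hypothetical embedding size and number of negatives
x_v, x_c = random_vector(rng, dim), random_vector(rng, dim)
negatives = [random_vector(rng, dim) for _ in range(M)]

# The objective is maximized during training; it is always negative
# because every term is the log of a probability.
print(neg_sampling_objective(x_v, x_c, negatives))
```

Each pair costs M + 1 dot products instead of one dot product per node in the graph, which is where the saving over the full softmax comes from.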

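Detail 2: Meta-Path-Based Random Walks

DeepWalk/node2vec: random walks that ignore node types.
Meta-path property: symmetry (meta-paths are usually symmetric)
[Figure]
Definition in detail:
1. Given a meta-path scheme: this is the abstract form of the figure above, going from one node to the next, except that consecutive nodes may have different types.

2. The transition probability at step $i$ is defined as follows (the transition probability is the core of the meta-path walk).
In the formula below, $i$ is the $i$ -th time step and the subscript $t$ is the node type:
$p(v^{i+1}|v_t^i,\mathcal{P})=\begin{cases} \cfrac{1}{|N_{t+1}(v_t^i)|} & (v^{i+1},v_t^i)\in E,\ \phi(v^{i+1})=t+1 \\ 0 & (v^{i+1},v_t^i)\in E,\ \phi(v^{i+1})\ne t+1 \\ 0 & (v^{i+1},v_t^i)\notin E \end{cases}$
There are three cases. The third is the simplest: the candidate next node is not a neighbor of the current node (no edge connects them), so the probability is 0.
Second case: the candidate is a neighbor (connected by an edge, $(v^{i+1},v_t^i)\in E$ ), but its type does not match the next type prescribed by the meta-path (see the definition of $\phi$ above), so the probability is 0.
First case: the candidate is a neighbor (connected by an edge, $(v^{i+1},v_t^i)\in E$ ) and its type matches the type prescribed by the meta-path. The transition probability is then $1/|N_{t+1}(v_t^i)|$ , one over the number of neighbors of the current node that have the prescribed next type, so all such neighbors are equally likely.

The instructor seems to have misspoken here by describing the denominator as the total number of nodes of the next type; it is the number of next-type nodes adjacent to the current node.
For example, there may be 3 paper nodes in total, but only 2 of them, $p_1,p_2$ , are adjacent to author node $a_2$ ; the transition probability is then 1/2, not 1/3.

3. Recursive guidance for random walkers, i.e., because the meta-path is symmetric:
$p(v^{i+1}|v_t^i)=p(v^{i+1}|v_1^i),\text{ if }t=l$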
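
The three cases of the transition probability and the recursive guidance can be sketched on a toy graph. Everything below (the node names, the adjacency list, and the functions `transition_probs` and `metapath_walk`) is a hypothetical illustration, not the paper's or DGL's implementation; it mirrors the example where $a_2$ is adjacent to only two of the three papers.

```python
import random

# Hypothetical toy graph: node -> type, plus an undirected adjacency list.
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P", "p3": "P", "v1": "V"}
adj = {
    "a1": ["p1", "p3"],
    "a2": ["p1", "p2"],
    "p1": ["a1", "a2", "v1"],
    "p2": ["a2", "v1"],
    "p3": ["a1", "v1"],
    "v1": ["p1", "p2", "p3"],
}


def transition_probs(current, next_type):
    """Uniform distribution over neighbors of `current` whose type is `next_type`.

    Non-neighbors and neighbors of another type get probability 0 (they are absent).
    """
    candidates = [u for u in adj[current] if node_type[u] == next_type]
    return {u: 1.0 / len(candidates) for u in candidates}


def metapath_walk(start, scheme, walk_length, rng):
    """Walk guided by a symmetric scheme such as "APVPA" (first type == last type)."""
    assert scheme[0] == scheme[-1] and node_type[start] == scheme[0]
    cycle = scheme[:-1]  # "APVP": after the last type the scheme wraps to the first
    walk = [start]
    while len(walk) < walk_length:
        next_type = cycle[len(walk) % len(cycle)]
        probs = transition_probs(walk[-1], next_type)
        if not probs:  # no neighbor of the required type: stop early
            break
        nodes = list(probs)
        walk.append(rng.choices(nodes, weights=[probs[u] for u in nodes])[0])
    return walk


print(transition_probs("a2", "P"))  # only p1 and p2 are adjacent to a2
print(metapath_walk("a1", "APVPA", 9, random.Random(0)))
```

The two stopping conditions in the loop correspond to the ones listed earlier: reaching the maximum walk length, or finding no neighbor of the required type.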

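Detail 3: softmax and negative sampling

softmax in metapath2vec
$p(c_t|v;\theta)=\cfrac{e^{X_{c_t}\cdot X_v}}{\sum_{u\in V}e^{X_{u}\cdot X_v}}\tag2$
softmax in metapath2vec++
$p(c_t|v;\theta)=\cfrac{e^{X_{c_t}\cdot X_v}}{\sum_{u_t\in V_t}e^{X_{u_t}\cdot X_v}}\tag3$

The original metapath2vec respects node types during the random walk (it must follow the meta-path), but ignores them during negative sampling (negatives can be nodes of any type, unconstrained by the meta-path). This is not entirely reasonable.
metapath2vec++ therefore improves on this: the softmax is computed not over all nodes $V$ but over the nodes of one type, $V_t$ . In addition, the probabilities in Equation 2 sum to 1 over all nodes, whereas those in Equation 3 do not: with 4 node types they sum to 4, since each type sums to 1.
The negative sampling formula is given above and is not repeated here.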
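
The difference between Equation 2 and Equation 3 is only the set the denominator ranges over. The sketch below contrasts the two on made-up 2-dimensional embeddings; the values, node names, and function names are assumptions for illustration.

```python
import math

# Hypothetical embeddings and node types, for illustration only.
emb = {
    "a1": [0.2, 0.1], "a2": [0.4, -0.3],
    "p1": [0.1, 0.5], "p2": [-0.2, 0.3],
    "v1": [0.3, 0.3],
}
node_type = {"a1": "A", "a2": "A", "p1": "P", "p2": "P", "v1": "V"}


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def softmax_all(center):
    """Equation 2 (metapath2vec): normalize over every node in V."""
    scores = {u: math.exp(dot(emb[u], emb[center])) for u in emb}
    z = sum(scores.values())
    return {u: s / z for u, s in scores.items()}


def softmax_by_type(center):
    """Equation 3 (metapath2vec++): normalize separately within each type V_t."""
    scores = {u: math.exp(dot(emb[u], emb[center])) for u in emb}
    z = {}
    for u, s in scores.items():
        z[node_type[u]] = z.get(node_type[u], 0.0) + s
    return {u: s / z[node_type[u]] for u, s in scores.items()}


print(sum(softmax_all("a1").values()))      # one distribution over all nodes
print(sum(softmax_by_type("a1").values()))  # one distribution per node type
```

With three node types in this toy setup, the type-specific probabilities add up to 3 in total, which matches the remark that four types would add up to 4.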

metapath2vec
[Figure]

metapath2vec++
[Figure]

Detail 4: heterogeneous skip-gram node representation

The caption of Figure 2 in the paper explains this clearly: the output is modeled separately per node type (red boxes). Note that the size of each type-specific output equals the number of nodes of that type.
[Figure]

The heterogeneous skip-gram used in metapath2vec++. Instead of one set of multinomial distributions for all types of neighborhood nodes in the output layer, it specifies one set of multinomial distributions for each type of nodes in $a_4$ ’s neighborhood. $V_t$ denotes one specific $t$ -type nodes and $V = V_V ∪V_A ∪V_O ∪V_P$ . $k_t$ specifies the size of a particular type of one’s neighborhood and $k = k_V + k_A + k_O + k_P$ .

Experimental results and analysis

[Figure]

Datasets

Two datasets: AMiner and DBIS, each with 3 node types
For example:
AMiner: $|V_A|=9,323,739,|V_P|=3,194,405,|V_V|=3,883$
Baselines: DeepWalk, node2vec, LINE, PTE, Spectral Clustering, Graph Factorization
Tasks: node classification (Multi-Class Classification), node clustering (Node Clustering), similarity (Case Study: Similarity Search), visualization (Case Study: Visualization), and parameter and scalability experiments (Scalability)

Experimental parameters

(1) The number of walks per node w: 1000;
(2) The walk length l: 100;
(3) The vector dimension d: 128(LINE: 128 for each order);
(4) The neighborhood size k: 7;
(5) The size of negative samples: 5.
meta-path schemes in heterogeneous academic networks are APA and APVPA

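Multi-label Classification (node classification)

The paper first explains how the class labels of the nodes in the datasets were obtained. For example, venues are matched to Google Scholar categories: 1. Computational Linguistics, 2. Computer Graphics, 3. Computer Networks & Wireless Communication, 4. Computer Vision & Pattern Recognition, 5. Computing Systems, 6. Databases & Information Systems, 7. Human Computer Interaction, and 8. Theoretical Computer Science
Venue node classification results:
[Figure]
Notably, when little training data is available, the algorithm outperforms the baselines by an especially wide margin.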

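Node clustering

NMI is the metric; the larger the better.
Clusters are obtained by running k-means on the learned node embeddings.
[Figure]

Case Study: Similarity Search

21 query nodes are selected, and cosine similarity is used to rank the results.
As the table shows, each node is most similar to itself, so the first row is the query venue itself.
Venues of the same tier are also ranked together, which suggests the embeddings work well.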
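
A similarity search of this kind can be sketched as a cosine-similarity ranking over the learned vectors. The 2-dimensional embeddings below are invented purely for illustration and do not come from the paper or from any trained model; `most_similar` is a hypothetical helper name.

```python
import math


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings, top_k=3):
    """Rank all nodes by cosine similarity to `query`, most similar first."""
    ranked = sorted(embeddings,
                    key=lambda name: cosine(embeddings[query], embeddings[name]),
                    reverse=True)
    return ranked[:top_k]


# Made-up venue vectors: two data-mining venues and two NLP venues.
venue_emb = {
    "KDD": [1.0, 0.1],
    "ICDM": [0.9, 0.2],
    "ACL": [0.1, 1.0],
    "EMNLP": [0.2, 0.9],
}

print(most_similar("KDD", venue_emb))
```

The query is included in the candidates on purpose, so it appears first in its own ranking, just like the first row of the table discussed above.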

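Paper summary

Key points
Understanding heterogeneous graphs
The concept of a meta-path
The form of the loss function
One set of multinomial distributions per node type in skip-gram

Innovations
Meta-path-based random walks
Modified softmax
Modified negative sampling
Experimental validation on heterogeneous graphs

Takeaways
How to understand heterogeneous graphs and define multiple types of nodes and edges
A representative early work on heterogeneous graph representation learning, with a high citation count
The classic random walk + skip-gram framework
Algorithm design: the classic homogeneous-graph framework is adapted to heterogeneous graphs by modifying the loss function, the softmax, and negative sampling
A new problem at the time, with research driven by the type of graph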

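Code reproduction

[Figure]

DGL

The reproduction uses DGL: https://github.com/dmlc/dgl
Read the installation instructions first; there are builds for different versions as well as CPU and GPU.
[Figure]
After downloading and unpacking:

Many familiar names appear...
Run sampler.py first, which:
1. generates the meta-path walks
2. builds the graph
Then run metapath2vec.py to train the network and produce embeddings.
For details, see: https://github.com/dmlc/dgl/tree/master/examples/pytorch/metapath2vec

Dataset

The dataset is a subset that the instructor sampled from net_dbis, since the full data is too much for a single machine.
There are five files: three node files and two edge files.
Nodes:
id_author.txt: id and author, names prefixed with a
id_conf.txt: id and venue, names prefixed with v
paper.txt: id and paper title
Edges:
paper_author.txt
paper_conf.txt
Following the DGL instructions above, sampler.py generates a 1.32 GB meta-path file: output_path_origin.txt
That file is too large to train on here, so the instructor prepared a meta-path subset: output_path.txt

An excerpt is shown below; the meta-path alternates v-a-v-a…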
vVLDBJ. aKian-LeeTan vDBISP2P aHectorGarcia-Molina vVLDB aEdmondLau vVLDB aDaveLiles vVLDB aG.C.H.Sharman vVLDB aAlexandrosNtoulas vVLDB aJiríZlatuska vVLDB aMasatoshiYoshikawa vIRAL aAitaoChen vTREC aDavidHawking vTREC aTerenceClifton vECIR aDanielHeesch vJCDL aAkiraMaeda vICADL aMohanJohnBlooma vICADL aHyeonjaeCheon vICADL aMohanJohnBlooma vICADL aBillPlummer vICADL aDionHoe-LianGoh vJCDL aBillKules vJCDL aJae-wookAhn vWWW aMartinHalvey vWWW aHerbertRistock vWWW aCongleiYao vWWW aWei-YingMa vWISE aYanboHan vSWDB aIriniFundulaki vWebDB aDimitriosTsoumakos vInfoscale aMatthiasFischmann vInfoscale aYuqingSong vCIVR aMichalHaindl vCIVR aQibinSun vWWW aHaiZhuge vWWW aKimViljanen vWWW aBernadetteBouchon-Meunier vFQAS aMaríaAmparoVilaMiranda vJASIS aPaulB.Kantor vJASIS aJessicaL.Milstead vInf.Process.Manage. aMassimoMelucci vDS-7 aMassimoMelucci vSEBD aSergioFlesca vICDE aShashiShekhar vICDM aNeilDunstan vICDM aHans-PeterKriegel vKnowl.Inf.Syst. aKrishnamoorthySivakumar vKnowl.Inf.Syst. aQianhuiAltheaLiang vInt.J.WebServiceRes. aSaraCorfini vInt.J.WebServiceRes. aBu-SungLee vInt.J.WebServiceRes. aWeiHan vSSDBM aDanielP.Miranker vSIGMODConference aFrançoisBry vWebDB aAlexDekhtyar vJCDL aRayR.Larson vSIGIR aPatriceBellot vSIGIR aHangCui vWWW aDengCai vIEEETrans.Knowl.DataEng. aNobuhisaUeda vTKDD aJiaweiHan vVLDB aVibhorRastogi vVLDB aBengChinOoi vDASFAA aAlanL.Tharp vICDE aJayantR.Haritsa vICDE aMarjorieTempleton vICDE aHaixunWang vICDE aJohnRiedl vACMTrans.Inf.Syst. 
aPieroFraternali vARTDB aThomasJ.Marlowe vARTDB aIgorR.Viguier vARTDB aKwei-JayLin vSIGMODRecord aMarkA.Roth vSIGMODRecord aJeanT.Anderson vSIGMODRecord aZoéLacroix vDIWeb aJoeWigglesworth vCASCON aUtpalAmin vCASCON aMichaelA.Bauer vCASCON aDerekRayside vCASCON aDavidJ.Taylor vSymposiumonReliabilityinDistributedSoftwareandDatabaseSystems aF.K.Ng vSymposiumonReliabilityinDistributedSoftwareandDatabaseSystems aJ.EliotB.Moss vPOS aHans-JörgSchek vDS-4 aSurajitChaudhuri vACMSIGMODDigitalReview aOuriWolfson vPDIS aVeraChoi vPDIS aCarlaSchlatterEllis vPODS aDoronRotem vICDE aAvigdorGal vIEEETrans.Knowl.DataEng. aBrianR.Gaines vJASIS aQuentinL.Burrell vJASIST aZoranaErcegovac vInf.Process.Manage. aStephenI.Gallant vTREC aSamScott vTREC aKenneyNg vTREC aNicolasMasson vTREC aPaulMcNamee vCLEF aPaoloRosso vCLEF aDanielFerrés vCLEF aGarethJ.F.Jones vCLEF aSyandraSari vCLEF aJohannesLeveling vLWA aMaartenvanSomeren vLWA aThomasGünther vLWA

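Then learn the embeddings:

python metapath2vec.py --path net_dbis/output_path.txt --output_file "result.txt"
Open the output to take a look:
[Figure]
The first line says there are 68 nodes and the embedding dimension is 128.
The lines after it are the embeddings.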
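
Given that layout (a header line with the node count and dimension, then one node per line), reading the result back can be sketched as follows. The parser `load_embeddings` and the three-dimensional sample are assumptions for illustration; the real file has 128 values per line.

```python
import io


def load_embeddings(lines):
    """Parse 'count dim' followed by 'name v1 v2 ... v_dim' lines."""
    it = iter(lines)
    count, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.split()
        if not parts:  # tolerate blank lines
            continue
        name, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"{name}: expected {dim} values, got {len(values)}")
        vectors[name] = values
    return count, dim, vectors


# Tiny stand-in for result.txt (made-up numbers, dimension 3 instead of 128).
sample = io.StringIO("2 3\nvKDD 0.1 0.2 0.3\naJiaweiHan -0.1 0.0 0.4\n")
count, dim, vectors = load_embeddings(sample)
print(count, dim, sorted(vectors))
```

For the real output, passing `open("result.txt", encoding="ISO-8859-1")` in place of the `StringIO` object should work, assuming the names contain no whitespace, as in the walk file shown earlier.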

Algorithm modules and details

sampler.py

import numpy as np
import random
import time
import tqdm
import dgl
import sys
import os

# number of walks sampled per node
num_walks_per_node = 1000
# length of each walk
walk_length = 100
# argument: path to the data directory
path = sys.argv[1]


# Build the graph; returns four objects
def construct_graph():
    # node data
    paper_ids = []
    paper_names = []
    author_ids = []
    author_names = []
    conf_ids = []
    conf_names = []

    # join paths and read the three node files
    f_3 = open(os.path.join(path, "id_author.txt"), encoding="ISO-8859-1")
    f_4 = open(os.path.join(path, "id_conf.txt"), encoding="ISO-8859-1")
    f_5 = open(os.path.join(path, "paper.txt"), encoding="ISO-8859-1")
    while True:
        z = f_3.readline()
        if not z:  # read until end of file
            break
        z = z.strip().split()  # strip surrounding whitespace (including the newline), then split
        identity = int(z[0])  # the id
        author_ids.append(identity)  # store the id
        author_names.append(z[1])  # store the name
    while True:
        w = f_4.readline()
        if not w:
            break
        w = w.strip().split()
        identity = int(w[0])
        conf_ids.append(identity)
        conf_names.append(w[1])
    while True:
        v = f_5.readline()
        if not v:
            break
        v = v.strip().split()
        identity = int(v[0])
        paper_name = 'p' + ''.join(v[1:])
        paper_ids.append(identity)
        paper_names.append(paper_name)
    f_3.close()
    f_4.close()
    f_5.close()

    # remap the original ids to consecutive values starting from 0
    author_ids_invmap = {x: i for i, x in enumerate(author_ids)}
    conf_ids_invmap = {x: i for i, x in enumerate(conf_ids)}
    paper_ids_invmap = {x: i for i, x in enumerate(paper_ids)}

    paper_author_src = []
    paper_author_dst = []
    paper_conf_src = []
    paper_conf_dst = []

    # edge data
    f_1 = open(os.path.join(path, "paper_author.txt"), "r")
    f_2 = open(os.path.join(path, "paper_conf.txt"), "r")
    for x in f_1:
        x = x.split('\t')
        x[0] = int(x[0])
        x[1] = int(x[1].strip('\n'))
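        # The next two lines store the source and destination of each edge, so the two lists have equal length.
        # Because ids were remapped above, each stored id is also the index of the node's name in the name lists.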
        paper_author_src.append(paper_ids_invmap[x[0]])
        paper_author_dst.append(author_ids_invmap[x[1]])
    for y in f_2:
        y = y.split('\t')
        y[0] = int(y[0])
        y[1] = int(y[1].strip('\n'))
        paper_conf_src.append(paper_ids_invmap[y[0]])
        paper_conf_dst.append(conf_ids_invmap[y[1]])
    f_1.close()
    f_2.close()

    # build the heterogeneous graph from a dictionary
    # key: ('paper', 'pa', 'author'), value: (paper_author_src, paper_author_dst), where 'pa' is the edge type, and so on
    # see the official DGL documentation for details
    hg = dgl.heterograph({
        ('paper', 'pa', 'author'): (paper_author_src, paper_author_dst),
        ('author', 'ap', 'paper'): (paper_author_dst, paper_author_src),
        ('paper', 'pc', 'conf'): (paper_conf_src, paper_conf_dst),
        ('conf', 'cp', 'paper'): (paper_conf_dst, paper_conf_src)})
    return hg, author_names, conf_names, paper_names


# "conference - paper - Author - paper - conference" metapath sampling
# generate walks that follow the meta-path above
def generate_metapath():
    # join the path and the file name output_path.txt to get the output location
    output_path = open(os.path.join(path, "output_path.txt"), "w")
    count = 0

    # four objects are returned here:
    # hg is short for heterogeneous graph;
    # the other three are lists of names, indexed by node id.
    hg, author_names, conf_names, paper_names = construct_graph()

    # key step: generate walks along conference - paper - Author - paper - conference
    # walks start from conferences, so iterate over every 'conf' index (id)
    for conf_idx in tqdm.trange(hg.number_of_nodes('conf')):
        # traces holds one row per walk; each row is a sequence of node ids
        traces, _ = dgl.sampling.random_walk(
            hg, [conf_idx] * num_walks_per_node, metapath=['cp', 'pa', 'ap', 'pc'] * walk_length)

        # post-process each sequence and drop the papers (earlier work did the same, so this follows suit)
        for tr in traces:
            outline = ' '.join(
                (conf_names if i % 4 == 0 else author_names)[tr[i]]  # at even positions, i % 4 == 0 is a conf, otherwise an author
                for i in range(0, len(tr), 2))  # skip paper
            print(outline, file=output_path)
    output_path.close()


if __name__ == "__main__":
    generate_metapath()

reading_data.py

import numpy as np
import torch
from torch.utils.data import Dataset
from download import AminerDataset

np.random.seed(12345)


class DataReader:
    NEGATIVE_TABLE_SIZE = 1e8

    def __init__(self, dataset, min_count, care_type):

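        # initialize fields
        self.negatives = []  # all negative samples live in one array; each request advances n positions and takes n elements
        self.discards = []  # subsampling, like discarding high-frequency words such as "a" and "the" in NLP
        self.negpos = 0  # current position in the negative-sample array
        self.care_type = care_type
        self.word2id = dict()
        self.id2word = dict()
        self.sentences_count = 0  # number of sequences
        self.token_count = 0  # number of tokens after splitting
        self.word_frequency = dict()  # token frequency
        self.inputFileName = dataset.fn
        # run the helpers that fill in the fields above
        # read the meta-path file
        self.read_words(min_count)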
        self.initTableNegatives()
        self.initTableDiscards()

    # read the data and drop nodes whose frequency is below min_count
    def read_words(self, min_count):
        word_frequency = dict()
        for line in open(self.inputFileName, encoding="ISO-8859-1"):
            line = line.split()
            if len(line) > 1:
                # count sentences; here one sentence is one random walk (meta-path sequence)
                self.sentences_count += 1
                for word in line:
                    if len(word) > 0:
                        # count all tokens/nodes, including repeats
                        self.token_count += 1
                        # count frequencies
                        word_frequency[word] = word_frequency.get(word, 0) + 1
                        # report reading progress
                        if self.token_count % 1000000 == 0:
                            print("Read " + str(int(self.token_count / 1000000)) + "M words.")

        wid = 0
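        # w is the word, c is its frequency
        for w, c in word_frequency.items():
            if c < min_count:
                continue
            # after dropping low-frequency words, rebuild the word <-> id mapping; wid is the new id
            self.word2id[w] = wid
            self.id2word[wid] = w
            # store the frequency of each kept word in self.word_frequency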
            self.word_frequency[wid] = c
            wid += 1

        self.word_count = len(self.word2id)
        print("Total embeddings: " + str(len(self.word2id)))

    def initTableDiscards(self):
        # get a frequency table for sub-sampling. Note that the frequency is adjusted by
        # sub-sampling tricks.
        # subsampling as in word2vec
        # the most frequent words (e.g. "in", "the", "a") usually carry less information than rare words.
        # http://d0evi1.com/word2vec-subsampling/
        t = 0.0001
        f = np.array(list(self.word_frequency.values())) / self.token_count
        self.discards = np.sqrt(t / f) + (t / f)

    def initTableNegatives(self):
        # get a table for negative sampling, if word with index 2 appears twice, then 2 will be listed
        # in the table twice.
        # first list the negatives in proportion to frequency, e.g. self.negatives=[1 1 1 2 2 2 2 3 3 3 4 4 4]
        # then shuffle; later, slices of length n are taken as n negative samples
        pow_frequency = np.array(list(self.word_frequency.values())) ** 0.75
        words_pow = sum(pow_frequency)
        # ratio is the probability of each word/node being drawn as a negative
        ratio = pow_frequency / words_pow
        # number of table slots each word/node receives according to ratio
        count = np.round(ratio * DataReader.NEGATIVE_TABLE_SIZE)
        # build the negative table, as in the example [1 1 1 2 2 2 2 3 3 3 4 4 4] above
        for wid, c in enumerate(count):
            self.negatives += [wid] * int(c)
        self.negatives = np.array(self.negatives)
        np.random.shuffle(self.negatives)
        self.sampling_prob = ratio  # stored but not used

    def getNegatives(self, target, size):  # TODO check equality with target
        if self.care_type == 0:
            # negpos starts at 0
            # take size negative samples
            response = self.negatives[self.negpos:self.negpos + size]
            # advance negpos for the next request; the modulo wraps around to the start of the table
            self.negpos = (self.negpos + size) % len(self.negatives)
            # if the slice ran past the end of the table, top it up from the start (index 0)
            if len(response) != size:
                return np.concatenate((response, self.negatives[0:self.negpos]))
        return response


# -----------------------------------------------------------------------------------------------------------------

# This class extends PyTorch's Dataset and implements three methods
class Metapath2vecDataset(Dataset):
    def __init__(self, data, window_size):
        # read in data, window_size and input filename
        self.data = data
        self.window_size = window_size
        self.input_file = open(data.inputFileName, encoding="ISO-8859-1")

    def __len__(self):
        # return the number of walks
        # number of sentences; here one sentence is one random walk (meta-path sequence)
        return self.data.sentences_count

    def __getitem__(self, idx):
        # return the list of pairs (center, context, 5 negatives)
        # center is the current node, context is a neighboring node in the walk
        while True:
            line = self.input_file.readline()
            if not line:
                self.input_file.seek(0, 0)  # at end of file, rewind to the start and keep reading
                line = self.input_file.readline()

            if len(line) > 1:
                words = line.split()

                if len(words) > 1:
                    # "w in self.data.word2id": the line is split into words and each is looked up in the
                    # frequency-filtered vocabulary; being present means its frequency is at least min_count
                    # "discards": the word2vec subsampling test
                    word_ids = [self.data.word2id[w] for w in words if
                                w in self.data.word2id and np.random.rand() < self.data.discards[self.data.word2id[w]]]

                    # pack the training data as (u, v, [n1, n2, n3, n4, n5])
                    # where n1..n5 are the 5 negative samples
                    pair_catch = []
                    for i, u in enumerate(word_ids):
                        # the range of v is set by window_size: window_size to the left and window_size to the right
                        # max(...) clamps the left edge at 0 so the window does not run off the start
                        start = max(i - self.window_size, 0)
                        # enumerate from start so that j is an absolute position that can be compared with i
                        for j, v in enumerate(word_ids[start:i + self.window_size + 1], start):
                            assert u < self.data.word_count
                            assert v < self.data.word_count
                            if i == j:
                                continue
                            pair_catch.append((u, v, self.data.getNegatives(v, 5)))
                    return pair_catch

    # static method that converts the data into tensors
    @staticmethod
    def collate(batches):
        all_u = [u for batch in batches for u, _, _ in batch if len(batch) > 0]
        all_v = [v for batch in batches for _, v, _ in batch if len(batch) > 0]
        all_neg_v = [neg_v for batch in batches for _, _, neg_v in batch if len(batch) > 0]

        return torch.LongTensor(all_u), torch.LongTensor(all_v), torch.LongTensor(all_neg_v)
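
The two frequency-based tables built in reading_data.py can be sketched without NumPy to see what they do. The node counts below are hypothetical, and the function names are made up for this example; the formulas are the ones used by initTableDiscards (sqrt(t/f) + t/f) and initTableNegatives (frequency raised to the power 0.75).

```python
import math


def keep_probability(freq, t=1e-4):
    """Subsampling threshold sqrt(t/f) + t/f; values >= 1 mean the node is always kept."""
    return math.sqrt(t / freq) + t / freq


def negative_sampling_distribution(counts, power=0.75):
    """Probability of drawing each node as a negative: count**power, normalized."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}


# Hypothetical occurrence counts in the walk corpus.
counts = {"vVLDB": 9000, "vICDE": 900, "aJiaweiHan": 100}
total_tokens = sum(counts.values())

for name, c in counts.items():
    print(name, round(keep_probability(c / total_tokens), 4))

print(negative_sampling_distribution(counts))
```

Frequent nodes get a lower keep probability, while the 0.75 exponent gives rare nodes a larger share of the negative table than their raw frequency would.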

model.py

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import init

"""
    u_embedding: Embedding for center word.
    v_embedding: Embedding for neighbor words.
"""

# extends PyTorch's nn.Module
class SkipGramModel(nn.Module):

    def __init__(self, emb_size, emb_dimension):
        super(SkipGramModel, self).__init__()
        self.emb_size = emb_size
        self.emb_dimension = emb_dimension
        # N x emb_dimension matrix holding the final (center) node embeddings
        self.u_embeddings = nn.Embedding(emb_size, emb_dimension, sparse=True)
        # N x emb_dimension matrix holding the context node embeddings
        self.v_embeddings = nn.Embedding(emb_size, emb_dimension, sparse=True)

        # initialization
        initrange = 1.0 / self.emb_dimension
        init.uniform_(self.u_embeddings.weight.data, -initrange, initrange)
        init.constant_(self.v_embeddings.weight.data, 0)

    def forward(self, pos_u, pos_v, neg_v):
        # look up the embeddings for this batch
        # emb_neg_v: [batch, 5, dim]
        # emb_u: [batch, dim]
        # emb_u.unsqueeze(2): [batch, dim, 1]
        emb_u = self.u_embeddings(pos_u)
        emb_v = self.v_embeddings(pos_v)
        emb_neg_v = self.v_embeddings(neg_v)

        # score has shape [batch]
        score = torch.sum(torch.mul(emb_u, emb_v), dim=1)
        score = torch.clamp(score, max=10, min=-10)  # clamp for numerical stability
        score = -F.logsigmoid(score)

        # torch.bmm(emb_neg_v, emb_u.unsqueeze(2)): [batch, 5, dim] x [batch, dim, 1] -> [batch, 5, 1]
        # torch.bmm(emb_neg_v, emb_u.unsqueeze(2)).squeeze(): [batch, 5]
        # neg_score: [batch, 5]
        neg_score = torch.bmm(emb_neg_v, emb_u.unsqueeze(2)).squeeze()
        neg_score = torch.clamp(neg_score, max=10, min=-10)
        # after the sum below neg_score has shape [batch], matching score; corresponds to Equation 6 in the paper
        neg_score = -torch.sum(F.logsigmoid(-neg_score), dim=1)

        return torch.mean(score + neg_score)

    def save_embedding(self, id2word, file_name):
        embedding = self.u_embeddings.weight.cpu().data.numpy()
        with open(file_name, 'w') as f:
            # first line: number of nodes and dimension
            f.write('%d %d\n' % (len(id2word), self.emb_dimension))
            # each node followed by its embedding values
            for wid, w in id2word.items():
                e = ' '.join(map(lambda x: str(x), embedding[wid]))
                f.write('%s %s\n' % (w, e))

metapath2vec.py

import torch
import argparse
import torch.optim as optim
from torch.utils.data import DataLoader

from tqdm import tqdm  # progress bar

from reading_data import DataReader, Metapath2vecDataset
from model import SkipGramModel
from download import AminerDataset, CustomDataset



class Metapath2VecTrainer:

    def __init__(self, args):
        if args.aminer:
            dataset = AminerDataset(args.path)  # the very large original dataset
        else:
            dataset = CustomDataset(args.path)  # custom dataset
        # read the data
        self.data = DataReader(dataset, args.min_count, args.care_type)
        dataset = Metapath2vecDataset(self.data, args.window_size)
        # zhuanlan.zhihu.com/p/30385675
        # read the large file batch by batch instead of loading it all at once
        self.dataloader = DataLoader(dataset, batch_size=args.batch_size,
                                     shuffle=True, num_workers=args.num_workers, collate_fn=dataset.collate)

        # set parameters
        self.output_file_name = args.output_file
        self.emb_size = len(self.data.word2id)
        self.emb_dimension = args.dim
        self.batch_size = args.batch_size
        self.iterations = args.iterations
        self.initial_lr = args.initial_lr
        # emb_size: total number of nodes N; emb_dimension: embedding dimension, e.g. 128
        self.skip_gram_model = SkipGramModel(self.emb_size, self.emb_dimension)

        self.use_cuda = torch.cuda.is_available()
        self.device = torch.device("cuda" if self.use_cuda else "cpu")
        if self.use_cuda:
            self.skip_gram_model.cuda()

    def train(self):

        for iteration in range(self.iterations):
            print("\n\n\nIteration: " + str(iteration + 1))
            # Adam optimizer (sparse variant)
            optimizer = optim.SparseAdam(self.skip_gram_model.parameters(), lr=self.initial_lr)
            scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, len(self.dataloader))

            running_loss = 0.0
            # train batch by batch
            for i, sample_batched in enumerate(tqdm(self.dataloader)):

                if len(sample_batched[0]) > 1:
                    pos_u = sample_batched[0].to(self.device)
                    pos_v = sample_batched[1].to(self.device)
                    neg_v = sample_batched[2].to(self.device)

                    scheduler.step()
                    optimizer.zero_grad()
                    # compute the loss
                    loss = self.skip_gram_model.forward(pos_u, pos_v, neg_v)
                    # backpropagate and update the parameters
                    loss.backward()
                    optimizer.step()

                    running_loss = running_loss * 0.9 + loss.item() * 0.1
                    if i > 0 and i % 500 == 0:
                        print(" Loss: " + str(running_loss))

            self.skip_gram_model.save_embedding(self.data.id2word, self.output_file_name)


if __name__ == '__main__':
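    # 1. Set the model parameters
    # 1) input dataset, graph-related parameters, skip-gram parameters, and model parameters such as embedding dimension and batch_size
    # 2) input and output:
    # input file 'net_dbis/output_path.txt'
    # output file 'result.txt'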
    parser = argparse.ArgumentParser(description="Metapath2vec")
    #parser.add_argument('--input_file', type=str, help="input_file")
    parser.add_argument('--aminer', action='store_true', help='Use AMiner dataset')
    parser.add_argument('--path', type=str, help="input_path")
    parser.add_argument('--output_file', type=str, help='output_file')
    parser.add_argument('--dim', default=128, type=int, help="embedding dimensions")
    parser.add_argument('--window_size', default=7, type=int, help="context window size")  # skip-gram context window; default 7, i.e. 7 on each side
    parser.add_argument('--iterations', default=5, type=int, help="iterations")
    parser.add_argument('--batch_size', default=50, type=int, help="batch size")
    # 0: metapath2vec; 1: metapath2vec++
    parser.add_argument('--care_type', default=0, type=int, help="if 1, heterogeneous negative sampling, else normal negative sampling")
    parser.add_argument('--initial_lr', default=0.025, type=float, help="learning rate")
    parser.add_argument('--min_count', default=5, type=int, help="min count")  # minimum frequency; nodes appearing fewer times in the sequences are dropped
    parser.add_argument('--num_workers', default=16, type=int, help="number of workers")  # number of worker processes
    args = parser.parse_args()
    m2v = Metapath2VecTrainer(args)
    m2v.train()  # start training

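Q&A

Q: How should a node's context be understood? Does it mean the node's neighbors?
Neighbors are not necessarily the context. The previous three papers are all based on random walks, and a random walk produces sequences of nodes; the nodes that precede and follow a given node in those sequences are its context. The context depends on two things: the walk strategy (BFS, DFS, whether backtracking is allowed, whether a given meta-path must be followed) and the window size.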
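
To illustrate how the window size determines the context, here is a minimal sketch that extracts (center, context) pairs from one walk. The function name `context_pairs` is hypothetical, and the walk is a short excerpt of node names from the sample sequence shown earlier.

```python
def context_pairs(walk, window_size):
    """All (center, context) pairs within `window_size` positions on either side."""
    pairs = []
    for i, center in enumerate(walk):
        start = max(i - window_size, 0)
        stop = min(i + window_size + 1, len(walk))
        for j in range(start, stop):
            if j != i:  # a node is not its own context
                pairs.append((center, walk[j]))
    return pairs


walk = ["vVLDB", "aBengChinOoi", "vDASFAA", "aAlanL.Tharp", "vICDE"]

for pair in context_pairs(walk, 1):
    print(pair)
```

With a window of 1 only adjacent nodes in the walk are paired, so a venue is paired with authors; with a window of 2 the venue two steps away also becomes context, even though it is not a graph neighbor.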