Introduction to Graph Neural Networks with DeepWalk

Introduction

Graph Neural Networks are the current hot topic [1], and this interest is surely justified, as GNNs are all about latent representations of graphs in vector space. Representing an entity as a vector is nothing new. There are many examples in NLP, like word2vec and GloVe embeddings, which transform a word into a vector. What makes such representations powerful is that (1) these vectors incorporate a notion of similarity, i.e. two words that are similar to each other tend to be closer in the vector space (their dot product is large), and (2) they can be used in diverse downstream problems like classification, clustering, etc. This is what makes GNNs interesting: while there are many solutions to embed a word or an image as a vector, GNNs laid the foundation to do so for graphs. In this post, we will discuss one of the initial and most basic approaches to do so, DeepWalk [2].

Graphs 101

Graphs or networks are used to represent relational data, where the main entities are called nodes and a relationship between two nodes is represented by an edge. A graph can be made more complex by adding multiple types of nodes, multiple types of edges, directions on edges, or even weights on edges.

Figure 1: Karate dataset visualization @ Network repository [3].

One example of a graph is shown in Figure 1. The graph is the Karate dataset [4], which represents the social information of the members of a university karate club. Each node represents a member of the club, and each edge represents a tie between two members. The info bar on the left states several graph properties like the number of nodes, number of edges, density, degree, etc. Network repository [3] contains many such networks from different fields and domains and provides visualization tools and basic stats as shown above.

Notion of similar nodes

As the idea behind vector embeddings is to highlight similarities, we will consider some definitions of similar nodes. Two nodes can be called similar in several ways, for example if they have a similar in-degree, out-degree, average degree, or number of neighbors. One interesting notion is to consider the neighbors of nodes: the more common neighbors two nodes share, the more similar they are. In plain words, a node is defined by the company it keeps, and if two nodes keep very similar company, they are very similar. This idea of representing an entity by its locality is not new. Word embeddings in NLP are built on the motto that "a word is represented by the context it keeps". With this much similarity between the two fields, the obvious first instinct was to leverage the existing techniques in NLP and port them to the graph domain by somehow converting the idea of the context of a word into the neighbors of a node. One such existing technique is word2vec, which we will discuss briefly.
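
As a small, self-contained illustration of this neighborhood-based notion of similarity (not part of DeepWalk itself), the overlap of two nodes' neighbor sets can be scored with a Jaccard measure using networkx; the node pairs below are just examples:

import networkx as nx

# load the Karate club graph bundled with networkx
G = nx.karate_club_graph()

def neighbor_jaccard(g, u, v):
    # Jaccard similarity of the neighbor sets of nodes u and v
    nu, nv = set(g.neighbors(u)), set(g.neighbors(v))
    return len(nu & nv) / len(nu | nv)

# nodes that share many neighbors score higher than unrelated nodes
print(neighbor_jaccard(G, 0, 1), neighbor_jaccard(G, 0, 33))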

Word2Vec

A detour through word2vec (w2v) is required to completely appreciate and understand the idea behind DeepWalk. Word2Vec is a word embedding technique that represents a word as a vector. Each vector can be thought of as a point in $R^{D}$ space, where $D$ is the dimension of each vector. One thing to note is that these vectors are not randomly spread out in the vector space. They follow certain properties, such that words that are similar, like cat and tiger, are relatively closer to each other than a completely unrelated word like tank. In the vector space, this means their cosine similarity score is higher. Along with this, we can even observe famous analogies like king - man + woman = queen, which can be replicated by vector arithmetic on these words' representation vectors.
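
To make the geometry concrete, here is a tiny numpy sketch with made-up 3D vectors (the numbers are purely illustrative and are not real word2vec output) showing how cosine similarity captures this notion of closeness:

import numpy as np

def cosine(a, b):
    # cosine similarity: close to 1 means same direction, close to 0 means unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# purely illustrative 3D vectors, not actual word2vec embeddings
cat = np.array([0.9, 0.8, 0.1])
tiger = np.array([0.85, 0.75, 0.2])
tank = np.array([-0.7, 0.1, 0.9])

print(cosine(cat, tiger))  # high: similar words sit close together
print(cosine(cat, tank))   # low: unrelated words are far apart

The same arithmetic view explains the analogy: on real embeddings, the vector king - man + woman lands close to queen under cosine similarity.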

Figure 2: Vector space showing the positions of the words' vectors and the relationships between them, to showcase the analogy king - man + woman = queen.

While such a representation is not unique to w2v, its major contribution was to provide a simple and fast neural-network-based word embedder. To do so, w2v framed training as a classification problem: given one word, the network tries to answer which word is most probable to be found in the context of the given word. This technique is formally called skip-gram, where the input is the middle word and the output is a context word. It is implemented as a neural network with one hidden layer, where the input word is fed in one-hot encoded format and the output is a softmax that should ideally assign large values to the context words.

Figure 3: Skip-gram architecture (taken from Lil'Log [7]). It is a NN with one hidden layer, with one-hot encoded input and output. The input-to-hidden weight matrix contains the word embeddings.

The training data is prepared by sliding a window (of some window size) across a large corpus of text (which could be articles, novels, or even the complete Wikipedia), and for each such window the middle word is the input word and the remaining words in the context are the output words. For each input word, we want its context words to be close in vector space and the remaining words to be far. If two input words have similar context words, their vectors will also end up close. This is the intuition behind Word2Vec, which it achieves efficiently by using negative sampling. After training we can observe something interesting: the weights between the input and hidden layers of the NN now capture the notions we wanted in our word embeddings, such that words with the same context have similar values across the vector dimensions. These weights are used as the word embeddings.
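
To see how such training pairs arise, here is a minimal sketch that slides a window over a toy sentence and emits (input word, context word) pairs; the sentence and window size are made up for illustration:

# toy corpus and window size, for illustration only
tokens = "the quick brown fox jumps over the lazy dog".split()
window = 2

pairs = []
for i, center in enumerate(tokens):
    # every word within `window` positions of the center word is a context word
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((center, tokens[j]))

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]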

Figure 4: Heatmap visualization of 5D word embeddings from Wevi [5]. Color denotes the value of cells.

The result in Figure 4 comes from training 5D word embeddings in Wevi [5], a cool interactive w2v demo. As expected, words like (juice, milk, water) and (orange, apple) have similar kinds of vectors (some dimensions are equally lit, red or blue). Interested readers can go to [7] for a detailed understanding of the architecture and the math. [5] is also suggested for an excellent visualization of the engine behind word2vec.

DeepWalk

DeepWalk employs the same training technique as w2v, i.e. skip-gram. But one important piece remains: creating training data that captures the notion of context in graphs. This is done with the random walk technique, where we start from one node and randomly move to one of its neighbors, repeating this step $L$ times, where $L$ is the length of the random walk. After this, we restart the process from the next node. If we do this for all nodes (and $M$ times for each node), we have in some sense transformed the graph structure into a text-like corpus used to train w2v, where each node acts as a word and its neighbors in the walks define its context.
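
A bare-bones sketch of this walk generation is shown below; real implementations add node shuffling, parallelism, and other optimizations, so treat this only as an illustration:

import random
import networkx as nx

def generate_walks(g, num_walks, walk_length):
    # uniform random walks: num_walks walks of walk_length nodes starting from every node
    walks = []
    for _ in range(num_walks):
        for start in g.nodes():
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(g.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])  # stringified, like a sentence
    return walks

# 34 nodes x 10 walks each = 340 "sentences" for skip-gram training
walks = generate_walks(nx.karate_club_graph(), num_walks=10, walk_length=40)
print(len(walks), walks[0][:5])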

Implementation 1: Author's code

The DeepWalk authors provide a Python implementation here. Installation details along with other prerequisites are provided in the readme (Windows users should be wary of some installation and execution issues). The CLI exposes several algorithmic and optimization parameters, such as:

  • input: the path of the input file which contains the graph information. A graph can be stored in several formats; some well-known ones (supported by the code) are the adjacency list (a node followed by all its neighbors) and the edge list (node-node pairs that share an edge).

  • number-walks: the number of random walks taken for each node.

  • representation-size: the dimension of the final embedding of each node. Also the size of the hidden layer in the skip-gram model.

  • walk-length: the length of each random walk.

  • window-size: the context window size in the skip-gram training.

  • workers: optimization parameter defining the number of independent processes to spawn for the training.

  • output: the path of the output embedding file.

The authors have also provided example graphs, one of which is our Karate club dataset. It is stored in the adjacency list format.

Figure 5: First 5 rows of the adjacency list of the Karate club dataset. Nodes are represented as numbers. In each row, the first node is the central node and the remaining nodes are its neighbors (they share an edge with it).

Now let's read the graph data and create node embeddings by running:

deepwalk --input example_graphs/karate.adjlist --number-walks 10
--representation-size 64 --walk-length 40 --window-size 5
--workers 8 --output trained_model/karate.embeddings

This performs the end-to-end pipeline, taking care of loading the graph from the file, generating random walks, and finally training the skip-gram model on the walk data. By running it with the additional --max-memory-data-size 0 param, the script also stores the walk data, as shown below.
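
For reference, this corresponds to rerunning the earlier command with the extra flag appended (same parameters as before):

deepwalk --input example_graphs/karate.adjlist --number-walks 10
--representation-size 64 --walk-length 40 --window-size 5
--workers 8 --output trained_model/karate.embeddings
--max-memory-data-size 0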

Figure 6: First 5 lines of the random walk corpus generated for the Karate club dataset. Nodes are represented as numbers. Each line represents one random walk starting from the first node. As we set walk length = 40, there are 40 nodes per line (walk). Also, as we set number of walks = 10 and total nodes = 34, a total of 10*34 = 340 random walks are generated.

Finally, we get the output embedding file, which contains the vector embedding for each node in the graph. The file looks like this:

Figure 7: First 3 rows of the node embedding output. The first line is a header with the node count and the embedding dimension. From the second line onward, the first number is the node name and the subsequent numbers are that node's embedding vector.
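
This is the standard word2vec text format, so the embeddings can be loaded back for downstream use, for example with gensim; a minimal sketch, assuming gensim is installed and using the output path from the command above:

from gensim.models import KeyedVectors

# load the node embeddings written by DeepWalk (word2vec text format)
node_vectors = KeyedVectors.load_word2vec_format("trained_model/karate.embeddings")
# vector of node "1" and the nodes most similar to it by cosine similarity
print(node_vectors["1"])
print(node_vectors.most_similar("1"))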

Implementation 2: KarateClub

A much simpler API is provided by the more recently released Python implementation KarateClub [6]. To do the same set of actions, all we need is the following.

# import libraries
import networkx as nx
from karateclub import DeepWalk
# load the karate club dataset
G = nx.karate_club_graph()
# load the DeepWalk model and set parameters
dw = DeepWalk(dimensions=64)
# fit the model
dw.fit(G)
# extract embeddings
embedding = dw.get_embedding()

The DeepWalk class also exposes the same parameters as the author's code and can be tweaked to run the desired experiment.

Figure 8: Parameters exposed by DeepWalk implementation in KarateClub [6].

Experimentation

To see DeepWalk in action, we will pick one graph and visualize both the network and the final embeddings. For better understanding, I created a union of 3 complete graphs, with some additional edges to connect them.
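
A rough sketch of how such a toy graph could be built with networkx; the cluster sizes follow Figure 9 below, while the connecting edges are illustrative:

import networkx as nx

# three complete graphs on disjoint node ranges (10 + 10 + 9 = 29 nodes)
G = nx.compose_all([
    nx.complete_graph(range(0, 10)),
    nx.complete_graph(range(10, 20)),
    nx.complete_graph(range(20, 29)),
])
# a few extra edges to loosely connect the three clusters
G.add_edges_from([(0, 10), (10, 20), (20, 0)])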

Figure 9: Union of 3 complete graphs. We can imagine 3 clusters, with nodes 0 to 9 belonging to cluster 1, 10 to 19 to cluster 2, and 20 to 28 to cluster 3.

Now, we will create DeepWalk embeddings of the graph. For this, we can use the KarateClub package; running DeepWalk with default settings gives embeddings of 128 dimensions. To visualize them, I use the dimensionality reduction technique PCA, which scales the embeddings down from $R^{128}$ to $R^{2}$. I will also plot the heatmap of the original 128D embeddings on the side.
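
A rough sketch of this visualization step, assuming embedding holds the (number of nodes x 128) array returned by dw.get_embedding() for this graph:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# project the 128D node embeddings down to 2D
emb_2d = PCA(n_components=2).fit_transform(embedding)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(emb_2d[:, 0], emb_2d[:, 1])  # PCA-reduced node embeddings
ax2.imshow(embedding, aspect="auto")     # heatmap of the raw 128D embeddings
plt.show()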

Figure 10: Left — The PCA reduced (from 128D to 2D) node embeddings of the graph. Right — The heatmap of the original 128D embeddings.

There is a clear segregation of nodes in the left chart, which depicts the vector space of the embeddings. This showcases how DeepWalk can transform a graph from a force-layout visualization to a vector-space visualization while maintaining some of its structural properties. The heatmap also hints at a clear segregation of the graph into 3 clusters.

Another important thing to note is that when the graph is not very complex, we can get by with lower-dimensional embeddings as well. This not only reduces the dimensions but also improves optimization and convergence, as there are fewer skip-gram parameters to train. To show this, we will create embeddings of only size 2, which can be done by setting the parameter on the DeepWalk object: dw = DeepWalk(dimensions=2). We will again visualize the same plots.

Figure 11: Left: The node embeddings (size=2) of the graph. Right: The heatmap of the embeddings.

Both plots again hint at the same number of clusters in the graph, and all this by using less than 2% of the previous dimensions (from 128 down to 2).

Conclusion

As the answer to the analogy NLP - word2vec + GraphNeuralNetworks = ? can arguably be DeepWalk (is it? 🙂), it leads to two interesting points. (1) DeepWalk's impact on GNNs can be seen as analogous to Word2Vec's on NLP, and rightly so, as DeepWalk was one of the first approaches to use neural networks for node embeddings. It was also a cool example of how a proven state-of-the-art technique from one domain (here, NLP) can be ported to and applied in a completely different domain (here, graphs). This leads to the second point: (2) as DeepWalk was published a while ago (in 2014, only 6 years but a lifetime in AI research), there are now many other techniques that do the job better, like Node2Vec or even graph convolutional networks such as GraphSAGE. That said, just as word2vec is the best starting point for NN-based NLP, I think DeepWalk is in the same sense a good beginning for NN-based graph analysis. And hence the topic of this article.

Cheers.

Translated from: https://towardsdatascience.com/introduction-to-graph-neural-networks-with-deepwalk-f5ac25900772
