
by Zohar Komarovsky

How node2vec works — and what it can do that word2vec can’t

How to think about your data differently

In the last couple of years, deep learning (DL) has become the main enabler for applications in many domains such as vision, NLP, audio, and clickstream data. Recently, researchers started to successfully apply deep learning methods to graph datasets in domains like social networks, recommender systems, and biology, where data is inherently structured in a graphical way.

So how do Graph Neural Networks work? Why do we need them?

The Premise of Deep Learning

In machine learning tasks that involve graphical data, we usually want to describe each node in the graph in a way that allows us to feed it into some machine learning algorithm. Without DL, one would have to manually extract features, such as the number of neighbors a node has. But this is a laborious job.

This is where DL shines. It automatically exploits the structure of the graph in order to extract features for each node. These features are called embeddings.

The interesting thing is that even if you have absolutely no information about the nodes, you can still use DL to extract embeddings. The structure of the graph, that is, the connectivity patterns, holds valuable information.

So how can we use the structure to extract information? Can the context of each node within the graph really help us?

Learning from Context

One well-known algorithm that extracts information about entities using context alone is word2vec. The input to word2vec is a set of sentences, and the output is an embedding for each word. Similarly to the way text describes the context of each word via the words surrounding it, graphs describe the context of each node via neighbor nodes.

While words in text appear in a linear order, in graphs that’s not the case. There’s no natural order between a node’s neighbors. So we can’t use word2vec… Or can we?

Reduction like a Badass Mathematician

We can apply a reduction from the graphical structure of our data to a linear structure, such that the information encoded in the graphical structure isn’t lost. By doing so, we’ll be able to use good old word2vec.

The key point is to perform random walks in the graph. Each walk starts at a random node and performs a series of steps, where each step goes to a random neighbor. Each random walk forms a sentence that can be fed into word2vec. This algorithm is called node2vec. There are more details in the process, which you can read about in the original paper.

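To make that concrete, here is a minimal sketch of the unbiased version of this idea in Python, assuming networkx and gensim are available; the function and variable names are illustrative. The real node2vec additionally biases each step with return and in-out parameters (commonly called p and q), which are covered in the original paper.

```python
import random

import networkx as nx
from gensim.models import Word2Vec

def generate_walks(graph, walks_per_node=10, walk_length=20):
    """Generate uniform random walks; each walk plays the role of a sentence."""
    walks = []
    nodes = list(graph.nodes())
    for _ in range(walks_per_node):
        random.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(node) for node in walk])
    return walks

# Toy graph standing in for the real data; any networkx graph works here.
graph = nx.karate_club_graph()
walks = generate_walks(graph)

# Feed the walks to word2vec exactly as if they were sentences.
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, workers=2)
print(model.wv["0"][:5])  # the first few dimensions of node 0's embedding
```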

Case study

Taboola’s content recommender system gathers lots of data, some of which can be represented in a graphical manner. Let’s inspect one type of data as a case study for using node2vec.

Taboola recommends articles in a widget shown on publishers’ websites:

Each article has named entities — the entities described by the title. For example, the item “the cutest dogs on the planet” contains the entities “dog” and “planet”. Each named entity can appear in many different items.

We can describe this relationship using a graph in the following way: each node will be a named entity. There will be an edge between two nodes if the two named entities appear in the same item:

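A rough sketch of that construction, with a small made-up set of items purely for illustration:

```python
from itertools import combinations

import networkx as nx

# Hypothetical items, each represented by its list of named entities.
items = [
    ["dog", "planet"],
    ["dog", "cat", "veterinarian"],
    ["planet", "telescope"],
]

graph = nx.Graph()
for entities in items:
    graph.add_nodes_from(entities)
    # Add an edge between every pair of entities that co-occur in the same item.
    for a, b in combinations(entities, 2):
        graph.add_edge(a, b)
```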

Now that we are able to describe our data in a graphical manner, let’s run node2vec to see what insights we can learn from the data. You can find the working code here.

After learning node embeddings, we can use them as features for a downstream task, e.g. CTR (Click Through Rate) prediction. Although it could benefit the model, it’ll be hard to understand the qualities learned by node2vec.

Another option would be to cluster similar embeddings together using K-means, and color the nodes according to their associated cluster:

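A sketch of that step, assuming scikit-learn is available and that `model` is a word2vec model trained, as sketched earlier, on walks over the entity graph; the number of clusters is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.cluster import KMeans

# One row per node, in the order gensim stores the vocabulary.
nodes = list(model.wv.index_to_key)
vectors = np.array([model.wv[node] for node in nodes])

# Cluster the embeddings and map every node to its cluster id,
# which can then be used to color the nodes when plotting the graph.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(vectors)
node_to_cluster = dict(zip(nodes, kmeans.labels_))
```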

Cool! The clusters captured by node2vec seem to be homogeneous. In other words, nodes that are close to each other in the graph are also close to each other in the embedding space. Take for instance the orange cluster — all of its named entities are related to basketball.

You might wonder what the benefit of using node2vec is over classical graph algorithms, such as community detection algorithms (e.g., the Girvan-Newman algorithm). Capturing the community each node belongs to can definitely be done using such algorithms; there’s nothing wrong with that.

Actually, that’s exactly feature engineering. And we already know that DL can save you the time of carefully handcrafting such features. So why not enjoy this benefit? We should also keep in mind that node2vec learns high-dimensional embeddings. These embeddings are much richer than mere community membership.

Taking Another Approach

Using node2vec in this use case might not be the first idea that comes to mind. One might suggest simply using word2vec, where each sentence is the sequence of named entities inside a single item. In this approach, we don’t treat the data as having a graphical structure. So what’s the difference between this approach (which is valid) and node2vec?

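In code, that alternative is just word2vec over the per-item entity lists, with no graph involved at all (again a sketch, reusing the hypothetical items from above):

```python
from gensim.models import Word2Vec

# Each item's entity list is treated as one sentence; no random walks are generated.
sentences = [
    ["dog", "planet"],
    ["dog", "cat", "veterinarian"],
    ["planet", "telescope"],
]
item_model = Word2Vec(sentences, vector_size=64, window=5, min_count=0, sg=1)
```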

If we think about it, each sentence we generate in the word2vec approach is a walk in the graph we’ve defined earlier. node2vec also defines walks on the same graph. So they are the same, right? Let’s have a look at the clusters we get by the word2vec approach:

Now the "basketball" cluster is less homogeneous: it contains both orange and blue nodes. The named entity "Basketball", for example, was colored orange, while the basketball players "Lebron James" and "Kobe Bryant" were colored blue!

But why did this happen?

In this approach, each walk in the graph is composed only of named entities that appear together in a single item. It means we are limited to walks that don’t go further than distance 1 from the starting node. In node2vec, we don’t have that limit. Since each approach uses a different kind of walks, the learned embeddings capture a different kind of information.

To make it more concrete, consider the following example. Say we have two items — one with named entities A, B, C and another with D, B, E. These items induce the following graph:

In the simple word2vec approach we’ll generate the following sentences: [A, B, C] and [D, B, E]. In the node2vec approach, we could also get sentences like [A, B, E]. If we feed the latter into the training process, we’ll learn that E and C are interchangeable: the prefix [A, B] will be able to predict both C and E. Therefore, C and E will get similar embeddings, and will be clustered together.

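The toy example can be spelled out directly; the edges below are exactly those induced by the two items, and a graph walk is free to cross from one item to the other through B:

```python
import networkx as nx

# Two items: one with entities A, B, C and one with D, B, E.
items = [["A", "B", "C"], ["D", "B", "E"]]

graph = nx.Graph()
for entities in items:
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            graph.add_edge(a, b)

# The per-item "sentences" of the simple word2vec approach never cross items:
per_item_sentences = items  # [["A", "B", "C"], ["D", "B", "E"]]

# A graph walk, by contrast, can pass through B from one item to the other,
# e.g. A -> B -> E, so C and E end up sharing the context [A, B].
print(sorted(graph.neighbors("B")))  # ['A', 'C', 'D', 'E']
```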

Takeaways

Using the right data structure to represent your data is important. Each data structure implies a different learning algorithm, or in other words, introduces a different inductive bias.

Identifying that your data has a certain structure, so that you can use the right tool for the job, might be challenging.

Since so many real-world datasets are naturally represented as graphs, we think Graph Neural Networks are a must-have in our toolbox as data scientists.

Originally published at engineering.taboola.com by me and Yoel Zeldes.

Translated from: https://www.freecodecamp.org/news/how-to-think-about-your-data-in-a-different-way-b84306fc2e1d/
